Commit graph

603 commits

Author SHA1 Message Date
carlmartin 1d9788e42e [Minor] Improve some code in BroadcastTest for short
Using
    val arr1 = (0 until num).toArray
instead of
    val arr1 = new Array[Int](num)
    for (i <- 0 until arr1.length) {
      arr1(i) = i
    }
for short.

Author: carlmartin <carlmartinmax@gmail.com>

Closes #3750 from SaintBacchus/BroadcastTest and squashes the following commits:

43adb70 [carlmartin] Improve some code in BroadcastTest for short
2014-12-22 12:13:53 -08:00
Ernest a7ed6f3cc5 [SPARK-4880] remove spark.locality.wait in Analytics
spark.locality.wait set to 100000 in examples/graphx/Analytics.scala.
Should be left to the user.

Author: Ernest <earneyzxl@gmail.com>

Closes #3730 from Earne/SPARK-4880 and squashes the following commits:

d79ed04 [Ernest] remove spark.locality.wait in Analytics
2014-12-18 15:42:26 -08:00
Kostas Sakellis d6a972b3e4 [SPARK-4774] [SQL] Makes HiveFromSpark more portable
HiveFromSpark read the kv1.txt file from SPARK_HOME/examples/src/main/resources/kv1.txt which assumed
you had a source tree checked out. Now we copy the kv1.txt file to a temporary file and delete it when
the jvm shuts down. This allows us to run this example outside of a spark source tree.

Author: Kostas Sakellis <kostas@cloudera.com>

Closes #3628 from ksakellis/kostas-spark-4774 and squashes the following commits:

6770f83 [Kostas Sakellis] [SPARK-4774] [SQL] Makes HiveFromSpark more portable
2014-12-08 15:44:18 -08:00
Xiangrui Meng 7e758d7092 [FIX][DOC] Fix broken links in ml-guide.md
and some minor changes in ScalaDoc.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
2014-12-04 20:16:35 +08:00
Joseph K. Bradley 469a6e5f3b [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram  etrain  Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
2014-12-04 17:00:06 +08:00
Joseph K. Bradley 657a88835d [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
 * Use train/test split, and compute test error instead of training error.
 * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
2014-12-04 09:57:50 +08:00
Joseph K. Bradley 4ac2151154 [SPARK-4710] [mllib] Eliminate MLlib compilation warnings
Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about name conflict with StreamingKMeans class.

Added import to DecisionTreeRunner to eliminate warning.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3568 from jkbradley/ml-compilation-warnings and squashes the following commits:

64d6bc4 [Joseph K. Bradley] Updated DecisionTreeRunner.scala and StreamingKMeans.scala to eliminate compilation warnings, including renaming StreamingKMeans to StreamingKMeansExample.
2014-12-03 18:50:03 +08:00
wangfei 7b79957879 [SQL] Minor fix for doc and comment
Author: wangfei <wangfei1@huawei.com>

Closes #3533 from scwf/sql-doc1 and squashes the following commits:

962910b [wangfei] doc and comment fix
2014-12-01 14:02:02 -08:00
Sean Owen 5d7fe178b3 SPARK-4170 [CORE] Closure problems when running Scala app that "extends App"
Warn against subclassing scala.App, and remove one instance of this in examples

Author: Sean Owen <sowen@cloudera.com>

Closes #3497 from srowen/SPARK-4170 and squashes the following commits:

4a6131f [Sean Owen] Restore multiline string formatting
a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
2014-11-27 09:03:17 -08:00
q00251598 a51118a34a [SPARK-4535][Streaming] Fix the error in comments
change `NetworkInputDStream` to `ReceiverInputDStream`
change `ReceiverInputTracker` to `ReceiverTracker`

Author: q00251598 <qiyadong@huawei.com>

Closes #3400 from watermen/fix-comments and squashes the following commits:

75d795c [q00251598] change 'NetworkInputDStream' to 'ReceiverInputDStream' && change 'ReceiverInputTracker' to 'ReceiverTracker'
2014-11-25 04:01:56 -08:00
scwf b384119304 [SQL] Fix path in HiveFromSpark
It require us to run ```HiveFromSpark``` in specified dir because ```HiveFromSpark``` use relative path, this leads to ```run-example``` error(http://apache-spark-developers-list.1001551.n3.nabble.com/src-main-resources-kv1-txt-not-found-in-example-of-HiveFromSpark-td9100.html).

Author: scwf <wangfei1@huawei.com>

Closes #3415 from scwf/HiveFromSpark and squashes the following commits:

ed3d6c9 [scwf] revert no need change
b00e20c [scwf] fix path usring spark_home
dbd321b [scwf] fix path in hivefromspark
2014-11-24 12:49:08 -08:00
Xiangrui Meng 15cacc8124 [SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc
There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.

1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
1. GradientBoosting -> GradientBoostedTrees
1. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
1. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
1. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
1. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
1. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
1. doc updates

manishamde jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:

7097251 [Xiangrui Meng] address joseph's comments
98dea09 [Xiangrui Meng] address manish's comments
4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
ea4c467 [Xiangrui Meng] fix unit tests
751da4e [Xiangrui Meng] rename class method train -> run
19030a5 [Xiangrui Meng] update boosting public APIs
2014-11-20 00:48:59 -08:00
tedyu 5f5ac2dafa SPARK-4455 Exclude dependency on hbase-annotations module
pwendell
Please take a look

Author: tedyu <yuzhihong@gmail.com>

Closes #3286 from tedyu/master and squashes the following commits:

e61e610 [tedyu] SPARK-4455 Exclude dependency on hbase-annotations module
7e3a57a [tedyu] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark
2f28b08 [tedyu] Exclude dependency on hbase-annotations module
2014-11-19 00:55:39 -08:00
Marcelo Vanzin 397d3aae5b Bumping version to 1.3.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
2014-11-18 21:24:18 -08:00
Adam Pingel e7690ed20a SPARK-2811 upgrade algebird to 0.8.1
Author: Adam Pingel <adam@axle-lang.org>

Closes #3282 from adampingel/master and squashes the following commits:

70c8d3c [Adam Pingel] relocate the algebird example back to example/src
7a9d8be [Adam Pingel] SPARK-2811 upgrade algebird to 0.8.1
2014-11-17 10:47:29 -08:00
Josh Rosen 40eb8b6ef3 [SPARK-2321] Several progress API improvements / refactorings
This PR refactors / extends the status API introduced in #2696.

- Change StatusAPI from a mixin trait to a class.  Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field.  As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example).
- Change the name from SparkStatusAPI to SparkStatusTracker.
- Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group.
- Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext.  This should simplify davies's progress bar code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits:

30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker.
d1b08d8 [Josh Rosen] Add missing newlines
2cc7353 [Josh Rosen] Add missing file.
d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods.
a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group
c47e294 [Josh Rosen] Remove StatusAPI mixin trait.
2014-11-14 23:46:25 -08:00
Sandy Ryza f5f757e4ed SPARK-4375. no longer require -Pscala-2.10
It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits:

0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
cd42d94 [Sandy Ryza] Update doc
f6644c3 [Sandy Ryza] SPARK-4375 take 2
2014-11-14 14:21:57 -08:00
Xiangrui Meng 32218307ed [SPARK-4372][MLLIB] Make LR and SVM's default parameters consistent in Scala and Python
The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0. However, we use average loss instead of total loss in our formulation. Hence regParam=1.0 is definitely too heavy.

In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default.

davies atalwalkar

Author: Xiangrui Meng <meng@databricks.com>

Closes #3232 from mengxr/SPARK-4372 and squashes the following commits:

9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments
d3ba096 [Xiangrui Meng] change 'none' back to None
1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM
2014-11-13 13:54:16 -08:00
Soumitra Kumar 36ddeb7bf8 [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
SPARK-3660 : Initial RDD for updateStateByKey transformation

I have added a sample StatefulNetworkWordCountWithInitial inspired by StatefulNetworkWordCount.

Please let me know if any changes are required.

Author: Soumitra Kumar <kumar.soumitra@gmail.com>

Closes #2665 from soumitrak/master and squashes the following commits:

ee8980b [Soumitra Kumar] Fixed copy/paste issue.
304f636 [Soumitra Kumar] Added simpler version of updateStateByKey API with initialRDD and test.
9781135 [Soumitra Kumar] Fixed test, and renamed variable.
3da51a2 [Soumitra Kumar] Adding updateStateByKey with initialRDD API to JavaPairDStream.
2f78f7e [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
d4fdd18 [Soumitra Kumar] Renamed variable and moved method.
d0ce2cd [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
31399a4 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
4efa58b [Soumitra Kumar] [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
8f40ca0 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
dde4271 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
fdd7db3 [Soumitra Kumar] Adding support of initial value for state update. SPARK-3660 : Initial RDD for updateStateByKey transformation
2014-11-12 12:25:31 -08:00
Xiangrui Meng 4b736dbab3 [SPARK-3530][MLLIB] pipeline and parameters with examples
This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley etrain shivaram and many others who helped on the design, also with help from  marmbrus and liancheng on the Spark SQL side. The design doc can be found at:

https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

**org.apache.spark.ml**

This is a new package with new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, takes leverage on Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms a dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters could be supplied at fit/transform or embedded with components.

1. Param: a strong-typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters

For any component that implements `Params`, user can easily check the doc by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or user can check individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:**

1. run a simple logistic regression job

~~~
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(1.0)
    val model = lr.fit(dataset)
    model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
      .select('label, 'score, 'prediction).collect()
      .foreach(println)
~~~

2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric

~~~
    val lr = new LogisticRegression
    val lrParamMaps = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 100.0))
      .addGrid(lr.maxIter, Array(0, 5))
      .build()
    val eval = new BinaryClassificationEvaluator
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEstimatorParamMaps(lrParamMaps)
      .setEvaluator(eval)
      .setNumFolds(3)
    val bestModel = cv.fit(dataset)
~~~

3. run a pipeline that consists of a standard scaler and a logistic regression component

~~~
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
    val lr = new LogisticRegression()
      .setFeaturesCol(scaler.getOutputCol)
    val pipeline = new Pipeline()
      .setStages(Array(scaler, lr))
    val model = pipeline.fit(dataset)
    val predictions = model.transform(dataset)
      .select('label, 'score, 'prediction)
      .collect()
      .foreach(println)
~~~

4. a simple text classification pipeline, which recognizes "spark":

~~~
    val training = sparkContext.parallelize(Seq(
      LabeledDocument(0L, "a b c d e spark", 1.0),
      LabeledDocument(1L, "b d", 0.0),
      LabeledDocument(2L, "spark f g h", 1.0),
      LabeledDocument(3L, "hadoop mapreduce", 0.0)))
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)
    val test = sparkContext.parallelize(Seq(
      Document(4L, "spark i j k"),
      Document(5L, "l m"),
      Document(6L, "mapreduce spark"),
      Document(7L, "apache hadoop")))
    model.transform(test)
      .select('id, 'text, 'prediction, 'score)
      .collect()
      .foreach(println)
~~~

Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`.

**What are missing now and will be added soon:**

1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~
2. ~~Java examples.~~
3. ~~Store training parameters in trained models.~~
4. (later) Serialization and Python API.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3099 from mengxr/SPARK-3530 and squashes the following commits:

2cc93fd [Xiangrui Meng] hide APIs as much as I can
34319ba [Xiangrui Meng] use local instead local[2] for unit tests
2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema
c9daab4 [Xiangrui Meng] remove mockito version
1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext
6ffc389 [Xiangrui Meng] try to fix unit test
a59d8b7 [Xiangrui Meng] doc updates
977fd9d [Xiangrui Meng] add scala ml package object
6d97fe6 [Xiangrui Meng] add AlphaComponent annotation
731f0e4 [Xiangrui Meng] update package doc
0435076 [Xiangrui Meng] remove ;this from setters
fa21d9b [Xiangrui Meng] update extends indentation
f1091b3 [Xiangrui Meng] typo
228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics
f51cd27 [Xiangrui Meng] rename default to defaultValue
b3be094 [Xiangrui Meng] refactor schema transform in lr
8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing
51f1c06 [Xiangrui Meng] remove leftover code in Transformer
494b632 [Xiangrui Meng] compure score once
ad678e9 [Xiangrui Meng] more doc for Transformer
4306ed4 [Xiangrui Meng] org imports in text pipeline
6e7c1c7 [Xiangrui Meng] update pipeline
4f9e34f [Xiangrui Meng] more doc for pipeline
aa5dbd4 [Xiangrui Meng] fix typo
11be383 [Xiangrui Meng] fix unit tests
3df7952 [Xiangrui Meng] clean up
986593e [Xiangrui Meng] re-org java test suites
2b11211 [Xiangrui Meng] remove external data deps
9fd4933 [Xiangrui Meng] add unit test for pipeline
2a0df46 [Xiangrui Meng] update tests
2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info
27582a4 [Xiangrui Meng] doc changes
73a000b [Xiangrui Meng] add schema transformation layer
6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait
80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer
62ca2bb [Xiangrui Meng] check param parent in set/get
1622349 [Xiangrui Meng] add getModel to PipelineModel
a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer
d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap
c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite
e246f29 [Xiangrui Meng] re-org:
7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline
b95c408 [Xiangrui Meng] remove implicits add unit tests to params
bab3e5b [Xiangrui Meng] update params
fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
6e86d98 [Xiangrui Meng] some code clean-up
2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip]
fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform
3f810cd [Xiangrui Meng] use multi-model training api in cv
5b8f413 [Xiangrui Meng] rename model to modelParams
9d2d35d [Xiangrui Meng] test varargs and chain model params
f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
1ef26e0 [Xiangrui Meng] specialize methods/types for Java
df293ed [Xiangrui Meng] switch to setter/getter
376db0a [Xiangrui Meng] pipeline and parameters
2014-11-12 10:38:57 -08:00
Prashant Sharma daaca14c16 Support cross building for Scala 2.11
Let's give this another go using a version of Hive that shades its JLine dependency.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:

e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
2014-11-11 21:36:48 -08:00
Varadharajan Mukundan 974d334cf0 [SPARK-4047] - Generate runtime warnings for example implementation of PageRank
Based on SPARK-2434, this PR generates runtime warnings for example implementations (Python, Scala) of PageRank.

Author: Varadharajan Mukundan <srinathsmn@gmail.com>

Closes #2894 from varadharajan/SPARK-4047 and squashes the following commits:

5f9406b [Varadharajan Mukundan] [SPARK-4047] - Point users to LogisticRegressionWithSGD and LogisticRegressionWithLBFGS instead of LogisticRegressionModel
252f595 [Varadharajan Mukundan] a. Generate runtime warnings for
05a018b [Varadharajan Mukundan] Fix PageRank implementation's package reference
5c2bf54 [Varadharajan Mukundan] [SPARK-4047] - Generate runtime warnings for example implementation of PageRank
2014-11-10 14:32:29 -08:00
tedyu b32734e12d SPARK-1297 Upgrade HBase dependency to 0.98
pwendell rxin
Please take a look

Author: tedyu <yuzhihong@gmail.com>

Closes #3115 from tedyu/master and squashes the following commits:

2b079c8 [tedyu] SPARK-1297 Upgrade HBase dependency to 0.98
2014-11-10 13:23:33 -08:00
comcmipi 0340c56a92 Update RecoverableNetworkWordCount.scala
Trying this example, I missed the moment when the checkpoint was iniciated

Author: comcmipi <pitonak@fns.uniba.sk>

Closes #2735 from comcmipi/patch-1 and squashes the following commits:

b6d8001 [comcmipi] Update RecoverableNetworkWordCount.scala
96fe274 [comcmipi] Update RecoverableNetworkWordCount.scala
2014-11-10 12:33:48 -08:00
Sean Owen 3a02d416cd SPARK-2548 [STREAMING] JavaRecoverableWordCount is missing
Here's my attempt to re-port `RecoverableNetworkWordCount` to Java, following the example of its Scala and Java siblings. I fixed a few minor doc/formatting issues along the way I believe.

Author: Sean Owen <sowen@cloudera.com>

Closes #2564 from srowen/SPARK-2548 and squashes the following commits:

0d0bf29 [Sean Owen] Update checkpoint call as in https://github.com/apache/spark/pull/2735
35f23e3 [Sean Owen] Remove old comment about running in standalone mode
179b3c2 [Sean Owen] Re-port RecoverableNetworkWordCount to Java example, and touch up doc / formatting in related examples
2014-11-10 11:47:27 -08:00
xiao321 7c9ec529a3 Update JavaCustomReceiver.java
数组下标越界

Author: xiao321 <1042460381@qq.com>

Closes #3153 from xiao321/patch-1 and squashes the following commits:

0ed17b5 [xiao321] Update JavaCustomReceiver.java
2014-11-07 12:56:49 -08:00
Joseph K. Bradley c315d1316c [SPARK-4254] [mllib] MovieLensALS bug fix
Changed code so it does not try to serialize Params.
CC: mengxr 	debasish83 srowen

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3116 from jkbradley/als-bugfix and squashes the following commits:

e575bd8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into als-bugfix
9401b16 [Joseph K. Bradley] changed implicitPrefs so it is not serialized to fix MovieLensALS example bug
2014-11-05 19:51:18 -08:00
Joseph K. Bradley 5b3b6f6f5f [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java
### Summary

* Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
* Added Scala and Java examples for GradientBoostedTrees
* small cleanups and fixes

### Details

GradientBoosting bug fixes (“bug” = bad default options)
* Force boostingStrategy.weakLearnerParams.algo = Regression
* Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
* Only persist data if not yet persisted (since it causes an error if persisted twice)

BoostingStrategy
* numEstimators: renamed to numIterations
* removed subsamplingRate (duplicated by Strategy)
* removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added assertValid() method
* Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy

Strategy (for DecisionTree)
* Changed algo to var (not val) and added BeanProperty, with overload taking String argument
* Added setCategoricalFeaturesInfo method taking Java Map.
* Cleaned up assertValid
* Changed val’s to def’s since parameters can now be changed.

CC: manishamde mengxr codedeft

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3094 from jkbradley/gbt-api and squashes the following commits:

7a27e22 [Joseph K. Bradley] scalastyle fix
52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
e9b8410 [Joseph K. Bradley] Summary of changes
2014-11-05 10:33:13 -08:00
Xiangrui Meng 1a9c6cddad [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.

~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
2014-11-03 22:29:48 -08:00
Sung Chung 56f2c61cde [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training deci...
...sion trees. jkbradley mengxr chouqin Please review this.

Author: Sung Chung <schung@alpinenow.com>

Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:

5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
2014-11-01 16:58:26 -07:00
Joseph E. Gonzalez f4e0b28c85 [SPARK-4142][GraphX] Default numEdgePartitions
Changing the default number of edge partitions to match spark parallelism.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #3006 from jegonzal/default_partitions and squashes the following commits:

a9a5c4f [Joseph E. Gonzalez] Changing the default number of edge partitions to match spark parallelism
2014-11-01 01:18:07 -07:00
freeman 98c556ebbc Streaming KMeans [MLLIB][SPARK-3254]
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.

The PR includes:
- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors

tdas mengxr rezazadeh

Author: freeman <the.freeman.lab@gmail.com>
Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits:

b2e5b4a [freeman] Fixes to docs / examples
078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
0411bf5 [freeman] Change decay parameterization
9f7aea9 [freeman] Style fixes
374a706 [freeman] Formatting
ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
77dbd3f [freeman] Make initialization check an assertion
9cfc301 [freeman] Make random seed an argument
44050a9 [freeman] Simpler constructor
c7050d5 [freeman] Fix spacing
2899623 [freeman] Use pattern matching for clarity
a4a316b [freeman] Use collect
1472ec5 [freeman] Doc formatting
ea22ec8 [freeman] Fix imports
2086bdc [freeman] Log cluster center updates
ea9877c [freeman] More documentation
9facbe3 [freeman] Bug fix
5db7074 [freeman] Example usage for StreamingKMeans
f33684b [freeman] Add explanation and example to docs
b5b5f8d [freeman] Add better documentation
a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
b93350f [freeman] Streaming KMeans with decay
2014-10-31 22:30:12 -07:00
Manish Amde 8602195510 [MLLIB] SPARK-1547: Add Gradient Boosting to MLlib
Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting with AdaBoost to follow soon after this PR is accepted. This is based on work done along with hirakendu that was pending due to decision tree optimizations and random forests work.

Ideally, boosting algorithms should work with any base learners.  This will soon be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default.

Here is the task list:
- [x] Gradient boosting support
- [x] Pluggable loss functions
- [x] Stochastic gradient boosting support – Re-use the BaggedPoint approach used for RandomForest.
- [x] Binary classification support
- [x] Support configurable checkpointing – This approach will avoid long lineage chains.
- [x] Create classification and regression APIs
- [x] Weighted Ensemble Model -- created a WeightedEnsembleModel class that can be used by ensemble algorithms such as random forests and boosting.
- [x] Unit Tests

Future work:
+ Multi-class classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function.
+ BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator after standard API work is completed.

cc: jkbradley hirakendu mengxr etrain atalwalkar chouqin

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>

Closes #2607 from manishamde/gbt and squashes the following commits:

991c7b5 [Manish Amde] public api
ff2a796 [Manish Amde] addressing comments
b4c1318 [Manish Amde] removing spaces
8476b6b [Manish Amde] fixing line length
0183cb9 [Manish Amde] fixed naming and formatting issues
1c40c33 [Manish Amde] add newline, removed spaces
e33ab61 [Manish Amde] minor comment
eadbf09 [Manish Amde] parameter renaming
035a2ed [Manish Amde] jkbradley formatting suggestions
9f7359d [Manish Amde] simplified gbt logic and added more tests
49ba107 [Manish Amde] merged from master
eff21fe [Manish Amde] Added gradient boosting tests
3fd0528 [Manish Amde] moved helper methods to new class
a32a5ab [Manish Amde] added test for subsampling without replacement
781542a [Manish Amde] added support for fractional subsampling with replacement
3a18cc1 [Manish Amde] cleaned up api for conversion to bagged point and moved tests to it's own test suite
0e81906 [Manish Amde] improving caching unpersisting logic
d971f73 [Manish Amde] moved RF code to use WeightedEnsembleModel class
fee06d3 [Manish Amde] added weighted ensemble model
1b01943 [Manish Amde] add weights for base learners
9bc6e74 [Manish Amde] adding random seed as parameter
d2c8323 [Manish Amde] Merge branch 'master' into gbt
2ae97b7 [Manish Amde] added documentation for the loss classes
9366b8f [Manish Amde] minor: using numTrees instead of trees.size
3b43896 [Manish Amde] added learning rate for prediction
9b2e35e [Manish Amde] Merge branch 'master' into gbt
6a11c02 [manishamde] fixing formatting
823691b [Manish Amde] fixing RF test
1f47941 [Manish Amde] changing access modifier
5b67102 [Manish Amde] shortened parameter list
5ab3796 [Manish Amde] minor reformatting
9155a9d [Manish Amde] consolidated boosting configuration and added public API
631baea [Manish Amde] Merge branch 'master' into gbt
2cb1258 [Manish Amde] public API support
3b8ffc0 [Manish Amde] added documentation
8e10c63 [Manish Amde] modified unpersist strategy
f62bc48 [Manish Amde] added unpersist
bdca43a [Manish Amde] added timing parameters
2fbc9c7 [Manish Amde] fixing binomial classification prediction
6dd4dd8 [Manish Amde] added support for log loss
9af0231 [Manish Amde] classification attempt
62cc000 [Manish Amde] basic checkpointing
4784091 [Manish Amde] formatting
78ed452 [Manish Amde] added newline and fixed if statement
3973dd1 [Manish Amde] minor indicating subsample is double during comparison
aa8fae7 [Manish Amde] minor refactoring
1a8031c [Manish Amde] sampling with replacement
f1c9ef7 [Manish Amde] Merge branch 'master' into gbt
cdceeef [Manish Amde] added documentation
6251fd5 [Manish Amde] modified method name
5538521 [Manish Amde] disable checkpointing for now
0ae1c0a [Manish Amde] basic gradient boosting code from earlier branches
2014-10-31 18:57:55 -07:00
Anant e07fb6a41e [SPARK-3838][examples][mllib][python] Word2Vec example in python
This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838

Python example for word2vec
mengxr

Author: Anant <anant.asty@gmail.com>

Closes #2952 from anantasty/SPARK-3838 and squashes the following commits:

87bd723 [Anant] remove stop line
4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs.
3d3c9ee [Anant] Added empty line after python imports
0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
ee4f5f6 [Anant] Fixes from code review comments
c637bcf [Anant] Added word2vec python example to docs
269f31f [Anant] added example in docs
c015b14 [Anant] Added python example for word2vec
2014-10-31 18:33:19 -07:00
Sean Owen bfa614b127 SPARK-4022 [CORE] [MLLIB] Replace colt dependency (LGPL) with commons-math
This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.

Author: Sean Owen <sowen@cloudera.com>

Closes #2928 from srowen/SPARK-4022 and squashes the following commits:

61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
a1a78e0 [Sean Owen] Use Well19937c
31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
5c9c67f [Sean Owen] Additional test fixes from review
d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
2014-10-27 10:53:15 -07:00
anant asthana 677852c3fa Just fixing comment that shows usage
Author: anant asthana <anant.asty@gmail.com>

Closes #2948 from anantasty/patch-1 and squashes the following commits:

d8fea0b [anant asthana] Just fixing comment that shows usage
2014-10-26 14:14:12 -07:00
Josh Rosen 9530316887 [SPARK-2321] Stable pull-based progress / status API
This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)).  For now, I'd like to discuss the basic implementation, API names, and overall interface design.  Once we arrive at a good design, I'll go back and add additional methods to expose more information via these API.

#### Design goals:

- Pull-based API
- Usable from Java / Scala / Python (eventually, likely with a wrapper)
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Don't leak any implementation details, preserving our freedom to change the implementation.

#### Implementation:

- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values.  These interfaces consist entirely of Java-style getter methods.  The interfaces are currently implemented in Java.  I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
-Allow an existing JobProgressListener to be used when constructing a live SparkUI.  This allows us to re-use this listeners in the implementation of this status API.  There are a few reasons why this listener re-use makes sense:
   - The status API and web UI are guaranteed to show consistent information.
   - These listeners are already well-tested.
   - The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.

The progress API methods are implemented in a separate trait that's mixed into SparkContext.  This helps to avoid SparkContext.scala from becoming larger and more difficult to read.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>

Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:

e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
2014-10-25 00:06:57 -07:00
Grace 07e439b4fe [GraphX] Modify option name according to example doc in SynthBenchmark
Now graphx.SynthBenchmark example has an option of iteration number named as "niter". However, in its document, it is named as "niters". The mismatch between the implementation and document causes certain IllegalArgumentException while trying that example.

Author: Grace <jie.huang@intel.com>

Closes #2888 from GraceH/synthbenchmark and squashes the following commits:

f101ee1 [Grace] Modify option name according to example doc
2014-10-24 13:48:08 -07:00
Kousuke Saruta f799700eec [SPARK-4055][MLlib] Inconsistent spelling 'MLlib' and 'MLLib'
Thare are some inconsistent spellings 'MLlib' and 'MLLib' in some documents and source codes.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2903 from sarutak/SPARK-4055 and squashes the following commits:

b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
2014-10-23 09:19:32 -07:00
Karthik bae4ca3bbf Update JavaCustomReceiver.java
Changed the usage string to correctly reflect the file name.

Author: Karthik <karthik.gomadam@gmail.com>

Closes #2699 from namelessnerd/patch-1 and squashes the following commits:

8570e33 [Karthik] Update JavaCustomReceiver.java
2014-10-22 00:08:53 -07:00
Sandy Ryza 6bb56faea8 SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
Author: Sandy Ryza <sandy@cloudera.com>

Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:

48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
2014-10-21 21:53:09 -07:00
Davies Liu 05db2da7dc [SPARK-3952] [Streaming] [PySpark] add Python examples in Streaming Programming Guide
Having Python examples in Streaming Programming Guide.

Also add RecoverableNetworkWordCount example.

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2808 from davies/pyguide and squashes the following commits:

8d4bec4 [Davies Liu] update readme
26a7e37 [Davies Liu] fix format
3821c4d [Davies Liu] address comments, add missing file
7e4bb8a [Davies Liu] add Python examples in Streaming Programming Guide
2014-10-18 19:14:48 -07:00
Joseph K. Bradley 477c6481cc [SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTree
SPARK-3934: When run with a mix of unordered categorical and continuous features, on multiclass classification, RandomForest fails. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered.
Fix: Remove the sanity checks since they are not really needed, and since they would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset).

Added test to RandomForestSuite which failed with old version but now works.

SPARK-3918: Added baggedInput.unpersist at end of training.

Also:
* I removed DTStatsAggregator.isUnordered since it is no longer used.
* DecisionTreeMetadata: Added logWarning when maxBins is automatically reduced.
* Updated DecisionTreeRunner to explicitly fix the test data to have the same number of features as the training data.  This is a temporary fix which should eventually be replaced by pre-indexing both datasets.
* RandomForestModel: Updated toString to print total number of nodes in forest.
* Changed Predict class to be public DeveloperApi.  This was necessary to allow users to create their own trees by hand (for testing).

CC: mengxr  manishamde chouqin codedeft  Just notifying you of these small bug fixes.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2785 from jkbradley/dtrunner-update and squashes the following commits:

9132321 [Joseph K. Bradley] merged with master, fixed imports
9dbd000 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
e116473 [Joseph K. Bradley] Changed Predict class to be public DeveloperApi.
f502e65 [Joseph K. Bradley] bug fix for SPARK-3934
7f3d60f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
ba567ab [Joseph K. Bradley] Changed DTRunner to load test data using same number of features as in training data.
4e88c1f [Joseph K. Bradley] changed RF toString to print total number of nodes
2014-10-17 15:02:57 -07:00
Daoyuan Wang 23f6171d63 [SPARK-3985] [Examples] fix file path using os.path.join
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2834 from adrian-wang/sqlpypath and squashes the following commits:

da7aa95 [Daoyuan Wang] fix file path using path.join
2014-10-17 14:49:44 -07:00
Kun Li be2ec4a91d [SQL]typo in HiveFromSpark
Author: Kun Li <jacky.likun@gmail.com>

Closes #2809 from jackylk/patch-1 and squashes the following commits:

46c926b [Kun Li] typo in HiveFromSpark
2014-10-16 19:00:10 -07:00
NamelessAnalyst e5be4de7bc SPARK-3716 [GraphX] Update Analytics.scala for partitionStrategy assignment
Previously, when the val partitionStrategy was created it called a function in the Analytics object which was a copy of the PartitionStrategy.fromString() method. This function has been removed, and the assignment of partitionStrategy now uses the PartitionStrategy.fromString method instead. In this way, it better matches the declarations of edge/vertex StorageLevel variables.

Author: NamelessAnalyst <NamelessAnalyst@users.noreply.github.com>

Closes #2569 from NamelessAnalyst/branch-1.1 and squashes the following commits:

c24ff51 [NamelessAnalyst] Update Analytics.scala

(cherry picked from commit 5a21e3e7e9)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
2014-10-12 14:19:11 -07:00
giwa 69c67abaa9 [SPARK-2377] Python API for Streaming
This patch brings Python API for Streaming.

This patch is based on work from @giwa

Author: giwa <ugw.gi.world@gmail.com>
Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local>
Author: Davies Liu <davies.liu@gmail.com>
Author: Ken Takagiwa <ken@kens-mbp.gateway.sonic.net>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Ken <ugw.gi.world@gmail.com>
Author: Ken Takagiwa <ugw.gi.world@gmail.com>
Author: Matthew Farrellee <matt@redhat.com>

Closes #2538 from davies/streaming and squashes the following commits:

64561e4 [Davies Liu] fix tests
331ecce [Davies Liu] fix example
3e2492b [Davies Liu] change updateStateByKey() to easy API
182be73 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
02d0575 [Davies Liu] add wrapper for foreachRDD()
bebeb4a [Davies Liu] address all comments
6db00da [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
8380064 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
52c535b [Davies Liu] remove fix for sum()
e108ec1 [Davies Liu]  address comments
37fe06f [Davies Liu] use random port for callback server
d05871e [Davies Liu] remove reuse of PythonRDD
be5e5ff [Davies Liu] merge branch of env, make tests stable.
8071541 [Davies Liu] Merge branch 'env' into streaming
c7bbbce [Davies Liu] fix sphinx docs
6bb9d91 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
54bd92b [Davies Liu] improve tests
c2b31cb [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
7a88f9f [Davies Liu] rollback RDD.setContext(), use textFileStream() to test checkpointing
bd8a4c2 [Davies Liu] fix scala style
7797c70 [Davies Liu] refactor
ff88bec [Davies Liu] rename RDDFunction to TransformFunction
d328aca [Davies Liu] fix serializer in queueStream
6f0da2f [Davies Liu] recover from checkpoint
fa7261b [Davies Liu] refactor
a13ff34 [Davies Liu] address comments
8466916 [Davies Liu] support checkpoint
9a16bd1 [Davies Liu] change number of partitions during tests
b98d63f [Davies Liu] change private[spark] to private[python]
eed6e2a [Davies Liu] rollback not needed changes
e00136b [Davies Liu] address comments
069a94c [Davies Liu] fix the number of partitions during window()
338580a [Davies Liu] change _first(), _take(), _collect() as private API
19797f9 [Davies Liu] clean up
6ebceca [Davies Liu] add more tests
c40c52d [Davies Liu] change first(), take(n) to has the same behavior as RDD
98ac6c2 [Davies Liu] support ssc.transform()
b983f0f [Davies Liu] address comments
847f9b9 [Davies Liu] add more docs, add first(), take()
e059ca2 [Davies Liu] move check of window into Python
fce0ef5 [Davies Liu] rafactor of foreachRDD()
7001b51 [Davies Liu] refactor of queueStream()
26ea396 [Davies Liu] refactor
74df565 [Davies Liu] fix print and docs
b32774c [Davies Liu] move java_import into streaming
604323f [Davies Liu] enable streaming tests
c499ba0 [Davies Liu] remove Time and Duration
3f0fb4b [Davies Liu] refactor fix tests
c28f520 [Davies Liu] support updateStateByKey
d357b70 [Davies Liu] support windowed dstream
bd13026 [Davies Liu] fix examples
eec401e [Davies Liu] refactor, combine TransformedRDD, fix reuse PythonRDD, fix union
9a57685 [Davies Liu] fix python style
bd27874 [Davies Liu] fix scala style
7339be0 [Davies Liu] delete tests
7f53086 [Davies Liu] support transform(), refactor and cleanup
df098fc [Davies Liu] Merge branch 'master' into giwa
550dfd9 [giwa] WIP fixing 1.1 merge
5cdb6fa [giwa] changed for SCCallSiteSync
e685853 [giwa] meged with rebased 1.1 branch
2d32a74 [giwa] added some StreamingContextTestSuite
4a59e1e [giwa] WIP:added more test for StreamingContext
8ffdbf1 [giwa] added atexit to handle callback server
d5f5fcb [giwa] added comment for StreamingContext.sparkContext
63c881a [giwa] added StreamingContext.sparkContext
d39f102 [giwa] added StreamingContext.remember
d542743 [giwa] clean up code
2fdf0de [Matthew Farrellee] Fix scalastyle errors
c0a06bc [giwa] delete not implemented functions
f385976 [giwa] delete inproper comments
b0f2015 [giwa] added comment in dstream._test_output
bebb3f3 [giwa] remove the last brank line
fbed8da [giwa] revert pom.xml
8ed93af [giwa] fixed explanaiton
066ba90 [giwa] revert pom.xml
fa4af88 [giwa] remove duplicated import
6ae3caa [giwa] revert pom.xml
7dc7391 [giwa] fixed typo
62dc7a3 [giwa] clean up exmples
f04882c [giwa] clen up examples
b171ec3 [giwa] fixed pep8 violation
f198d14 [giwa] clean up code
3166d31 [giwa] clean up
c00e091 [giwa] change test case not to use awaitTermination
e80647e [giwa] adopted the latest compression way of python command
58e41ff [giwa] merge with master
455e5af [giwa] removed wasted print in DStream
af336b7 [giwa] add comments
ddd4ee1 [giwa] added TODO coments
99ce042 [giwa] added saveAsTextFiles and saveAsPickledFiles
2a06cdb [giwa] remove waste duplicated code
c5ecfc1 [giwa] basic function test cases are passed
8dcda84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
795b2cd [giwa] broke something
1e126bf [giwa] WIP: solved partitioned and None is not recognized
f67cf57 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
953deb0 [giwa] edited the comment to add more precise description
af610d3 [giwa] removed unnesessary changes
c1d546e [giwa] fixed PEP-008 violation
99410be [giwa] delete waste file
b3b0362 [giwa] added basic operation test cases
9cde7c9 [giwa] WIP added test case
bd3ba53 [giwa] WIP
5c04a5f [giwa] WIP: added PythonTestInputStream
019ef38 [giwa] WIP
1934726 [giwa] update comment
376e3ac [giwa] WIP
932372a [giwa] clean up dstream.py
0b09cff [giwa] added stop in StreamingContext
92e333e [giwa] implemented reduce and count function in Dstream
1b83354 [giwa] Removed the waste line
88f7506 [Ken Takagiwa] Kill py4j callback server properly
54b5358 [Ken Takagiwa] tried to restart callback server
4f07163 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
fe02547 [Ken Takagiwa] remove waste file
2ad7bd3 [Ken Takagiwa] clean up codes
6197a11 [Ken Takagiwa] clean up code
eb4bf48 [Ken Takagiwa] fix map function
98c2a00 [Ken Takagiwa] added count operation but this implementation need double check
58591d2 [Ken Takagiwa] reduceByKey is working
0df7111 [Ken Takagiwa] delete old file
f485b1d [Ken Takagiwa] fied input of socketTextDStream
dd6de81 [Ken Takagiwa] initial commit for socketTextStream
247fd74 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
4bcb318 [Ken Takagiwa] implementing transform function in Python
38adf95 [Ken Takagiwa] added reducedByKey not working yet
66fcfff [Ken Takagiwa] modify dstream.py to fix indent error
41886c2 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
0b99bec [Ken] initial commit for pySparkStreaming
c214199 [giwa] added testcase for combineByKey
5625bdc [giwa] added gorupByKey testcase
10ab87b [giwa] added sparkContext as input parameter in StreamingContext
10b5b04 [giwa] removed wasted print in DStream
e54f986 [giwa] add comments
16aa64f [giwa] added TODO coments
74535d4 [giwa] added saveAsTextFiles and saveAsPickledFiles
f76c182 [giwa] remove waste duplicated code
18c8723 [giwa] modified streaming test case to add coment
13fb44c [giwa] basic function test cases are passed
3000b2b [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
ff14070 [giwa] broke something
bcdec33 [giwa] WIP: solved partitioned and None is not recognized
270a9e1 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
bb10956 [giwa] edited the comment to add more precise description
253a863 [giwa] removed unnesessary changes
3d37822 [giwa] fixed PEP-008 violation
f21cab3 [giwa] delete waste file
878bad7 [giwa] added basic operation test cases
ce2acd2 [giwa] WIP added test case
9ad6855 [giwa] WIP
1df77f5 [giwa] WIP: added PythonTestInputStream
1523b66 [giwa] WIP
8a0fbbc [giwa] update comment
fe648e3 [giwa] WIP
29c2bc5 [giwa] initial commit for testcase
4d40d63 [giwa] clean up dstream.py
c462bb3 [giwa] added stop in StreamingContext
d2c01ba [giwa] clean up examples
3c45cd2 [giwa] implemented reduce and count function in Dstream
b349649 [giwa] Removed the waste line
3b498e1 [Ken Takagiwa] Kill py4j callback server properly
84a9668 [Ken Takagiwa] tried to restart callback server
9ab8952 [Tathagata Das] Added extra line.
05e991b [Tathagata Das] Added missing file
b1d2a30 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
678e854 [Ken Takagiwa] remove waste file
0a8bbbb [Ken Takagiwa] clean up codes
bab31c1 [Ken Takagiwa] clean up code
72b9738 [Ken Takagiwa] fix map function
d3ee86a [Ken Takagiwa] added count operation but this implementation need double check
15feea9 [Ken Takagiwa] edit python sparkstreaming example
6f98e50 [Ken Takagiwa] reduceByKey is working
c455c8d [Ken Takagiwa] added reducedByKey not working yet
dc6995d [Ken Takagiwa] delete old file
b31446a [Ken Takagiwa] fixed typo of network_workdcount.py
ccfd214 [Ken Takagiwa] added doctest for pyspark.streaming.duration
0d1b954 [Ken Takagiwa] fied input of socketTextDStream
f746109 [Ken Takagiwa] initial commit for socketTextStream
bb7ccf3 [Ken Takagiwa] remove unused import in python
224fc5e [Ken Takagiwa] add empty line
d2099d8 [Ken Takagiwa] sorted the import following Spark coding convention
5bac7ec [Ken Takagiwa] revert streaming/pom.xml
e1df940 [Ken Takagiwa] revert pom.xml
494cae5 [Ken Takagiwa] remove not implemented DStream functions in python
17a74c6 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
1a0f065 [Ken Takagiwa] implementing transform function in Python
d7b4d6f [Ken Takagiwa] added reducedByKey not working yet
87438e2 [Ken Takagiwa] modify dstream.py to fix indent error
b406252 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
454981d [Ken] initial commit for pySparkStreaming
150b94c [giwa] added some StreamingContextTestSuite
f7bc8f9 [giwa] WIP:added more test for StreamingContext
ee50c5a [giwa] added atexit to handle callback server
fdc9125 [giwa] added comment for StreamingContext.sparkContext
f5bfb70 [giwa] added StreamingContext.sparkContext
da09768 [giwa] added StreamingContext.remember
d68b568 [giwa] clean up code
4afa390 [giwa] clean up code
1fd6bc7 [Ken Takagiwa] Merge pull request #2 from mattf/giwa-master
d9d59fe [Matthew Farrellee] Fix scalastyle errors
67473a9 [giwa] delete not implemented functions
c97377c [giwa] delete inproper comments
2ea769e [giwa] added comment in dstream._test_output
3b27bd4 [giwa] remove the last brank line
acfcaeb [giwa] revert pom.xml
93f7637 [giwa] fixed explanaiton
50fd6f9 [giwa] revert pom.xml
4f82c89 [giwa] remove duplicated import
9d1de23 [giwa] revert pom.xml
7339df2 [giwa] fixed typo
9c85e48 [giwa] clean up exmples
24f95db [giwa] clen up examples
0d30109 [giwa] fixed pep8 violation
b7dab85 [giwa] improve test case
583e66d [giwa] move tests for streaming inside streaming directory
1d84142 [giwa] remove unimplement test
f0ea311 [giwa] clean up code
171edeb [giwa] clean up
4dedd2d [giwa] change test case not to use awaitTermination
268a6a5 [giwa] Changed awaitTermination not to call awaitTermincation in Scala. Just use time.sleep instread
09a28bf [giwa] improve testcases
58150f5 [giwa] Changed the test case to focus the test operation
199e37f [giwa] adopted the latest compression way of python command
185fdbf [giwa] merge with master
f1798c4 [giwa] merge with master
e70f706 [giwa] added testcase for combineByKey
e162822 [giwa] added gorupByKey testcase
97742fe [giwa] added sparkContext as input parameter in StreamingContext
14d4c0e [giwa] removed wasted print in DStream
6d8190a [giwa] add comments
4aa99e4 [giwa] added TODO coments
e9fab72 [giwa] added saveAsTextFiles and saveAsPickledFiles
94f2b65 [giwa] remove waste duplicated code
580fbc2 [giwa] modified streaming test case to add coment
99e4bb3 [giwa] basic function test cases are passed
7051a84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
35933e1 [giwa] broke something
9767712 [giwa] WIP: solved partitioned and None is not recognized
4f2d7e6 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
33c0f94d [giwa] edited the comment to add more precise description
774f18d [giwa] removed unnesessary changes
3a671cc [giwa] remove export PYSPARK_PYTHON in spark submit
8efa266 [giwa] fixed PEP-008 violation
fa75d71 [giwa] delete waste file
7f96294 [giwa] added basic operation test cases
3dda31a [giwa] WIP added test case
1f68b78 [giwa] WIP
c05922c [giwa] WIP: added PythonTestInputStream
1fd12ae [giwa] WIP
c880a33 [giwa] update comment
5d22c92 [giwa] WIP
ea4b06b [giwa] initial commit for testcase
5a9b525 [giwa] clean up dstream.py
79c5809 [giwa] added stop in StreamingContext
189dcea [giwa] clean up examples
b8d7d24 [giwa] implemented reduce and count function in Dstream
b6468e6 [giwa] Removed the waste line
b47b5fd [Ken Takagiwa] Kill py4j callback server properly
19ddcdd [Ken Takagiwa] tried to restart callback server
c9fc124 [Tathagata Das] Added extra line.
4caae3f [Tathagata Das] Added missing file
4eff053 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
5e822d4 [Ken Takagiwa] remove waste file
aeaf8a5 [Ken Takagiwa] clean up codes
9fa249b [Ken Takagiwa] clean up code
05459c6 [Ken Takagiwa] fix map function
a9f4ecb [Ken Takagiwa] added count operation but this implementation need double check
d1ee6ca [Ken Takagiwa] edit python sparkstreaming example
0b8b7d0 [Ken Takagiwa] reduceByKey is working
d25d5cf [Ken Takagiwa] added reducedByKey not working yet
7f7c5d1 [Ken Takagiwa] delete old file
967dc26 [Ken Takagiwa] fixed typo of network_workdcount.py
57fb740 [Ken Takagiwa] added doctest for pyspark.streaming.duration
4b69fb1 [Ken Takagiwa] fied input of socketTextDStream
02f618a [Ken Takagiwa] initial commit for socketTextStream
4ce4058 [Ken Takagiwa] remove unused import in python
856d98e [Ken Takagiwa] add empty line
490e338 [Ken Takagiwa] sorted the import following Spark coding convention
5594bd4 [Ken Takagiwa] revert pom.xml
2adca84 [Ken Takagiwa] remove not implemented DStream functions in python
e551e13 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
3758175 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
c5518b4 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
dcf243f [Ken Takagiwa] implementing transform function in Python
9af03f4 [Ken Takagiwa] added reducedByKey not working yet
6e0d9c7 [Ken Takagiwa] modify dstream.py to fix indent error
e497b9b [Ken Takagiwa] comment PythonDStream.PairwiseDStream
5c3a683 [Ken] initial commit for pySparkStreaming
665bfdb [giwa] added testcase for combineByKey
a3d2379 [giwa] added gorupByKey testcase
636090a [giwa] added sparkContext as input parameter in StreamingContext
e7ebb08 [giwa] removed wasted print in DStream
d8b593b [giwa] add comments
ea9c873 [giwa] added TODO coments
89ae38a [giwa] added saveAsTextFiles and saveAsPickledFiles
e3033fc [giwa] remove waste duplicated code
a14c7e1 [giwa] modified streaming test case to add coment
536def4 [giwa] basic function test cases are passed
2112638 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
080541a [giwa] broke something
0704b86 [giwa] WIP: solved partitioned and None is not recognized
90a6484 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
a65f302 [giwa] edited the comment to add more precise description
bdde697 [giwa] removed unnesessary changes
e8c7bfc [giwa] remove export PYSPARK_PYTHON in spark submit
3334169 [giwa] fixed PEP-008 violation
db0a303 [giwa] delete waste file
2cfd3a0 [giwa] added basic operation test cases
90ae568 [giwa] WIP added test case
a120d07 [giwa] WIP
f671cdb [giwa] WIP: added PythonTestInputStream
56fae45 [giwa] WIP
e35e101 [giwa] Merge branch 'master' into testcase
ba5112d [giwa] update comment
28aa56d [giwa] WIP
fb08559 [giwa] initial commit for testcase
a613b85 [giwa] clean up dstream.py
c40c0ef [giwa] added stop in StreamingContext
31e4260 [giwa] clean up examples
d2127d6 [giwa] implemented reduce and count function in Dstream
48f7746 [giwa] Removed the waste line
0f83eaa [Ken Takagiwa] delete py4j 0.8.1
1679808 [Ken Takagiwa] Kill py4j callback server properly
f96cd4e [Ken Takagiwa] tried to restart callback server
fe86198 [Ken Takagiwa] add py4j 0.8.2.1 but server is not launched
1064fe0 [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
28c6620 [Ken Takagiwa] Implemented DStream.foreachRDD in the Python API using Py4J callback server
85b0fe1 [Ken Takagiwa] Merge pull request #1 from tdas/python-foreach
54e2e8c [Tathagata Das] Added extra line.
e185338 [Tathagata Das] Added missing file
a778d4b [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
cc2092b [Ken Takagiwa] remove waste file
d042ac6 [Ken Takagiwa] clean up codes
84a021f [Ken Takagiwa] clean up code
bd20e17 [Ken Takagiwa] fix map function
d01a125 [Ken Takagiwa] added count operation but this implementation need double check
7d05109 [Ken Takagiwa] merge with remote branch
ae464e0 [Ken Takagiwa] edit python sparkstreaming example
04af046 [Ken Takagiwa] reduceByKey is working
3b6d7b0 [Ken Takagiwa] implementing transform function in Python
571d52d [Ken Takagiwa] added reducedByKey not working yet
5720979 [Ken Takagiwa] delete old file
e604fcb [Ken Takagiwa] fixed typo of network_workdcount.py
4b7c08b [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
ce7d426 [Ken Takagiwa] added doctest for pyspark.streaming.duration
a8c9fd5 [Ken Takagiwa] fixed for socketTextStream
a61fa9e [Ken Takagiwa] fied input of socketTextDStream
1e84f41 [Ken Takagiwa] initial commit for socketTextStream
6d012f7 [Ken Takagiwa] remove unused import in python
25d30d5 [Ken Takagiwa] add empty line
6e0a64a [Ken Takagiwa] sorted the import following Spark coding convention
fa4a7fc [Ken Takagiwa] revert streaming/pom.xml
8f8202b [Ken Takagiwa] revert streaming pom.xml
c9d79dd [Ken Takagiwa] revert pom.xml
57e3e52 [Ken Takagiwa] remove not implemented DStream functions in python
0a516f5 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
a7a0b5c [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
72bfc66 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
69e9cd3 [Ken Takagiwa] implementing transform function in Python
94a0787 [Ken Takagiwa] added reducedByKey not working yet
88068cf [Ken Takagiwa] modify dstream.py to fix indent error
1367be5 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
eb2b3ba [Ken] Merge remote-tracking branch 'upstream/master'
d8e51f9 [Ken] initial commit for pySparkStreaming
2014-10-12 02:46:56 -07:00
Joseph K. Bradley b92bd5a2f2 [SPARK-3841] [mllib] Pretty-print params for ML examples
Provide a parent class for the Params case classes used in many MLlib examples, where the parent class pretty-prints the case class fields:
Param1Name	Param1Value
Param2Name	Param2Value
...
Using this class will make it easier to print test settings to logs.

Also, updated DecisionTreeRunner to print a little more info.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2700 from jkbradley/dtrunner-update and squashes the following commits:

cff873f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
7a08ae4 [Joseph K. Bradley] code review comment updates
b4d2043 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
d8228a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
0fc9c64 [Joseph K. Bradley] Added abstract TestParams class for mllib example parameters
12b7798 [Joseph K. Bradley] Added abstract class TestParams for pretty-printing Params values
5f84f03 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
f7441b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
19eb6fc [Joseph K. Bradley] Updated DecisionTreeRunner to print training time.
2014-10-08 14:23:21 -07:00
Reza Zadeh 3d7b36e0de [SPARK-3790][MLlib] CosineSimilarity Example
Provide example  for `RowMatrix.columnSimilarity()`

Author: Reza Zadeh <rizlar@gmail.com>

Closes #2622 from rezazadeh/dimsumexample and squashes the following commits:

8f20b82 [Reza Zadeh] update comment
379066d [Reza Zadeh] cache rows
792b81c [Reza Zadeh] Address review comments
e573c7a [Reza Zadeh] Average absolute error
b15685f [Reza Zadeh] Use scopt. Distribute evaluation.
eca3dfd [Reza Zadeh] Documentation
ac96fb2 [Reza Zadeh] Compute approximation error, add command line.
4533579 [Reza Zadeh] CosineSimilarity Example
2014-10-07 16:40:16 -07:00
aniketbhatnagar 93861a5e87 SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
This patch forces use of commons http client 4.2 in Kinesis-asl profile so that the AWS SDK does not run into dependency conflicts

Author: aniketbhatnagar <aniket.bhatnagar@gmail.com>

Closes #2535 from aniketbhatnagar/Kinesis-HttpClient-Dep-Fix and squashes the following commits:

aa2079f [aniketbhatnagar] Merge branch 'Kinesis-HttpClient-Dep-Fix' of https://github.com/aniketbhatnagar/spark into Kinesis-HttpClient-Dep-Fix
73f55f6 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
70cc75b [aniketbhatnagar] deleted merge files
725dbc9 [aniketbhatnagar] Merge remote-tracking branch 'origin/Kinesis-HttpClient-Dep-Fix' into Kinesis-HttpClient-Dep-Fix
4ed61d8 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
9cd6103 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
2014-10-01 18:31:18 -07:00
jyotiska 17333c7a3c Python SQL Example Code
SQL example code for Python, as shown on [SQL Programming Guide](https://spark.apache.org/docs/1.0.2/sql-programming-guide.html)

Author: jyotiska <jyotiska123@gmail.com>

Closes #2521 from jyotiska/sql_example and squashes the following commits:

1471dcb [jyotiska] added imports for sql
b25e436 [jyotiska] pep 8 compliance
43fd10a [jyotiska] lines broken to maintain 80 char limit
b4fdf4e [jyotiska] removed blank lines
83d5ab7 [jyotiska] added inferschema and applyschema to the demo
306667e [jyotiska] replaced blank line with end line
c90502a [jyotiska] fixed new line
4939a70 [jyotiska] added new line at end for python style
0b46148 [jyotiska] fixed appname for python sql example
8f67b5b [jyotiska] added python sql example
2014-10-01 13:52:50 -07:00
Gaspar Munoz b81ee0b46d Typo error in KafkaWordCount example
topicpMap to topicMap

Author: Gaspar Munoz <munozs.88@gmail.com>

Closes #2614 from gasparms/patch-1 and squashes the following commits:

00aab2c [Gaspar Munoz] Typo error in KafkaWordCount example
2014-10-01 13:47:22 -07:00
Sean Owen dcb2f73f1c SPARK-2626 [DOCS] Stop SparkContext in all examples
Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)

Author: Sean Owen <sowen@cloudera.com>

Closes #2575 from srowen/SPARK-2626 and squashes the following commits:

5b2baae [Sean Owen] Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)
2014-10-01 11:28:22 -07:00
Joseph K. Bradley 7bf6cc9701 [SPARK-3751] [mllib] DecisionTree: example update + print options
DecisionTreeRunner functionality additions:
* Allow user to pass in a test dataset
* Do not print full model if the model is too large.

As part of this, modify DecisionTreeModel and RandomForestModel to allow printing less info.  Proposed updates:
* toString: prints model summary
* toDebugString: prints full model (named after RDD.toDebugString)

Similar update to Python API:
* __repr__() now prints a model summary
* toDebugString() now prints the full model

CC: mengxr  chouqin manishamde codedeft  Small update (whomever can take a look).  Thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2604 from jkbradley/dtrunner-update and squashes the following commits:

b2b3c60 [Joseph K. Bradley] re-added python sql doc test, temporarily removed before
07b1fae [Joseph K. Bradley] repr() now prints a model summary toDebugString() now prints the full model
1d0d93d [Joseph K. Bradley] Updated DT and RF to print less when toString is called. Added toDebugString for verbose printing.
22eac8c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
e007a95 [Joseph K. Bradley] Updated DecisionTreeRunner to accept a test dataset.
2014-10-01 01:03:24 -07:00
Joseph K. Bradley 0dc2b6361d [SPARK-1545] [mllib] Add Random Forests
This PR adds RandomForest to MLlib.  The implementation is basic, and future performance optimizations will be important.  (Note: RFs = Random Forests.)

# Overview

## RandomForest
* trains multiple trees at once to reduce the number of passes over the data
* allows feature subsets at each node
* uses a queue of nodes instead of fixed groups for each level

This implementation is based an implementation by manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations).  Thank you for your inputs!

## Testing

Correctness: This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets.

Performance: This has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs).  Results below.

### Regression tests for DecisionTree

Summary: For training 1 tree, there are small regressions, especially from feature subsampling.

In the table below, each row is a single (random) dataset.  The 2 different sets of result columns are for 2 different RF implementations:
* (numTrees): This is from an earlier commit, after implementing RandomForest to train multiple trees at once.  It does not include any code for feature subsampling.
* (feature subsets): This is from this current PR's code, after implementing feature subsampling.
These tests were to identify regressions in DecisionTree, so they are training 1 tree with all of the features (i.e., no feature subsampling).

These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels).  Speedup values < 1 indicate slowdowns from the old DecisionTree implementation.

numInstances | numFeatures | runtime (sec) | speedup | runtime (sec) | speedup
---- | ---- | ---- | ---- | ---- | ----
 | | (numTrees) | (numTrees) | (feature subsets) | (feature subsets)
20000 | 100 | 4.051 | 1.044433473 | 4.478 | 0.9448414471
20000 | 500 | 8.472 | 1.104461756 | 9.315 | 1.004508857
20000 | 1500 | 19.354 | 1.05854087 | 20.863 | 0.9819776638
20000 | 3500 | 43.674 | 1.072033704 | 45.887 | 1.020332556
200000 | 100 | 4.196 | 1.171830315 | 4.848 | 1.014232673
200000 | 500 | 8.926 | 1.082791844 | 9.771 | 0.989151571
200000 | 1500 | 20.58 | 1.068415938 | 22.134 | 0.9934038131
200000 | 3500 | 48.043 | 1.075203464 | 52.249 | 0.9886505005
2000000 | 100 | 4.944 | 1.01355178 | 5.796 | 0.8645617667
2000000 | 500 | 11.11 | 1.016831683 | 12.482 | 0.9050632911
2000000 | 1500 | 31.144 | 1.017852556 | 35.274 | 0.8986789136
2000000 | 3500 | 79.981 | 1.085382778 | 101.105 | 0.8586123337
20000000 | 100 | 8.304 | 0.9270231214 | 9.073 | 0.8484514494
20000000 | 500 | 28.174 | 1.083268262 | 34.236 | 0.8914592826
20000000 | 1500 | 143.97 | 0.9579634646 | 159.275 | 0.8659111599

### Tests for forests

I have run other tests with numTrees=10 and with sqrt(numFeatures), and those indicate that multi-model training and feature subsets can speed up training for forests, especially when training deeper trees.

# Details on specific classes

## Changes to DecisionTree
* Main train() method is now in RandomForest.
* findBestSplits() is no longer needed.  (It split levels into groups, but we now use a queue of nodes.)
* Many small changes to support RFs.  (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.)

## RandomForest
* Main train() method is from old DecisionTree.
* selectNodesToSplit: Note that it selects nodes and feature subsets jointly to track memory usage.

## RandomForestModel
* Stores an Array[DecisionTreeModel]
* Prediction:
 * For classification, most common label.  For regression, mean.
 * We could support other methods later.

## examples/.../DecisionTreeRunner
* This now takes numTrees and featureSubsetStrategy, to support RFs.

## DTStatsAggregator
* 2 types of functionality (w/ and w/o subsampling features): These require different indexing methods.  (We could treat both as subsampling, but this is less efficient
  DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality.

## impurities
* These now take instance weights.

## Node
* Some vals changed to vars.
 * This is unfortunately a public API change (DeveloperApi).  This could be avoided by creating a LearningNode struct, but would be awkward.

## RandomForestSuite
Please let me know if there are missing tests!

## BaggedPoint
This wraps TreePoint and holds bootstrap weights/counts.

# Design decisions

* BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on.

* RandomForest public API: What options should be easily supported by the train* methods?  Should ALL options be in the Java-friendly constructors?  Should there be a constructor taking Strategy?

* Feature subsampling options: What options should be supported?  scikit-learn supports the same options, except for "onethird."  One option would be to allow users to specific fractions ("0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1].

* Splits and bins are computed before bootstrapping, so all trees use the same discretization.

* One queue, instead of one queue per tree.

CC: mengxr manishamde codedeft chouqin  Please let me know if you have suggestions---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2435 from jkbradley/rfs-new and squashes the following commits:

c694174 [Joseph K. Bradley] Fixed typo
cc59d78 [Joseph K. Bradley] fixed imports
e25909f [Joseph K. Bradley] Simplified node group maps.  Specifically, created NodeIndexInfo to store node index in agg and feature subsets, and no longer create extra maps in findBestSplits
fbe9a1e [Joseph K. Bradley] Changed default featureSubsetStrategy to be sqrt for classification, onethird for regression.  Updated docs with references.
ef7c293 [Joseph K. Bradley] Updates based on code review.  Most substantial changes: * Simplified DTStatsAggregator * Made RandomForestModel.trees public * Added test for regression to RandomForestSuite
593b13c [Joseph K. Bradley] Fixed bug in metadata for computing log2(num features).  Now it checks >= 1.
a1a08df [Joseph K. Bradley] Removed old comments
866e766 [Joseph K. Bradley] Changed RandomForestSuite randomized tests to use multiple fixed random seeds.
ff8bb96 [Joseph K. Bradley] removed usage of null from RandomForest and replaced with Option
bf1a4c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
6b79c07 [Joseph K. Bradley] Added RandomForestSuite, and fixed small bugs, style issues.
d7753d4 [Joseph K. Bradley] Added numTrees and featureSubsetStrategy to DecisionTreeRunner (to support RandomForest).  Fixed bugs so that RandomForest now runs.
746d43c [Joseph K. Bradley] Implemented feature subsampling.  Tested DecisionTree but not RandomForest.
6309d1d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new.  Added RandomForestModel.toString
b7ae594 [Joseph K. Bradley] Updated docs.  Small fix for bug which does not cause errors: No longer allocate unused child nodes for leaf nodes.
121c74e [Joseph K. Bradley] Basic random forests are implemented.  Random features per node not yet implemented.  Test suite not implemented.
325d18a [Joseph K. Bradley] Merge branch 'chouqin-dt-preprune' into rfs-new
4ef9bf1 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
6da8571 [Joseph K. Bradley] RFs partly implemented, not done yet
eddd1eb [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new
5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1
0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc
eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix
d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated
d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160
1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training.
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code:
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-28 21:44:50 -07:00
Uri Laserson 248232936e [SPARK-3389] Add Converter for ease of Parquet reading in PySpark
https://issues.apache.org/jira/browse/SPARK-3389

Author: Uri Laserson <laserson@cloudera.com>

Closes #2256 from laserson/SPARK-3389 and squashes the following commits:

0ed363e [Uri Laserson] PEP8'd the python file
0b4b380 [Uri Laserson] Moved converter to examples and added python example
eecf4dc [Uri Laserson] [SPARK-3389] Add Converter for ease of Parquet reading in PySpark
2014-09-27 21:48:05 -07:00
Matthew Farrellee a03e5b81e9 [SPARK-1701] [PySpark] remove slice terminology from python examples
Author: Matthew Farrellee <matt@redhat.com>

Closes #2304 from mattf/SPARK-1701-partition-over-slice-for-python-examples and squashes the following commits:

928a581 [Matthew Farrellee] [SPARK-1701] [PySpark] remove slice terminology from python examples
2014-09-19 14:35:22 -07:00
qiping.lqp fdb302f49c [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API
Added minInstancesPerNode, minInfoGain params to:
* DecisionTreeRunner.scala example
* Python API (tree.py)

Also:
* Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
* small style fixes

CC: mengxr

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:

61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-15 17:43:26 -07:00
Prashant Sharma f493f7982b [SPARK-3452] Maven build should skip publishing artifacts people shouldn...
...'t depend on

Publish local in maven term is `install`

and publish otherwise is `deploy`

So disabled both for following projects.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2329 from ScrapCodes/SPARK-3452/maven-skip-install and squashes the following commits:

257b79a [Prashant Sharma] [SPARK-3452] Maven build should skip publishing artifacts people shouldn't depend on
2014-09-14 21:17:29 -07:00
Prashant Sharma 02b5ac7191 Minor - Fix trivial compilation warnings.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2331 from ScrapCodes/compilation-warn and squashes the following commits:

44c1e76 [Prashant Sharma] Minor - Fix trivial compilation warnings.
2014-09-09 14:42:28 -07:00
Xiangrui Meng 50a4fa774a [SPARK-3443][MLLIB] update default values of tree:
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
2014-09-08 18:59:57 -07:00
GuoQiang Li 607ae39c22 [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
Author: GuoQiang Li <witgo@qq.com>

Closes #2268 from witgo/SPARK-3397 and squashes the following commits:

eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
2014-09-06 15:04:50 -07:00
Nicholas Chammas 9422c4ee0e [SPARK-3361] Expand PEP 8 checks to include EC2 script and Python examples
This PR resolves [SPARK-3361](https://issues.apache.org/jira/browse/SPARK-3361) by expanding the PEP 8 checks to cover the remaining Python code base:
* The EC2 script
* All Python / PySpark examples

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2297 from nchammas/pep8-rulez and squashes the following commits:

1e5ac9a [Nicholas Chammas] PEP 8 fixes to Python examples
c3dbeff [Nicholas Chammas] PEP 8 fixes to EC2 script
65ef6e8 [Nicholas Chammas] expand PEP 8 checks
2014-09-05 23:08:54 -07:00
RJ Nowling e5d376801d [SPARK-3263][GraphX] Fix changes made to GraphGenerator.logNormalGraph in PR #720
PR #720 made multiple changes to GraphGenerator.logNormalGraph including:

* Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect.
* Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated.
* Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter.
* Failed to update scala docs and programming guide for API changes
* Added a Synthetic Benchmark in the examples.

This PR:
* Removes the in-line calls and calls original vertex / edge generation functions again
* Adds an optional seed parameter for deterministic behavior (when desired)
* Keeps the number of partitions parameter that was added.
* Keeps compatibility with the synthetic benchmark example
* Maintains backwards-compatible API

Author: RJ Nowling <rnowling@gmail.com>
Author: Ankur Dave <ankurdave@gmail.com>

Closes #2168 from rnowling/graphgenrand and squashes the following commits:

f1cd79f [Ankur Dave] Style fixes
e11918e [RJ Nowling] Fix bad comparisons in unit tests
785ac70 [RJ Nowling] Fix style error
c70868d [RJ Nowling] Fix logNormalGraph scala doc for seed
41fd1f8 [RJ Nowling] Fix logNormalGraph scala doc for seed
799f002 [RJ Nowling] Added test for different seeds for sampleLogNormal
43949ad [RJ Nowling] Added test for different seeds for generateRandomEdges
2faf75f [RJ Nowling] Added unit test for logNormalGraph
82f22397 [RJ Nowling] Add unit test for sampleLogNormal
b99cba9 [RJ Nowling] Make sampleLogNormal private to Spark (vs private) for unit testing
6803da1 [RJ Nowling] Add GraphGeneratorsSuite with test for generateRandomEdges
1c8fc44 [RJ Nowling] Connected components part of SynthBenchmark was failing to call count on RDD before printing
dfbb6dd [RJ Nowling] Fix parameter name in SynthBenchmark docs
b5eeb80 [RJ Nowling] Add optional seed parameter to SynthBenchmark and set default to randomly generate a seed
1ff8d30 [RJ Nowling] Fix bug in generateRandomEdges where numVertices instead of numEdges was used to control number of edges to generate
98bb73c [RJ Nowling] Add documentation for logNormalGraph parameters
d40141a [RJ Nowling] Fix style error
684804d [RJ Nowling] revert PR #720 which introduce errors in logNormalGraph and messed up seeding of RNGs.  Add user-defined optional seed for deterministic behavior
c183136 [RJ Nowling] Fix to deterministic GraphGenerators.logNormalGraph that allows generating graphs randomly using optional seed.
015010c [RJ Nowling] Fixed GraphGenerator logNormalGraph API to make backward-incompatible change in commit 894ecde04
2014-09-03 14:16:06 -07:00
Larry Xiao 7c92b49d6b [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples
to support ~/spark/bin/run-example GraphXAnalytics triangles
/soc-LiveJournal1.txt --numEPart=256

Author: Larry Xiao <xiaodi@sjtu.edu.cn>

Closes #1766 from larryxiao/1986 and squashes the following commits:

bb77cd9 [Larry Xiao] [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples
2014-09-02 18:29:08 -07:00
Marcelo Vanzin b6cf134817 [SPARK-2889] Create Hadoop config objects consistently.
Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config.

This change modifies most places to use SparkHadoopUtil to initialize configs, and make that method do the translation that previously was only done inside SparkContext.

The places that were not changed fall in one of the following categories:
- Test code where this doesn't really matter
- Places deep in the code where plumbing SparkConf would be too difficult for very little gain
- Default values for arguments - since the caller can provide their own config in that case

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1843 from vanzin/SPARK-2889 and squashes the following commits:

52daf35 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
f179013 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
51e71cf [Marcelo Vanzin] Add test to ensure that overriding Yarn configs works.
53f9506 [Marcelo Vanzin] Add DeveloperApi annotation.
3d345cb [Marcelo Vanzin] Restore old method for backwards compat.
fc45067 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
0ac3fdf [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
3f26760 [Marcelo Vanzin] Compilation fix.
f16cadd [Marcelo Vanzin] Initialize config in SparkHadoopUtil.
b8ab173 [Marcelo Vanzin] Update Utils API to take a Configuration argument.
1e7003f [Marcelo Vanzin] Replace explicit Configuration instantiation with SparkHadoopUtil.
2014-08-30 14:48:07 -07:00
wangfei 13901764f4 [SPARK-3296][mllib] spark-example should be run-example in head notation of DenseKMeans and SparseNaiveBayes
`./bin/spark-example`  should be `./bin/run-example` in DenseKMeans and SparseNaiveBayes

Author: wangfei <wangfei_hello@126.com>

Closes #2193 from scwf/run-example and squashes the following commits:

207eb3a [wangfei] spark-example should be run-example
27a8999 [wangfei] ./bin/spark-example should be ./bin/run-example
2014-08-29 17:37:15 -07:00
Yadong Qi 39012452da [SPARK-3285] [examples] Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus)
Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), so we'd better use values.sum instead of values.foldLeft(0)(_ + _)

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes #2182 from watermen/bug-fix3 and squashes the following commits:

17be9fb [Yadong Qi] Update CheckpointSuite.scala
714bda5 [Yadong Qi] Update BasicOperationsSuite.scala
57e704c [Yadong Qi] Update StatefulNetworkWordCount.scala
2014-08-28 14:08:48 -07:00
Kousuke Saruta 62f5009f67 [SPARK-2976] Replace tabs with spaces
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1895 from sarutak/SPARK-2976 and squashes the following commits:

1cf7e69 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2976
d1e0666 [Kousuke Saruta] Modified styles
c5e80a4 [Kousuke Saruta] Remove tab from JavaPageRank.java and JavaKinesisWordCountASL.java
c003b36 [Kousuke Saruta] Removed tab from sorttable.js
2014-08-25 19:40:23 -07:00
Joseph K. Bradley 050f8d01e4 [SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples)
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr  @manishamde @atalwalkar

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2063 from jkbradley/dt-docs and squashes the following commits:

2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation.  Added Java, Python examples.
2014-08-21 00:17:29 -07:00
Marcelo Vanzin c9f743957f [SPARK-2848] Shade Guava in uber-jars.
For further discussion, please check the JIRA entry.

This change moves Guava classes to a different package so that they don't conflict with the user-provided Guava (or the Hadoop-provided one). Since one class (Optional) was exposed through Spark's public API, that class was forked from Guava at the current dependency version (14.0.1) so that it can be kept going forward (until the API is cleaned).

Note this change has a few implications:
- *all* classes in the final jars will reference the relocated classes. If Hadoop classes are included (i.e. "-Phadoop-provided" is not activated), those will also reference the Guava 14 classes (instead of the Guava 11 classes from the Hadoop classpath).
- if the Guava version in Spark is ever changed, the new Guava will still reference the forked Optional class; this may or may not be a problem, but in the long term it's better to think about removing Optional from the public API.

For the end user, there are two visible implications:

- Guava is not provided as a transitive dependency anymore (since it's "provided" in Spark)
- At runtime, unless they provide their own, they'll either have no Guava or Hadoop's version of Guava (11), depending on how they set up their classpath.

Note that this patch does not change the sbt deliverables; those will still contain guava in its original package, and provide guava as a compile-time dependency. This assumes that maven is the canonical build, and sbt-built artifacts are not (officially) published.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1813 from vanzin/SPARK-2848 and squashes the following commits:

9bdffb0 [Marcelo Vanzin] Undo sbt build changes.
819b445 [Marcelo Vanzin] Review feedback.
05e0a3d [Marcelo Vanzin] Merge branch 'master' into SPARK-2848
fef4370 [Marcelo Vanzin] Unfork Optional.java.
d3ea8e1 [Marcelo Vanzin] Exclude asm classes from final jar.
637189b [Marcelo Vanzin] Add hacky filter to prefer Spark's copy of Optional.
2fec990 [Marcelo Vanzin] Shade Guava in the sbt build.
616998e [Marcelo Vanzin] Shade Guava in the maven build, fork Guava's Optional.java.
2014-08-20 16:23:10 -07:00
Xiangrui Meng 217b5e915e [SPARK-3108][MLLIB] add predictOnValues to StreamingLR and fix predictOn
It is useful in streaming to allow users to carry extra data with the prediction, for monitoring the prediction error for example. freeman-lab

Author: Xiangrui Meng <meng@databricks.com>

Closes #2023 from mengxr/predict-on-values and squashes the following commits:

cac47b8 [Xiangrui Meng] add classtag
2821b3b [Xiangrui Meng] use mapValues
0925efa [Xiangrui Meng] add predictOnValues to StreamingLR and fix predictOn
2014-08-18 18:20:54 -07:00
Joseph K. Bradley c8b16ca0d8 [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx  Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
2014-08-18 18:01:39 -07:00
Joseph K. Bradley 115eeb30dd [mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier).
* Fixed same bug in python doc tests, and added tree.py to doc tests.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:

b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-18 14:40:05 -07:00
Xiangrui Meng 7e70708a99 [SPARK-3048][MLLIB] add LabeledPoint.parse and remove loadStreamingLabeledPoints
Move `parse()` from `LabeledPointParser` to `LabeledPoint` and make it public. This breaks binary compatibility only when a user uses synthesized methods like `tupled` and `curried`, which is rare.

`LabeledPoint.parse` is more consistent with `Vectors.parse`, which is why `LabeledPointParser` is not preferred.

freeman-lab tdas

Author: Xiangrui Meng <meng@databricks.com>

Closes #1952 from mengxr/labelparser and squashes the following commits:

c818fb2 [Xiangrui Meng] merge master
ce20e6f [Xiangrui Meng] update mima excludes
b386b8d [Xiangrui Meng] fix tests
2436b3d [Xiangrui Meng] add parse() to LabeledPoint
2014-08-16 15:13:34 -07:00
Xiangrui Meng 5d25c0b74f [SPARK-3078][MLLIB] Make LRWithLBFGS API consistent with others
Should ask users to set parameters through the optimizer. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #1973 from mengxr/lr-lbfgs and squashes the following commits:

e3efbb1 [Xiangrui Meng] fix tests
21b3579 [Xiangrui Meng] fix method name
641eea4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lr-lbfgs
456ab7c [Xiangrui Meng] update LRWithLBFGS
2014-08-15 21:04:29 -07:00
Kan Zhang 9422a9b084 [SPARK-2736] PySpark converter and example script for reading Avro files
JIRA: https://issues.apache.org/jira/browse/SPARK-2736

This patch includes:
1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
3. Fixing a classloading issue.

cc @MLnick @JoshRosen @mateiz

Author: Kan Zhang <kzhang@apache.org>

Closes #1916 from kanzhang/SPARK-2736 and squashes the following commits:

02443f8 [Kan Zhang] [SPARK-2736] Adding .avsc files to .rat-excludes
f74e9a9 [Kan Zhang] [SPARK-2736] nit: clazz -> className
82cc505 [Kan Zhang] [SPARK-2736] Update data sample
0be7761 [Kan Zhang] [SPARK-2736] Example pyspark script and data files
c8e5881 [Kan Zhang] [SPARK-2736] Trying to work with all 3 Avro data models
2271a5b [Kan Zhang] [SPARK-2736] Using the right class loader to find Avro classes
536876b [Kan Zhang] [SPARK-2736] Adding Avro to Java converter
2014-08-14 19:03:51 -07:00
Michael Armbrust 236dfac676 [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
Many users have reported being confused by the distinction between the `sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot be used to read hive tables.  In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect with be used for parsing.  For SQLContext this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.

The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.

**This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**

For example: `hiveContex.sql("SELECT 1")` will now throw a parsing exception by default.

Author: Michael Armbrust <michael@databricks.com>

Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:

ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
20c43f8 [Michael Armbrust] override function instead of just setting the value
7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
2014-08-03 12:28:29 -07:00
Michael Armbrust 1a8043739d [SPARK-2739][SQL] Rename registerAsTable to registerTempTable
There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.

Author: Michael Armbrust <michael@databricks.com>

Closes #1743 from marmbrus/registerTempTable and squashes the following commits:

d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
4dff086 [Michael Armbrust] Fix .java files too
89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
2014-08-02 18:27:04 -07:00
Chris Fregly 91f9504e60 [SPARK-1981] Add AWS Kinesis streaming support
Author: Chris Fregly <chris@fregly.com>

Closes #1434 from cfregly/master and squashes the following commits:

4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
2014-08-02 13:35:35 -07:00
Joseph K. Bradley 3f67382e7c [SPARK-2478] [mllib] DecisionTree Python API
Added experimental Python API for Decision Trees.

API:
* class DecisionTreeModel
** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
** numNodes()
** depth()
** __str__()
* class DecisionTree
** trainClassifier()
** trainRegressor()
** train()

Examples and testing:
* Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
* Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors

Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.

CC mengxr manishamde

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:

3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
fa10ea7 [Joseph K. Bradley] Small style update
7968692 [Joseph K. Bradley] small braces typo fix
e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
93953f1 [Joseph K. Bradley] Likely done with Python API.
6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
2b20c61 [Joseph K. Bradley] Small doc and style updates
1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
2014-08-02 13:07:17 -07:00
Jeremy Freeman f6a1899306 Streaming mllib [SPARK-2438][MLLIB]
This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

__Summary of additions:__

_StreamingLinearAlgorithm_
- An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.

_StreamingLinearRegressionWithSGD_
- Class and companion object for running streaming linear regression

_StreamingLinearRegressionTestSuite_
- Unit tests

_StreamingLinearRegression_
- Example use case: fitting a model online to data from one stream, and making predictions on other data

__Notes__
- If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).

Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: freeman <the.freeman.lab@gmail.com>

Closes #1361 from freeman-lab/streaming-mllib and squashes the following commits:

775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
4086fee [Jeremy Freeman] Fixed current weight formatting
8b95b27 [Jeremy Freeman] Restored broadcasting
29f27ec [Jeremy Freeman] Formatting
8711c41 [Jeremy Freeman] Used return to avoid indentation
777b596 [Jeremy Freeman] Restored treeAggregate
74cf440 [Jeremy Freeman] Removed static methods
d28cf9a [Jeremy Freeman] Added usage notes
c3326e7 [Jeremy Freeman] Improved documentation
9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
66eba5e [Jeremy Freeman] Fixed line lengths
2fe0720 [Jeremy Freeman] Minor cleanup
7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
b9b69f6 [Jeremy Freeman] Added setter methods
c3f8b5a [Jeremy Freeman] Modified logging
00aafdc [Jeremy Freeman] Add modifiers
14b801e [Jeremy Freeman] Name changes
c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
4b0a5d3 [Jeremy Freeman] Cleaned up tests
74188d6 [Jeremy Freeman] Eliminate dependency on commons
50dd237 [Jeremy Freeman] Removed experimental tag
6bfe1e6 [Jeremy Freeman] Fixed imports
a2a63ad [freeman] Makes convergence test more robust
86220bc [freeman] Streaming linear regression unit tests
fb4683a [freeman] Minor changes for scalastyle consistency
fd31e03 [freeman] Changed logging behavior
453974e [freeman] Fixed indentation
c4b1143 [freeman] Streaming linear regression
604f4d7 [freeman] Expanded private class to include mllib
d99aa85 [freeman] Helper methods for streaming MLlib apps
0898add [freeman] Added dependency on streaming
2014-08-01 20:10:26 -07:00
Joseph K. Bradley b124de584a [SPARK-2756] [mllib] Decision tree bug fixes
(1) Inconsistent aggregate (agg) indexing for unordered features.
(2) Fixed gain calculations for edge cases.
(3) One-off error in choosing thresholds for continuous features for small datasets.
(4) (not a bug) Changed meaning of tree depth by 1 to fit scikit-learn and rpart. (Depth 1 used to mean 1 leaf node; depth 0 now means 1 leaf node.)

Other updates, to help with tests:
* Updated DecisionTreeRunner to print more info.
* Added utility functions to DecisionTreeModel: toString, depth, numNodes
* Improved internal DecisionTree documentation

Bug fix details:

(1) Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.

Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
* Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.

(2) Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.

Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).

Small updates to documentation and coding style.

(3) Bug fix: Off-by-1 when finding thresholds for splits for continuous features.

* Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
* Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.
* Fix: The threshold is set to be the average of 2 consecutive (sorted) examples’ feature values.  E.g.: If the old code set the threshold using example i, the new code sets the threshold using exam
* Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.

CC: mengxr manishamde  Please let me know if I missed something!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1673 from jkbradley/decisiontree-bugfix and squashes the following commits:

2b20c61 [Joseph K. Bradley] Small doc and style updates
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
2014-07-31 20:51:48 -07:00
Michael Armbrust 72cfb13987 [SPARK-2397][SQL] Deprecate LocalHiveContext
LocalHiveContext is redundant with HiveContext.  The only difference is it creates `./metastore` instead of `./metastore_db`.

Author: Michael Armbrust <michael@databricks.com>

Closes #1641 from marmbrus/localHiveContext and squashes the following commits:

e5ec497 [Michael Armbrust] Add deprecation version
626e056 [Michael Armbrust] Don't remove from imports yet
905cc5f [Michael Armbrust] Merge remote-tracking branch 'apache/master' into localHiveContext
1c2727e [Michael Armbrust] Deprecate LocalHiveContext
2014-07-31 11:26:43 -07:00
strat0sphere da50176683 Update DecisionTreeRunner.scala
Author: strat0sphere <stratos.dimopoulos@gmail.com>

Closes #1676 from strat0sphere/patch-1 and squashes the following commits:

044d2fa [strat0sphere] Update DecisionTreeRunner.scala
2014-07-30 17:57:50 -07:00
Sean Owen e9b275b769 SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets
Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.

Author: Sean Owen <srowen@gmail.com>

Closes #1663 from srowen/SPARK-2341 and squashes the following commits:

8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
18a8c8e [Sean Owen] Updates from review
83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
2014-07-30 17:34:32 -07:00
Kan Zhang 94d1f46fc4 [SPARK-2024] Add saveAsSequenceFile to PySpark
JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024

This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.

* Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.

* Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.

* No out-of-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.

* Added HBase and Cassandra output examples to show how custom output formats and converters can be used.

cc MLnick mateiz ahirreddy pwendell

Author: Kan Zhang <kzhang@apache.org>

Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:

c01e3ef [Kan Zhang] [SPARK-2024] code formatting
6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10
57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
2014-07-30 13:19:05 -07:00
Hari Shreedharan 800ecff4b1 [STREAMING] SPARK-1729. Make Flume pull data from source, rather than the current pu...
...sh model

Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
receiver fails, it currently has to be restarted on the same node to be able to receive data.

This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
multiple threads for better performance.

Author: Hari Shreedharan <harishreedharan@gmail.com>
Author: Hari Shreedharan <hshreedharan@apache.org>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: harishreedharan <hshreedharan@cloudera.com>

Closes #807 from harishreedharan/master and squashes the following commits:

e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
1f47364 [Hari Shreedharan] Minor fixes.
73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
65b76b4 [Hari Shreedharan] Fixing the unit test.
e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
799509f [Hari Shreedharan] Fix a compile issue.
3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
d248d22 [harishreedharan] Merge pull request #1 from tdas/flume-polling
10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
8c00289 [Hari Shreedharan] More debug messages
393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
205034d [Hari Shreedharan] Merging master in
4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
2014-07-29 11:11:29 -07:00
Cheng Lian a7a9d14479 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-28 12:07:30 -07:00
Patrick Wendell e5bbce9a60 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit f6ff2a61d0.
2014-07-27 18:46:58 -07:00
Cheng Lian f6ff2a61d0 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1600 from liancheng/jdbc and squashes the following commits:

ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-27 13:03:38 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail.  We should investigate and put this back after its passing tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Burak a4d60208ec [SPARK-2434][MLlib]: Warning messages that point users to original MLlib implementations added to Examples
[SPARK-2434][MLlib]: Warning messages that refer users to the original MLlib implementations of some popular example machine learning algorithms added both in the comments and the code. The following examples have been modified:
Scala:
* LocalALS
* LocalFileLR
* LocalKMeans
* LocalLP
* SparkALS
* SparkHdfsLR
* SparkKMeans
* SparkLR
Python:
 * kmeans.py
 * als.py
 * logistic_regression.py

Author: Burak <brkyvz@gmail.com>

Closes #1515 from brkyvz/SPARK-2434 and squashes the following commits:

7505da9 [Burak] [SPARK-2434][MLlib]: Warning messages added, scalastyle errors fixed, and added missing punctuation
b96b522 [Burak] [SPARK-2434][MLlib]: Warning messages added and scalastyle errors fixed
4762f39 [Burak] [SPARK-2434]: Warning messages added
17d3d83 [Burak] SPARK-2434: Added warning messages to the naive implementations of the example algorithms
2cb5301 [Burak] SPARK-2434: Warning messages redirecting to original implementaions added.
2014-07-21 17:03:40 -07:00
Manish Amde d88f6be446 [MLlib] SPARK-1536: multiclass classification support for decision tree
The ability to perform multiclass classification is a big advantage for using decision trees and was a highly requested feature for mllib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample weights support using WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost which requires instances to be weighted.

It handles the special case where the categorical variables cannot be ordered for multiclass classification and thus the optimizations used for speeding up binary classification cannot be directly used for multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyses all the ```2^(m-1) - 1``` categorical splits provided that #splits are less than the maxBins provided in the input. This condition will not be met for features with large number of categories -- using decision trees is not recommended for such datasets in general since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing bin size of the tree algorithms, use binary encoding for categorical features or use one-vs-all classification strategy) to avoid these constraints.

The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.

cc: mengxr, etrain, hirakendu, atalwalkar, srowen

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #886 from manishamde/multiclass and squashes the following commits:

26f8acc [Manish Amde] another attempt at fixing mima
c5b2d04 [Manish Amde] more MIMA fixes
1ce7212 [Manish Amde] change problem filter for mima
10fdd82 [Manish Amde] fixing MIMA excludes
e1c970d [Manish Amde] merged master
abf2901 [Manish Amde] adding classes to MimaExcludes.scala
45e767a [Manish Amde] adding developer api annotation for overriden methods
c8428c4 [Manish Amde] fixing weird multiline bug
afced16 [Manish Amde] removed label weights support
2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
4e85f2c [Manish Amde] minor: fixed scalastyle issues
b2ae41f [Manish Amde] minor: scalastyle
e4c1321 [Manish Amde] using while loop for regression histograms
d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
0fecd38 [Manish Amde] minor: add newline to EOF
2061cf5 [Manish Amde] merged from master
06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
9cc3e31 [Manish Amde] added implicit conversion import
5c1b2ca [Manish Amde] doc for PointConverter class
485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
3d7f911 [Manish Amde] updated doc
8e44ab8 [Manish Amde] updated doc
adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
e3e8843 [Manish Amde] minor code formatting
23d4268 [Manish Amde] minor: another minor code style
34ee7b9 [Manish Amde] minor: code style
237762d [Manish Amde] renaming functions
12e6d0a [Manish Amde] minor: removing line in doc
9a90c93 [Manish Amde] Merge branch 'master' into multiclass
1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present
f5f6b83 [Manish Amde] multiclass for continous variables
8cfd3b6 [Manish Amde] working for categorical multiclass classification
828ff16 [Manish Amde] added categorical variable test
bce835f [Manish Amde] code cleanup
7e5f08c [Manish Amde] minor doc
1dd2735 [Manish Amde] bin search logic for multiclass
f16a9bb [Manish Amde] fixing while loop
d811425 [Manish Amde] multiclass bin aggregate logic
ab5cb21 [Manish Amde] multiclass logic
d8e4a11 [Manish Amde] sample weights
ed5a2df [Manish Amde] fixed classification requirements
d012be7 [Manish Amde] fixed while loop
18d2835 [Manish Amde] changing default values for num classes
6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
75f2bfc [Manish Amde] minor code style fix
e547151 [Manish Amde] minor modifications
34549d0 [Manish Amde] fixing error during merge
098e8c5 [Manish Amde] merged master
e006f9d [Manish Amde] changing variable names
5c78e1a [Manish Amde] added multiclass support
6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
46e06ee [Manish Amde] minor mods
3f85a17 [Manish Amde] tests for multiclass classification
4d5f70c [Manish Amde] added multiclass support for find splits bins
46f909c [Manish Amde] todo for multiclass support
455bea9 [Manish Amde] fixed tests
14aea48 [Manish Amde] changing instance format to weighted labeled point
a1a6e09 [Manish Amde] added weighted point class
968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-07-18 14:00:13 -07:00
Cheng Hao 29809a6d58 [SPARK-2570] [SQL] Fix the bug of ClassCastException
Exception thrown when running the example of HiveFromSpark.
Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
	at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
	at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45)
	at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1475 from chenghao-intel/hive_from_spark and squashes the following commits:

d4c0500 [Cheng Hao] Fix the bug of ClassCastException
2014-07-17 23:25:01 -07:00
Artjom-Metro ae8ca4dfba SPARK-2427: Fix Scala examples that use the wrong command line arguments index
The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This due to to the fix of JIRA 1565, where these examples were not correctly adapted to the new usage of the submit script.

Author: Artjom-Metro <Artjom-Metro@users.noreply.github.com>
Author: Artjom-Metro <artjom31415@googlemail.com>

Closes #1353 from Artjom-Metro/fix_examples and squashes the following commits:

6111801 [Artjom-Metro] Reduce the default number of iterations
cfaa73c [Artjom-Metro] Fix some examples that use the wrong index to access the command line arguments
2014-07-10 16:03:30 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
Patch introduces the new way of working also retaining the existing ways of doing things.

For example build instruction for yarn in maven is
`mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
in sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
Also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Raymond Liu 2b18ea9826 Clean up SparkKMeans example's code
remove unused code

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1352 from colorant/kmeans and squashes the following commits:

ddcd1dd [Raymond Liu] Clean up SparkKMeans example's code
2014-07-09 23:39:29 -07:00
Neville Li f7ce1b3b48 [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS
Author: Neville Li <neville@spotify.com>

Closes #1319 from nevillelyh/gh/SPARK-1977 and squashes the following commits:

1f0a355 [Neville Li] [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS
2014-07-07 15:06:14 -07:00
Yin Huai d2f4f30b12 [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2060

Programming guide: http://yhuai.github.io/site/sql-programming-guide.html

Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #999 from yhuai/newJson and squashes the following commits:

227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ce8eedd [Yin Huai] rxin's comments.
bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
94ffdaa [Yin Huai] Remove "get" from method names.
ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
79ea9ba [Yin Huai] Fix typos.
5428451 [Yin Huai] Newline
1f908ce [Yin Huai] Remove extra line.
d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
7ea750e [Yin Huai] marmbrus's comments.
6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
83013fb [Yin Huai] Update Java Example.
e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4fbddf0 [Yin Huai] Programming guide.
9df8c5a [Yin Huai] Python API.
7027634 [Yin Huai] Java API.
cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ab810b0 [Yin Huai] Make JsonRDD private.
6df0891 [Yin Huai] Apache header.
8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
8ffed79 [Yin Huai] Update the example.
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
65b87f0 [Yin Huai] Fix sampling...
8846af5 [Yin Huai] API doc.
52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
0387523 [Yin Huai] Address PR comments.
666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
a2313a6 [Yin Huai] Address PR comments.
f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
0576406 [Yin Huai] Add Apache license header.
af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
2014-06-17 19:14:59 -07:00
Tor Myklebust d9203350b0 [SPARK-1672][MLLIB] Separate user and product partitioning in ALS
Some clean up work following #593.

1. Allow to set different number user blocks and number product blocks in `ALS`.
2. Update `MovieLensALS` to reflect the change.

Author: Tor Myklebust <tmyklebu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1014 from mengxr/SPARK-1672 and squashes the following commits:

0e910dd [Xiangrui Meng] change private[this] to private[recommendation]
36420c7 [Xiangrui Meng] set exclusion rules for ALS
9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
9bab77b [Xiangrui Meng] clean up add numUserBlocks and numProductBlocks to MovieLensALS
84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672
d17a8bf [Xiangrui Meng] merge master
a4925fd [Tor Myklebust] Style.
bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar
021f54b [Tor Myklebust] Separate user and product blocks.
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-06-11 18:16:33 -07:00
Nick Pentreath f971d6cb60 SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats
So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.

This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.

# Overview
The basics are as follows:
1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark
2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
4. ```PickleSerializer``` on the Python side deserializes.

This works "out the box" for simple ```Writable```s:
* ```Text```
* ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
* ```NullWritable```
* ```BooleanWritable```
* ```BytesWritable```
* ```MapWritable```

It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans convenstions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added).

I've tested it out with ```ESInputFormat```  as an example and it works very nicely:
```python
conf = {"es.resource" : "index/type" }
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()
```

I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box.

# Some things still outstanding:
1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR

Author: Nick Pentreath <nick.pentreath@gmail.com>

Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:

268df7e [Nick Pentreath] Documentation changes mer @pwendell comments
761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
4c972d8 [Nick Pentreath] Add license headers
d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
cde6af9 [Nick Pentreath] Parameterize converter trait
5ebacfa [Nick Pentreath] Update docs for PySpark input formats
a985492 [Nick Pentreath] Move Converter examples to own package
365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
b65606f [Nick Pentreath] Add converter interface
5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None
085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
1a4a1d6 [Nick Pentreath] Address @mateiz style comments
01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
84fe8e3 [Nick Pentreath] Python programming guide space formatting
d0f52b6 [Nick Pentreath] Python programming guide
7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
93ef995 [Nick Pentreath] Add back context.py changes
9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
35b8e3a [Nick Pentreath] Another fix for test ordering
bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e001b94 [Nick Pentreath] Fix test failures due to ordering
78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
64eb051 [Nick Pentreath] Scalastyle fix
e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
17a656b [Nick Pentreath] remove binary sequencefile for tests
f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
31a2fff [Nick Pentreath] Scalastyle fixes
fc5099e [Nick Pentreath] Add Apache license headers
4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
f6aac55 [Nick Pentreath] Bring back msgpack
9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
65360d5 [Nick Pentreath] Adding test SequenceFiles
0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
d72bf18 [Nick Pentreath] msgpack
dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e67212a [Nick Pentreath] Add back msgpack dependency
f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
97ef708 [Nick Pentreath] Remove old writeToStream
2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
174f520 [Nick Pentreath] Add back graphx settings
703ee65 [Nick Pentreath] Add back msgpack
619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
eb40036 [Nick Pentreath] Remove unused comment lines
4d7ef2e [Nick Pentreath] Fix indentation
f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD
4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up
4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
2014-06-09 22:21:03 -07:00
zsxwing a71c6d1cf0 SPARK-1628: Add missing hashCode methods in Partitioner subclasses
JIRA: https://issues.apache.org/jira/browse/SPARK-1628

Added `hashCode` in HashPartitioner, RangePartitioner, PythonPartitioner and PageRankUtils.CustomPartitioner.

Author: zsxwing <zsxwing@gmail.com>

Closes #549 from zsxwing/SPARK-1628 and squashes the following commits:

2620936 [zsxwing] SPARK-1628: Add missing hashCode methods in Partitioner subclasses
2014-06-08 14:18:52 -07:00
Takuya UESHIN 7c160293d6 [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits:

e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
2014-06-05 11:27:33 -07:00
Xiangrui Meng 189df165bb [SPARK-1752][MLLIB] Standardize text format for vectors and labeled points
We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following:

1. dense vector: `[v0,v1,..]`
2. sparse vector: `(size,[i0,i1],[v0,v1])`
3. labeled point: `(label,vector)`

where "(..)" indicates a tuple and "[...]" indicate an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically.

`MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`.

CC: @mateiz, @srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #685 from mengxr/labeled-io and squashes the following commits:

2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1
297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility
d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io
56746ea [Xiangrui Meng] replace # by .
623a5f0 [Xiangrui Meng] merge master
f06d5ba [Xiangrui Meng] add docs and minor updates
640fe0c [Xiangrui Meng] throw SparkException
5bcfbc4 [Xiangrui Meng] update test to add scientific notations
e86bf38 [Xiangrui Meng] remove NumericTokenizer
050fca4 [Xiangrui Meng] use StringTokenizer
6155b75 [Xiangrui Meng] merge master
f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark
a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation
ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests
e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests
aea4ae3 [Xiangrui Meng] minor updates
810d6df [Xiangrui Meng] update tokenizer/parser implementation
7aac03a [Xiangrui Meng] remove Scala parsers
c1885c1 [Xiangrui Meng] add headers and minor changes
b0c50cb [Xiangrui Meng] add customized parser
d731817 [Xiangrui Meng] style update
63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark
ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io
cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint
a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors
5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors
7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__
e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData
9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints
19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint
2014-06-04 12:56:56 -07:00
Joseph E. Gonzalez 894ecde04f Synthetic GraphX Benchmark
This PR accomplishes two things:

1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph.  This can be used to profile GraphX system on arbitrary clusters without access to large graph datasets

2. This PR improves the implementation of the log-normal graph generator.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Author: Ankur Dave <ankurdave@gmail.com>

Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:

e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
bccccad [Ankur Dave] Fix long lines
374678a [Ankur Dave] Bugfix and style changes
1bdf39a [Joseph E. Gonzalez] updating options
d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
2014-06-03 14:14:48 -07:00
baishuo(白硕) aa41a522d8 fix java.lang.ClassCastException
get Exception when run:bin/run-example org.apache.spark.examples.sql.RDDRelation
Exception's detail is:
Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
	at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
	at org.apache.spark.examples.sql.RDDRelation$.main(RDDRelation.scala:49)
	at org.apache.spark.examples.sql.RDDRelation.main(RDDRelation.scala)
change sql("SELECT COUNT(*) FROM records").collect().head.getInt(0) to sql("SELECT COUNT(*) FROM records").collect().head.getLong(0), then the Exception do not occur any more

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #949 from baishuo/master and squashes the following commits:

f4b319f [baishuo(白硕)] fix java.lang.ClassCastException
2014-06-03 13:39:47 -07:00
Reynold Xin d79c2b28e1 Fix PEP8 violations in examples/src/main/python.
Author: Reynold Xin <rxin@apache.org>

Closes #870 from rxin/examples-python-pep8 and squashes the following commits:

2829e84 [Reynold Xin] Fix PEP8 violations in examples/src/main/python.
2014-05-25 14:48:27 -07:00
Xiangrui Meng bcb9dce6f4 [SPARK-1874][MLLIB] Clean up MLlib sample data
1. Added synthetic datasets for `MovieLensALS`, `LinearRegression`, `BinaryClassification`.
2. Embedded instructions in the help message of those example apps.

Per discussion with Matei on the JIRA page, new example data is under `data/mllib`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #833 from mengxr/mllib-sample-data and squashes the following commits:

59f0a18 [Xiangrui Meng] add sample binary classification data
3c2f92f [Xiangrui Meng] add linear regression data
050f1ca [Xiangrui Meng] add a sample dataset for MovieLensALS example
2014-05-19 21:29:33 -07:00
Andrew Or cf6cbe9f76 [SPARK-1824] Remove <master> from Python examples
A recent PR (#552) fixed this for all Scala / Java examples. We need to do it for python too.

Note that this blocks on #799, which makes `bin/pyspark` go through Spark submit. With only the changes in this PR, the only way to run these examples is through Spark submit. Once #799 goes in, you can use `bin/pyspark` to run them too. For example,

```
bin/pyspark examples/src/main/python/pi.py 100 --master local-cluster[4,1,512]
```

Author: Andrew Or <andrewor14@gmail.com>

Closes #802 from andrewor14/python-examples and squashes the following commits:

cf50b9f [Andrew Or] De-indent python comments (minor)
50f80b1 [Andrew Or] Remove pyFiles from SparkContext construction
c362f69 [Andrew Or] Update docs to use spark-submit for python applications
7072c6a [Andrew Or] Merge branch 'master' of github.com:apache/spark into python-examples
427a5f0 [Andrew Or] Update docs
d32072c [Andrew Or] Remove <master> from examples + update usages
2014-05-16 22:36:23 -07:00
Tathagata Das 68f28dabe9 Fixed streaming examples docs to use run-example instead of spark-submit
Pretty self-explanatory

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #722 from tdas/example-fix and squashes the following commits:

7839979 [Tathagata Das] Minor changes.
0673441 [Tathagata Das] Fixed java docs of java streaming example
e687123 [Tathagata Das] Fixed scala style errors.
9b8d112 [Tathagata Das] Fixed streaming examples docs to use run-example instead of spark-submit.
2014-05-14 04:17:32 -07:00
Sean Owen 2b7bd29eb6 SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure
TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure.

I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?)

velvia notes:
"I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."

There are at least 3 versions of Netty in play in the build:

- the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
- the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
- but, Spark Core directly uses io.netty:netty-all:4.0.17.Final

The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.

The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.

But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.

If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.

So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:

- Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
- Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
- Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
- Update SBT build accordingly

A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.

Author: Sean Owen <sowen@cloudera.com>

Closes #723 from srowen/SPARK-1789 and squashes the following commits:

43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
2014-05-10 20:50:40 -07:00
Matei Zaharia 7eefc9d2b3 SPARK-1708. Add a ClassTag on Serializer and things that depend on it
This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.

One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.

CC @rxin, @pwendell, @heathermiller

Author: Matei Zaharia <matei@databricks.com>

Closes #700 from mateiz/spark-1708 and squashes the following commits:

1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
3b449ed [Matei Zaharia] test fix
2209a27 [Matei Zaharia] Code style fixes
9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
2014-05-10 12:10:24 -07:00
Evan Sparks 5c5e7d5809 Fixing typo in als.py
XtY should be Xty.

Author: Evan Sparks <evan.sparks@gmail.com>

Closes #696 from etrain/patch-2 and squashes the following commits:

634cb8d [Evan Sparks] Fixing typo in als.py
2014-05-08 13:07:30 -07:00
Prashant Sharma 44dd57fb66 SPARK-1565, update examples to be used with spark-submit script.
Commit for initial feedback, basically I am curious if we should prompt user for providing args esp. when its mandatory. And can we skip if they are not ?

Also few other things that did not work like
`bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2`

Not all the args get passed properly, may be I have messed up something will try to sort it out hopefully.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:

669dd23 [Prashant Sharma] Review comments
2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
2014-05-08 10:23:05 -07:00
Evan Sparks 6ed7e2cd01 Use numpy directly for matrix multiply.
Using matrix multiply to compute XtX and XtY yields a 5-20x speedup depending on problem size.

For example - the following takes 19s locally after this change vs. 5m21s before the change. (16x speedup).
bin/pyspark examples/src/main/python/als.py local[8] 1000 1000 50 10 10

Author: Evan Sparks <evan.sparks@gmail.com>

Closes #687 from etrain/patch-1 and squashes the following commits:

e094dbc [Evan Sparks] Touching only diaganols on update.
d1ab9b6 [Evan Sparks] Use numpy directly for matrix multiply.
2014-05-08 00:24:36 -04:00
Sandeep 108c4c16cc SPARK-1668: Add implicit preference as an option to examples/MovieLensALS
Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/

Author: Sandeep <sandeep@techaddict.me>

Closes #597 from techaddict/SPARK-1668 and squashes the following commits:

8b371dc [Sandeep] Second Pass on reviews by mengxr
eca9d37 [Sandeep] based on mengxr's suggestions
937e54c [Sandeep] Changes
5149d40 [Sandeep] Changes based on review
1dd7657 [Sandeep] use mean()
42444d7 [Sandeep] Based on Suggestions by mengxr
e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/
2014-05-08 00:15:05 -04:00
Manish Amde f269b016ac SPARK-1544 Add support for deep decision trees.
@etrain and I came with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels.

To summarize:
1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.

cc: @atalwalkar, @hirakendu, @mengxr

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #475 from manishamde/deep_tree and squashes the following commits:

968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-05-07 17:08:38 -07:00
Sandeep fdae095de2 [HOTFIX] SPARK-1637: There are some Streaming examples added after the PR #571 was last updated.
This resulted in Compilation Errors.
cc @mateiz project not compiling currently.

Author: Sandeep <sandeep@techaddict.me>

Closes #673 from techaddict/SPARK-1637-HOTFIX and squashes the following commits:

b512f4f [Sandeep] [SPARK-1637][HOTFIX] There are some Streaming examples added after the PR #571 was last updated. This resulted in Compilation Errors.
2014-05-06 21:55:05 -07:00
Sandeep a000b5c3b0 SPARK-1637: Clean up examples for 1.0
- [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
- [x] Move Python examples into examples/src/main/python
- [x] Update docs to reflect these changes

Author: Sandeep <sandeep@techaddict.me>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #571 from techaddict/SPARK-1637 and squashes the following commits:

47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
5f96121 [Sandeep] Move Python examples into examples/src/main/python
0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
2014-05-06 17:27:52 -07:00
Xiangrui Meng 98750a74da [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide
Final pass before the v1.0 release.

* Remove `VectorRDDs`
* Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
* Change default value of `addIntercept` to false and allow to add intercept in Ridge and Lasso.
* Clean `DecisionTree` package doc and test suite.
* Mark model constructors `private[spark]`
* Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
* Add `saveAsLibSVMFile`.
* Add `appendBias` to `MLUtils`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #524 from mengxr/mllib-cleaning and squashes the following commits:

295dc8b [Xiangrui Meng] update loadLibSVMFile doc
1977ac1 [Xiangrui Meng] fix doc of appendBias
649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
54b812c [Xiangrui Meng] add appendBias
a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
9b02b93 [Xiangrui Meng] minor code style update
a593ddc [Xiangrui Meng] fix python tests
fc28c18 [Xiangrui Meng] mark more classes experimental
f6cbbff [Xiangrui Meng] fix Java tests
0af70b0 [Xiangrui Meng] minor
6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
c81807f [Xiangrui Meng] set the default value of AddIntercept to false
03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
c66c56f [Xiangrui Meng] move tree md to package object doc
a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain
2014-05-05 18:32:54 -07:00
Tathagata Das a975a19f21 [SPARK-1504], [SPARK-1505], [SPARK-1558] Updated Spark Streaming guide
- SPARK-1558: Updated custom receiver guide to match it with the new API
- SPARK-1504: Added deployment and monitoring subsection to streaming
- SPARK-1505: Added migration guide for migrating from 0.9.x and below to Spark 1.0
- Updated various Java streaming examples to use JavaReceiverInputDStream to highlight the API change.
- Removed the requirement for cleaner ttl from streaming guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #652 from tdas/doc-fix and squashes the following commits:

cb4f4b7 [Tathagata Das] Possible fix for flaky graceful shutdown test.
ab71f7f [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into doc-fix
8d6ff9b [Tathagata Das] Addded migration guide to Spark Streaming.
7d171df [Tathagata Das] Added reference to JavaReceiverInputStream in examples and streaming guide.
49edd7c [Tathagata Das] Change java doc links to use Java docs.
11528d7 [Tathagata Das] Updated links on index page.
ff80970 [Tathagata Das] More updates to streaming guide.
4dc42e9 [Tathagata Das] Added monitoring and other documentation in the streaming guide.
14c6564 [Tathagata Das] Updated custom receiver guide.
2014-05-05 15:28:19 -07:00
WangTao 7025dda8fa Handle the vals that never used
In XORShiftRandom.scala, use val "million" instead of constant "1e6.toInt".
Delete vals that never used in other files.

Author: WangTao <barneystinson@aliyun.com>

Closes #565 from WangTaoTheTonic/master and squashes the following commits:

17cacfc [WangTao] Handle the unused assignment, method parameters and symbol inspected by Intellij IDEA
37b4090 [WangTao] Handle the vals that never used
2014-04-29 22:07:20 -07:00
Xiangrui Meng 3f38334f44 [SPARK-1636][MLLIB] Move main methods to examples
* `NaiveBayes` -> `SparseNaiveBayes`
* `KMeans` -> `DenseKMeans`
* `SVMWithSGD` and `LogisticRegerssionWithSGD` -> `BinaryClassification`
* `ALS` -> `MovieLensALS`
* `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression`
* `DecisionTree` -> `DecisionTreeRunner`

`scopt` is used for parsing command-line parameters. `scopt` has MIT license and it only depends on `scala-library`.

Example help message:

~~~
BinaryClassification: an example app for binary classification.
Usage: BinaryClassification [options] <input>

  --numIterations <value>
        number of iterations
  --stepSize <value>
        initial step size, default: 1.0
  --algorithm <value>
        algorithm (SVM,LR), default: LR
  --regType <value>
        regularization type (L1,L2), default: L2
  --regParam <value>
        regularization parameter, default: 0.1
  <input>
        input paths to labeled examples in LIBSVM format
~~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #584 from mengxr/mllib-main and squashes the following commits:

7b58c60 [Xiangrui Meng] minor
6e35d7e [Xiangrui Meng] make imports explicit and fix code style
c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit
6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner
be86069 [Xiangrui Meng] use main instead of extending App
b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples
8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params
fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example
67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example
b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example
b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params
577945b [Xiangrui Meng] remove unused imports from NB
3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification
f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app
01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main
9420692 [Xiangrui Meng] add scopt to examples dependencies
2014-04-29 00:41:03 -07:00
witgo 030f2c2126 Improved build configuration
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces)

Author: witgo <witgo@qq.com>

Closes #480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
2014-04-28 22:51:46 -07:00
Tathagata Das 1d84964bf8 [SPARK-1633][Streaming] Java API unit test and example for custom streaming receiver in Java
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #558 from tdas/more-fixes and squashes the following commits:

c0c84e6 [Tathagata Das] Removing extra println()
d8a8cf4 [Tathagata Das] More tweaks to make unit test work in Jenkins.
b7caa98 [Tathagata Das] More tweaks.
d337367 [Tathagata Das] More tweaks
22d6f2d [Tathagata Das] Merge remote-tracking branch 'apache/master' into more-fixes
40a961b [Tathagata Das] Modified java test to reduce flakiness.
9410ca6 [Tathagata Das] Merge remote-tracking branch 'apache/master' into more-fixes
86d9147 [Tathagata Das] scala style fix
2f3d7b1 [Tathagata Das] Added Scala custom receiver example.
d677611 [Tathagata Das] Merge remote-tracking branch 'apache/master' into more-fixes
bec3fc2 [Tathagata Das] Added license.
51d6514 [Tathagata Das] Fixed docs on receiver.
81aafa0 [Tathagata Das] Added Java test for Receiver API, and added JavaCustomReceiver example.
2014-04-28 13:58:09 -07:00
baishuo(白硕) 8aaef5c756 Update KafkaWordCount.scala
modify the required args number

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #523 from baishuo/master and squashes the following commits:

0368ba9 [baishuo(白硕)] Update KafkaWordCount.scala
2014-04-25 13:18:49 -07:00
Mridul Muralidharan 968c0187a1 SPARK-1586 Windows build fixes
Unfortunately, this is not exhaustive - particularly hive tests still fail due to path issues.

Author: Mridul Muralidharan <mridulm80@apache.org>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #505 from mridulm/windows_fixes and squashes the following commits:

ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy appparently
cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
3267f4b [Mridul Muralidharan] Fix build failures
35b277a [Mridul Muralidharan] Fix Scalastyle failures
bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
1337abd [Mridul Muralidharan] fix classpath while running in windows
2014-04-24 20:48:33 -07:00
Sandeep a03ac222d8 Fix Scala Style
Any comments are welcome

Author: Sandeep <sandeep@techaddict.me>

Closes #531 from techaddict/stylefix-1 and squashes the following commits:

7492730 [Sandeep] Pass 4
98b2428 [Sandeep] fix rxin suggestions
b5e2e6f [Sandeep] Pass 3
05932d7 [Sandeep] fix if else styling 2
08690e5 [Sandeep] fix if else styling
2014-04-24 15:07:23 -07:00
Patrick Wendell cd4ed29326 SPARK-1119 and other build improvements
1. Makes assembly and examples jar naming consistent in maven/sbt.
2. Updates make-distribution.sh to use Maven and fixes some bugs.
3. Updates the create-release script to call make-distribution script.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #502 from pwendell/make-distribution and squashes the following commits:

1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements
2014-04-23 10:19:32 -07:00
Michael Armbrust 39f85e0322 [SQL] SPARK-1571 Mistake in java example code
Author: Michael Armbrust <michael@databricks.com>

Closes #496 from marmbrus/javaBeanBug and squashes the following commits:

644fedd [Michael Armbrust] Bean methods must be public.
2014-04-22 22:19:32 -07:00
Patrick Wendell 83084d3b7b SPARK-1496: Have jarOfClass return Option[String]
A simple change, mostly had to change a bunch of example code.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #438 from pwendell/jar-of-class and squashes the following commits:

aa010ff [Patrick Wendell] SPARK-1496: Have jarOfClass return Option[String]
2014-04-22 00:42:16 -07:00
Tathagata Das 04c37b6f74 [SPARK-1332] Improve Spark Streaming's Network Receiver and InputDStream API [WIP]
The current Network Receiver API makes it slightly complicated to right a new receiver as one needs to create an instance of BlockGenerator as shown in SocketReceiver
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51

Exposing the BlockGenerator interface has made it harder to improve the receiving process. The API of NetworkReceiver (which was not a very stable API anyways) needs to be change if we are to ensure future stability.

Additionally, the functions like streamingContext.socketStream that create input streams, return DStream objects. That makes it hard to expose functionality (say, rate limits) unique to input dstreams. They should return InputDStream or NetworkInputDStream. This is still not yet implemented.

This PR is blocked on the graceful shutdown PR #247

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #300 from tdas/network-receiver-api and squashes the following commits:

ea27b38 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3a4777c [Tathagata Das] Renamed NetworkInputDStream to ReceiverInputDStream, and ActorReceiver related stuff.
838dd39 [Tathagata Das] Added more events to the StreamingListener to report errors and stopped receivers.
a75c7a6 [Tathagata Das] Address some PR comments and fixed other issues.
91bfa72 [Tathagata Das] Fixed bugs.
8533094 [Tathagata Das] Scala style fixes.
028bde6 [Tathagata Das] Further refactored receiver to allow restarting of a receiver.
43f5290 [Tathagata Das] Made functions that create input streams return InputDStream and NetworkInputDStream, for both Scala and Java.
2c94579 [Tathagata Das] Fixed graceful shutdown by removing interrupts on receiving thread.
9e37a0b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3223e95 [Tathagata Das] Refactored the code that runs the NetworkReceiver into further classes and traits to make them more testable.
a36cc48 [Tathagata Das] Refactored the NetworkReceiver API for future stability.
2014-04-21 19:04:49 -07:00
Sandeep 6ad4c5498d SPARK-1462: Examples of ML algorithms are using deprecated APIs
This will also fix SPARK-1464: Update MLLib Examples to Use Breeze.

Author: Sandeep <sandeep@techaddict.me>

Closes #416 from techaddict/1462 and squashes the following commits:

a43638e [Sandeep] Some Style Changes
3ce69c3 [Sandeep] Fix Ordering and Naming of Imports in Examples
6c7e543 [Sandeep] SPARK-1462: Examples of ML algorithms are using deprecated APIs
2014-04-16 18:23:07 -07:00
Sean Owen 0247b5c546 SPARK-1488. Resolve scalac feature warnings during build
For your consideration: scalac currently notes a number of feature warnings during compilation:

```
[warn] there were 65 feature warning(s); re-run with -feature for details
```

Warnings are like:

```
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn]   implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn]                ^
```

scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used.

This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.

Author: Sean Owen <sowen@cloudera.com>

Closes #404 from srowen/SPARK-1488 and squashes the following commits:

8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features.
39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
2014-04-14 19:50:00 -07:00
Sandeep e55cc4bae5 SPARK-1446: Spark examples should not do a System.exit
Spark examples should exit nice using SparkContext.stop() method, rather than System.exit
System.exit can cause issues like in SPARK-1407

Author: Sandeep <sandeep@techaddict.me>

Closes #370 from techaddict/1446 and squashes the following commits:

e9234cf [Sandeep] SPARK-1446: Spark examples should not do a System.exit Spark examples should exit nice using SparkContext.stop() method, rather than System.exit System.exit can cause issues like in SPARK-1407
2014-04-10 00:37:21 -07:00
Kan Zhang eb5f2b6423 SPARK-1407 drain event queue before stopping event logger
Author: Kan Zhang <kzhang@apache.org>

Closes #366 from kanzhang/SPARK-1407 and squashes the following commits:

cd0629f [Kan Zhang] code refactoring and adding test
b073ee6 [Kan Zhang] SPARK-1407 drain event queue before stopping event logger
2014-04-09 15:25:29 -07:00
Xiangrui Meng 9689b663a2 [SPARK-1390] Refactoring of matrices backed by RDDs
This is to refactor interfaces for matrices backed by RDDs. It would be better if we have a clear separation of local matrices and those backed by RDDs. Right now, we have

1. `org.apache.spark.mllib.linalg.SparseMatrix`, which is a wrapper over an RDD of matrix entries, i.e., coordinate list format.
2. `org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix`, which is a wrapper over RDD[Array[Double]], i.e. row-oriented format.

We will see naming collision when we introduce local `SparseMatrix`, and the name `TallSkinnyDenseMatrix` is not exact if we switch to `RDD[Vector]` from `RDD[Array[Double]]`. It would be better to have "RDD" in the class name to suggest that operations may trigger jobs.

The proposed names are (all under `org.apache.spark.mllib.linalg.rdd`):

1. `RDDMatrix`: trait for matrices backed by one or more RDDs
2. `CoordinateRDDMatrix`: wrapper of `RDD[(Long, Long, Double)]`
3. `RowRDDMatrix`: wrapper of `RDD[Vector]` whose rows do not have special ordering
4. `IndexedRowRDDMatrix`: wrapper of `RDD[(Long, Vector)]` whose rows are associated with indices

The current code also introduces local matrices.

Author: Xiangrui Meng <meng@databricks.com>

Closes #296 from mengxr/mat and squashes the following commits:

24d8294 [Xiangrui Meng] fix for groupBy returning Iterable
bfc2b26 [Xiangrui Meng] merge master
8e4f1f5 [Xiangrui Meng] Merge branch 'master' into mat
0135193 [Xiangrui Meng] address Reza's comments
03cd7e1 [Xiangrui Meng] add pca/gram to IndexedRowMatrix add toBreeze to DistributedMatrix for test simplify tests
b177ff1 [Xiangrui Meng] address Matei's comments
be119fe [Xiangrui Meng] rename m/n to numRows/numCols for local matrix add tests for matrices
b881506 [Xiangrui Meng] rename SparkPCA/SVD to TallSkinnyPCA/SVD
e7d0d4a [Xiangrui Meng] move IndexedRDDMatrixRow to IndexedRowRDDMatrix
0d1491c [Xiangrui Meng] fix test errors
a85262a [Xiangrui Meng] rename RDDMatrixRow to IndexedRDDMatrixRow
b8b6ac3 [Xiangrui Meng] Remove old code
4cf679c [Xiangrui Meng] port pca to RowRDDMatrix, and add multiply and covariance
7836e2f [Xiangrui Meng] initial refactoring of matrices backed by RDDs
2014-04-08 23:01:15 -07:00
Holden Karau ce8ec54561 Spark 1271: Co-Group and Group-By should pass Iterable[X]
Author: Holden Karau <holden@pigscanfly.ca>

Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:

f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
77048f8 [Holden Karau] Fix merge up to master
d3fe909 [Holden Karau] use toSeq instead
7a092a3 [Holden Karau] switch resultitr to resultiterable
eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
c5075aa [Holden Karau] If guava 14 had iterables
2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
11e730c [Holden Karau] Fix streaming tests
66b583d [Holden Karau] Fix the core test suite to compile
4ed579b [Holden Karau] Refactor from iterator to iterable
d052c07 [Holden Karau] Python tests now pass with iterator pandas
3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
cd1e81c [Holden Karau] Try and make pickling list iterators work
c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
a5ee714 [Holden Karau] oops, was checking wrong iterator
e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
ec8cc3e [Holden Karau] Fix test issues\!
4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
b692868 [Holden Karau] Revert
7e533f7 [Holden Karau] Fix the bug
8a5153a [Holden Karau] Revert me, but we have some stuff to debug
b4e86a9 [Holden Karau] Add a join based on the problem in SVD
c4510e2 [Holden Karau] Revert this but for now put things in list pandas
b4e0b1d [Holden Karau] Fix style issues
71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
37888ec [Holden Karau] core/tests now pass
249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
fe992fe [Holden Karau] hmmm try and fix up basic operation suite
172705c [Holden Karau] Fix Java API suite
caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
4991af6 [Holden Karau] Fix some tests
be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
2014-04-08 18:15:59 -07:00
Sean Owen 856c50f59b SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions
Another handful of small build changes to organize and standardize a bit, and avoid warnings:

- Update Maven plugin versions for good measure
- Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move in lock step
- ... and make this consistent between Maven/SBT

OK, I also updated the JIRA URL while I was at it here.

Author: Sean Owen <sowen@cloudera.com>

Closes #291 from srowen/SPARK-1387 and squashes the following commits:

461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one
c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
2014-04-06 17:41:01 -07:00
Michael Armbrust 8de038eb36 [SQL] SPARK-1366 Consistent sql function across different types of SQLContexts
Now users who want to use HiveQL should explicitly say `hiveql` or `hql`.

Author: Michael Armbrust <michael@databricks.com>

Closes #319 from marmbrus/standardizeSqlHql and squashes the following commits:

de68d0e [Michael Armbrust] Fix sampling test.
fbe4a54 [Michael Armbrust] Make `sql` always use spark sql parser, users of hive context can now use hql or hiveql to run queries using HiveQL instead.
2014-04-04 21:15:33 -07:00
Haoyuan Li b50ddfde03 SPARK-1305: Support persisting RDD's directly to Tachyon
Move the PR#468 of apache-incubator-spark to the apache-spark
"Adding an option to persist Spark RDD blocks into Tachyon."

Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>

Closes #158 from RongGu/master and squashes the following commits:

72b7768 [Haoyuan Li] merge master
9f7fa1b [Haoyuan Li] fix code style
ae7834b [Haoyuan Li] minor cleanup
a8b3ec6 [Haoyuan Li] merge master branch
e0f4891 [Haoyuan Li] better check offheap.
55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
7cd4600 [RongGu] remove some logic code for tachyonstore's replication
51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
8adfcfa [RongGu] address arron's comment on inTachyonSize
120e48a [RongGu] changed the root-level dir name in Tachyon
5cc041c [Haoyuan Li] address aaron's comments
9b97935 [Haoyuan Li] address aaron's comments
d9a6438 [Haoyuan Li] fix for pspark
77d2703 [Haoyuan Li] change python api.git status
3dcace4 [Haoyuan Li] address matei's comments
91fa09d [Haoyuan Li] address patrick's comments
589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
64348b2 [Haoyuan Li] update conf docs.
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
49cc724 [Haoyuan Li] update docs with off_headp option
4572f9f [RongGu] reserving the old apply function API of StorageLevel
04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
939e467 [Haoyuan Li] 0.4.1-thrift from maven central
86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
d827250 [RongGu] fix JsonProtocolSuie test failure
716e93b [Haoyuan Li] revert the version
ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
2825a13 [RongGu] up-merging to the current master branch of the apache spark
6a22c1a [Haoyuan Li] fix scalastyle
8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
1dcadf9 [Haoyuan Li] typo
bf278fa [Haoyuan Li] fix python tests
e82909c [Haoyuan Li] minor cleanup
776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
8859371 [Haoyuan Li] various minor fixes and clean up
e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
fcaeab2 [Haoyuan Li] address Aaron's comment
e554b1e [Haoyuan Li] add python code
47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
dc8ef24 [Haoyuan Li] add old storelevel constructor
e01a271 [Haoyuan Li] update tachyon 0.4.1
8011a96 [RongGu] fix a brought-in mistake in StorageLevel
70ca182 [RongGu] a bit change in comment
556978b [RongGu] fix the scalastyle errors
791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
2014-04-04 20:38:20 -07:00
Michael Armbrust b8f534196f [SQL] SPARK-1333 First draft of java API
WIP: Some work remains...
 * [x] Hive support
 * [x] Tests
 * [x] Update docs

Feedback welcome!

Author: Michael Armbrust <michael@databricks.com>

Closes #248 from marmbrus/javaSchemaRDD and squashes the following commits:

b393913 [Michael Armbrust] @srowen 's java style suggestions.
f531eb1 [Michael Armbrust] Address matei's comments.
33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
822f626 [Michael Armbrust] improve docs.
ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
0b859c8 [Michael Armbrust] First draft of java API.
2014-04-03 15:45:34 -07:00
Xiangrui Meng 9c65fa76f9 [SPARK-1212, Part II] Support sparse data in MLlib
In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:

1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
4. Add libSVMFile to MLContext.
5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
6. Gradient computation no longer creates temp vectors.
7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.

TODO:
1. ~~Use axpy when possible.~~
2. ~~Optimize Naive Bayes.~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #245 from mengxr/vector and squashes the following commits:

eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
11999c7 [Xiangrui Meng] Merge branch 'master' into vector
f7da54b [Xiangrui Meng] add minSplits to libSVMFile
da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
493f26f [Xiangrui Meng] Merge branch 'master' into vector
7c1bc01 [Xiangrui Meng] add a TODO to NB
b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
4addc50 [Xiangrui Meng] merge master
4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
d088552 [Xiangrui Meng] use static constructor for MLContext
6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
0f8759b [Xiangrui Meng] minor updates to NB
b11659c [Xiangrui Meng] style update
78c4671 [Xiangrui Meng] add libSVMFile to MLContext
f0fe616 [Xiangrui Meng] add a test for sparse linear regression
44733e1 [Xiangrui Meng] use in-place gradient computation
e981396 [Xiangrui Meng] use axpy in Updater
db808a1 [Xiangrui Meng] update JavaLR example
befa592 [Xiangrui Meng] passed scala/java tests
75c83a4 [Xiangrui Meng] passed test compile
1859701 [Xiangrui Meng] passed compile
834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
135ab72 [Xiangrui Meng] merge glm
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
2014-04-02 14:01:12 -07:00
Prashant Sharma d666053679 SPARK-1352 - Comment style single space before ending */ check.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #261 from ScrapCodes/comment-style-check2 and squashes the following commits:

6cde61e [Prashant Sharma] comment style space before ending */ check.
2014-03-30 10:06:56 -07:00
Prashant Sharma 60abc25254 SPARK-1096, a space after comment start style checker.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits:

214135a [Prashant Sharma] Review feedback.
5eba88c [Prashant Sharma] Fixed style checks for ///+ comments.
e54b2f8 [Prashant Sharma] improved message, work around.
83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since scalastyle sbt plugin already depends on the right version. Incase we update the plugin we will have to adjust our spark-style project to depend on right scalastyle version.
810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker.
ba33193 [Prashant Sharma] scala style as a project
2014-03-28 00:21:49 -07:00
Xiangrui Meng 80c29689ae [SPARK-1212] Adding sparse data support and update KMeans
Continue our discussions from https://github.com/apache/incubator-spark/pull/575

This PR is WIP because it depends on a SNAPSHOT version of breeze.

Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.

@fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.

I'm going to update this PR to include:

1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time.

2. Some numbers about the performance.

3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!

Author: Xiangrui Meng <meng@databricks.com>

Closes #117 from mengxr/sparse-kmeans and squashes the following commits:

67b368d [Xiangrui Meng] fix SparseVector.toArray
5eda0de [Xiangrui Meng] update NOTICE
67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
b28ba2f [Xiangrui Meng] add toArray to Vector
e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
72bde33 [Xiangrui Meng] clean up code for distance computation
712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
27858e4 [Xiangrui Meng] update breeze version to 0.7
07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
42b4e50 [Xiangrui Meng] line feed at the end
a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
0107e19 [Xiangrui Meng] update NOTICE
87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
4e7d5ca [Xiangrui Meng] minor style update
07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
2014-03-23 17:34:02 -07:00
Michael Armbrust 9aadcffabd SPARK-1251 Support for optimizing and executing structured queries
This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL.

*This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.*

The code is broken into three primary components:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs.  This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files.
 - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes.  There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.

A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251).

[An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review.

With this PR comes support for inferring the schema of existing RDDs that contain case classes.  Using this information, developers can now express structured queries that are automatically compiled into RDD operations.

```scala
// Define the schema using a case class.
case class Person(name: String, age: Int)
val people: RDD[Person] =
  sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt))

// The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd
```

RDDs can also be registered as Tables, allowing SQL queries to be written over them.
```scala
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19")
```

The results of queries are themselves RDDs and support standard RDD operations:
```scala
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```

Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL.
```scala
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT key, value FROM src").collect().foreach(println)
```

## Relationship to Shark

Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huaiyin.thu@gmail.com>
Author: Reynold Xin <rxin@apache.org>
Author: Lian, Cheng <rhythm.mail@gmail.com>
Author: Andre Schumacher <andre.schumacher@iki.fi>
Author: Yin Huai <huai@cse.ohio-state.edu>
Author: Timothy Chen <tnachen@gmail.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Timothy Chen <tnachen@apache.org>
Author: Henry Cook <henry.m.cook+github@gmail.com>
Author: Mark Hamstra <markhamstra@gmail.com>

Closes #146 from marmbrus/catalyst and squashes the following commits:

458bd1b [Michael Armbrust] Update people.txt
0d638c3 [Michael Armbrust] Typo fix from @ash211.
bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests
778299a [Michael Armbrust] Fix some old links to spark-project.org
fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations.  This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user.  This change also makes it slightly less verbose to run language integrated queries.
fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD.
48a99bc [Michael Armbrust] Address first round of feedback.
461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour
adcf1a4 [Henry Cook] Update sql-programming-guide.md
9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins.
6978dd8 [Michael Armbrust] update docs, add apache license
1d0eb63 [Michael Armbrust] update changes with spark core
e5e1d6b [Michael Armbrust] Remove travis configuration.
c2efad6 [Michael Armbrust] First draft of SQL documentation.
013f62a [Michael Armbrust] Fix documentation / code style.
c01470f [Michael Armbrust] Clean up example
2f22454 [Michael Armbrust] WIP: Parquet example.
ce8073b [Michael Armbrust] clean up implicits.
f7d992d [Michael Armbrust] Naming / spelling.
9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext.
d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar.  Create a separate, optional Hive assembly that is used when present.
8b35e0a [Michael Armbrust] address feedback, work on DSL
5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes
f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation
1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven
3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes
3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context.
7233a74 [Michael Armbrust] initial support for maven builds
f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven
7386a9f [Michael Armbrust] Initial example programs using spark sql.
aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow
7ca4b4e [Andre Schumacher] Improving checks in Parquet tests
5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport
54637ec [Andre Schumacher] First part of second round of code review feedback
c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows
ba28849 [Michael Armbrust] code review comments.
d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization.
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates.  Also add a constructor so that it can be serialized out-of-the-box.
959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream.
d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path.
c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support
3c3f962 [Michael Armbrust] Fix a bug due to array reuse.  This will need to be revisited after we merge the mutable row PR.
7d0f13e [Michael Armbrust] Update parquet support with master.
9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support
0040ae6 [Andre Schumacher] Feedback from code review
1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow
70e489d [Cheng Lian] Fixed a spelling typo
6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object.
8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly
99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval
7b9d142 [Michael Armbrust] Update travis to increase permgen size.
da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS.
6fdefe6 [Michael Armbrust] Port sbt improvements from master.
296fe50 [Michael Armbrust] Address review feedback.
d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework.
3bda72d [Andre Schumacher] Adding license banner to new files
3ac9eb0 [Andre Schumacher] Rebasing to new main branch
c863bed [Andre Schumacher] Codestyle checks
61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite
3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala
3a0a552 [Andre Schumacher] Reorganizing Parquet table operations
18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables
75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies
f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection
6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan
0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport
a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport
6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core
eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase
99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types
b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types
c334386 [Michael Armbrust] Initial support for generating schema's based on case classes.
608a29e [Michael Armbrust] Add hive as a repl dependency
7413ac2 [Michael Armbrust] make test downloading quieter.
4d57d0e [Michael Armbrust] Fix test execution on travis.
5f2963c [Michael Armbrust] naming and continuous compilation fixes.
f5e7492 [Michael Armbrust] Add Apache license.  Make naming more consistent.
3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject.
2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes
d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query.
24eaa79 [Michael Armbrust] fix > 100 chars
6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL.
df88f01 [Michael Armbrust] add a simple test for aggregation
18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data.
b922511 [Michael Armbrust] Fix insertion of nested types into hive tables.
5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function.
a430895 [Michael Armbrust] Planning for logical Repartition operators.
532dd37 [Michael Armbrust] Allow the local warehouse path to be specified.
4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter.
8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package.
c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation.
29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables.
9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning
f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe
cf4db59 [Lian, Cheng] Added golden answers for PruningSuite
54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names
2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning
c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning
f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query.
128a9f8 [Yin Huai] Minor changes.
017872c [Yin Huai] Remove stats20 from whitelist.
a1a4776 [Yin Huai] Update comments.
feb022c [Yin Huai] Partitioning key should be case insensitive.
555fb1d [Yin Huai] Correctly set the extension for a text file.
d00260b [Yin Huai] Strips backticks from partition keys.
334aace [Yin Huai] New golden files.
a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query.
428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
eea75c5 [Yin Huai] Correctly set codec.
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
e089627 [Yin Huai] Code style.
563bb22 [Yin Huai] Set compression info in FileSinkDesc.
35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback
bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables.
5495fab [Yin Huai] Remove cloneRecords which is no longer needed.
1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy.
3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location.
8506c17 [Michael Armbrust] Address review feedback.
3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master
9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling
566fd66 [Timothy Chen] Whitelist tests and add support for Binary type
69adf72 [Yin Huai] Set cloneRecords to false.
a9c3188 [Timothy Chen] Fix udaf struct return
346f828 [Yin Huai] Move SharkHadoopWriter to the correct location.
59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
ed3a1d1 [Yin Huai] Load data directly into Hive.
7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT.
b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile
1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile
5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii
678341a [Mark Hamstra] Replaced non-ascii text
887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents
7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package.
bc9a12c [Michael Armbrust] Move hive test files.
5720d2b [Lian, Cheng] Fixed comment typo
f0c3742 [Lian, Cheng] Refactored PhysicalOperation
f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered
cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings
2407a21 [Lian, Cheng] Added optimized logical plan to debugging output
a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens
9329820 [Michael Armbrust] add golden answer files to repository
dce0593 [Michael Armbrust] move golden answer to the source code directory.
964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView
7785ee6 [Michael Armbrust] Tighten visibility based on comments.
341116c [Michael Armbrust] address comments.
0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS
2897deb [Michael Armbrust] fix scaladoc
7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries.
b376d15 [Michael Armbrust] fix newlines at EOF
5cc367c [Michael Armbrust] use berkeley instead of cloudbees
ff5ea3f [Michael Armbrust] new golden
db92adc [Michael Armbrust] more tests passing. clean up logging.
740febb [Michael Armbrust] Tests for tgfs.
0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf.
ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
dd00b7e [Michael Armbrust] initial implementation of generators.
ea76cf9 [Michael Armbrust] Add NoRelation to planner.
bea4b7f [Michael Armbrust] Add SumDistinct.
016b489 [Michael Armbrust] fix typo.
acb9566 [Michael Armbrust] Correctly type attributes of CTAS.
8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation.
02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query.
5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes
5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg
8017afb [Michael Armbrust] fix copy paste error.
dc6353b [Michael Armbrust] turn off deprecation
cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance.
883006d [Michael Armbrust] improve tests.
32b615b [Michael Armbrust] add override to asPartial.
e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe.
f94345c [Michael Armbrust] fix doc link
d8cb805 [Michael Armbrust] Implement partial aggregation.
ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings.  Remove a blank line.
b4be6a5 [Michael Armbrust] better logging when applying rules.
67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex
cb57459 [Michael Armbrust] blacklist machine specific test.
2f27604 [Michael Armbrust] Address comments / style errors.
389525d [Michael Armbrust] update golden, blacklist mr.
e3c10bd [Michael Armbrust] update whitelist.
44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex
42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs.
ab5bff3 [Michael Armbrust] Support for get item of map types.
1679554 [Michael Armbrust] add toString for if and IS NOT NULL.
ab9a131 [Michael Armbrust] when UDFs fail they should return null.
25288d0 [Michael Armbrust] Implement [] for arrays and maps.
e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions.
010accb [Michael Armbrust] add tinyint to metastore type parser.
7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes.
ac9d7de [Michael Armbrust] Resolve *s in Transform clauses.
692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables.
92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects
9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector.
72a003d [Michael Armbrust] revert regex change
7661b6c [Michael Armbrust] blacklist machines specific tests
aa430e7 [Michael Armbrust] Update .travis.yml
e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs.
5e54aa6 [Michael Armbrust] quotes for struct field names.
bbec500 [Michael Armbrust] update test coverage, new golden
3734a94 [Michael Armbrust] only quote string types.
3f9e519 [Michael Armbrust] use names w/ boolean args
5b3d2c8 [Michael Armbrust] implement distinct.
5b33216 [Michael Armbrust] work on decimal support.
2c6deb3 [Michael Armbrust] improve printing compatibility.
35a70fb [Michael Armbrust] multi-letter field names.
a9388fb [Michael Armbrust] printing for map types.
c3feda7 [Michael Armbrust] use toArray.
c654f19 [Michael Armbrust] Support for list and maps in hive table scan.
cf8d992 [Michael Armbrust] Use built in functions for creating temp directory.
1579eec [Michael Armbrust] Only cast unresolved inserts.
6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression.
da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test.  Not sure how this was passing before...
6709441 [Michael Armbrust] Evaluation for accessing nested fields.
dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation.
d670e41 [Michael Armbrust] Print nested fields like hive does.
efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan.
9c22b4e [Michael Armbrust] Support for parsing nested types.
82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables
ea6f37f [Michael Armbrust] fix style.
7845364 [Michael Armbrust] deactivate concurrent test.
b649c20 [Michael Armbrust] fix test logging / caching.
1590568 [Michael Armbrust] add log4j.properties
19bfd74 [Michael Armbrust] store hive output in circular buffer
dfb67aa [Michael Armbrust] add test case
cb775ac [Michael Armbrust] get rid of SharkContext singleton
2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master
63003e9 [Michael Armbrust] Fix spacing.
41b41f3 [Michael Armbrust] Only cast unresolved inserts.
6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs
5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator
b1151a8 [Timothy Chen] Fix load data regex
8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API.
e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction
45b334b [Yin Huai] fix comments
235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning.
6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style
271e483 [Michael Armbrust] Update build status icon.
d3a3d48 [Michael Armbrust] add testing to travis
807b2d7 [Michael Armbrust] check style and publish docs with travis
d20b565 [Michael Armbrust] fix if style
bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide.
d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests.  This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive.  These are used by default when HIVE_HOME is not set.
f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin
41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file).
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference
7213a2c [Reynold Xin] style fix for Hive.scala.
08e4d05 [Reynold Xin] First round of style cleanup.
605255e [Reynold Xin] Added scalastyle checker.
61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases
2486fb7 [Lian, Cheng] Fixed spelling
8ee41be [Lian, Cheng] Minor refactoring
ebb56fa [Michael Armbrust] add travis config
4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests
d4f539a [Michael Armbrust] blacklist mr and user specific tests.
677eb07 [Michael Armbrust] Update test whitelist.
5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning
c263c84 [Michael Armbrust] Only push predicates into partitioned table scans.
ab77882 [Michael Armbrust] upgrade spark to RC5.
c98ede5 [Lian, Cheng] Response to comments from @marmbrus
83d4520 [Yin Huai] marmbrus's comments
70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes
9ebff47 [Yin Huai] remove unnecessary .toSeq
e811d1a [Yin Huai] markhamstra's comments
4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations.
040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators.
9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions.
e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark
57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable
cb49af0 [Lian, Cheng] Fixed Scaladoc links
4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion
111ffdc [Lian, Cheng] More comments and minor reformatting
9e0d840 [Lian, Cheng] Added partition pruning optimization
761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan
04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
9dd3b26 [Michael Armbrust] Fix scaladoc.
6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst
7c92a41 [Lian, Cheng] Added Hive SerDe support
ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
2957f31 [Yin Huai] addressed comments on PR
907db68 [Michael Armbrust] Space after while.
04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts
4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile
5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23
fd084a4 [Michael Armbrust] implement casts binary <=> string.
0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type
548e479 [Yin Huai] merge master into exchangeOperator and fix code style
5b11db0 [Reynold Xin] Added Void to Boolean type widening.
9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear.
9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion
a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20
b20a4d4 [Lian, Cheng] Fix issue #20
6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc.
4285962 [Michael Armbrust] Remove temporary test cases
167162f [Michael Armbrust] more merge errors, cleanup.
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge.
6377d0b [Michael Armbrust] Drop empty files, fix if ().
c0b0e60 [Michael Armbrust] cleanup broken doc links.
330a88b [Michael Armbrust] Fix bugs in AddExchange.
4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering.
043e296 [Michael Armbrust] Make physical union nodes variadic.
ece15e1 [Michael Armbrust] update unit tests
5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width.
9804eb5 [Michael Armbrust] upgrade spark
053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow
5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow
ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes
bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg
563053f [Michael Armbrust] Address @rxin's comments.
6537c66 [Michael Armbrust] Address @rxin's comments.
2a76fc6 [Michael Armbrust] add notes from @rxin.
685bfa1 [Michael Armbrust] fix spelling
69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions.
7859a86 [Michael Armbrust] Remove SparkAggregate.  Its kinda broken and breaks RDD lineage.
fc22e01 [Michael Armbrust] whitelist newly passing union test.
3f547b8 [Michael Armbrust] Add support for widening types in unions.
53b95f8 [Michael Armbrust] coercion should not occur until children are resolved.
b892e32 [Michael Armbrust] Union is not resolved until the types match up.
95ab382 [Michael Armbrust] Use resolved instead of custom function.  This is better because some nodes override the notion of resolved.
81a109d [Michael Armbrust] fix link.
f143f61 [Michael Armbrust] Implement sampling.  Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!"
6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar.
c800798 [Michael Armbrust] Add build status icon.
0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown
05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row.
f2fdd77 [Michael Armbrust] fix required distribtion for aggregate.
658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala
583a337 [Michael Armbrust] break apart distribution and partitioning.
e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator
0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links.
73c70de [Yin Huai] add a first set of unit tests for data properties.
fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements.
2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans.
fcbc03b [Michael Armbrust] Fix if ().
7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators.
b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes
b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization.
e286d20 [Michael Armbrust] address code review comments.
80d0681 [Michael Armbrust] fix scaladoc links.
de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString.
3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution.
2abb0bc [Michael Armbrust] better debug messages, use exists.
098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
83adb9d [Yin Huai] add DataProperty
5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator.  This can probably also be use for codegen...
6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
058ec15 [Michael Armbrust] handle more writeables.
ffa9f25 [Michael Armbrust] blacklist some more MR tests.
aa2239c [Michael Armbrust] filter test lines containing Owner:
f71a325 [Michael Armbrust] Update golden jar.
a3003ae [Michael Armbrust] Update makefile to use better sharding support.
568d150 [Michael Armbrust] Updates to white/blacklist.
8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure.  See comments for details.
09c6300 [Michael Armbrust] Add nullability information to StructFields.
5460b2d [Michael Armbrust] load srcpart by default.
3695141 [Michael Armbrust] Lots of parser improvements.
965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions.
1814ed3 [Michael Armbrust] use childrenResolved function.
f2ec57e [Michael Armbrust] Begin supporting decimal.
6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
d0124f3 [Michael Armbrust] Correctly type null literals.
b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
a90efda [Michael Armbrust] utility function for outputing string stacktraces.
7102f33 [Michael Armbrust] methods with side-effects should use ().
3ccaef7 [Michael Armbrust] add renaming TODO.
bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
c8e89d5 [Michael Armbrust] memoize inputSet calculation.
6aefa46 [Michael Armbrust] Skip folding literals.
a72e540 [Michael Armbrust] Add IN operator.
04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
12fd52d [Michael Armbrust] support for sorting longs.
0606520 [Michael Armbrust] drop old comment.
859200a [Michael Armbrust] support for reading more types from the metastore.
1fedd18 [Michael Armbrust] coercion from null to numeric types
71e902d [Michael Armbrust] fix test cases.
cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
a92919d [Michael Armbrust] add alter view as to native commands
f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float
e9f4588 [Michael Armbrust] fix > 100 char.
755b229 [Michael Armbrust] blacklist some ddl tests.
9ae740a [Michael Armbrust] blacklist more tests that require MR.
4cfc11a [Michael Armbrust] more test coverage.
0d9d56a [Michael Armbrust] add more native commands to parser
78d730d [Michael Armbrust] Load src test table on RESET.
8364ec2 [Michael Armbrust] whitelist all possible partition values.
b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
516481c [Michael Armbrust] Ignore requests to explain native commands.
68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
9627616 [Michael Armbrust] Use current database as default database.
9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
6f64cee [Michael Armbrust] don't line wrap string literal
eafaeed [Michael Armbrust] add type documentation
f54c94c [Michael Armbrust] make golden answers file a test dependency
5362365 [Michael Armbrust] push conditions into join
0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
73b29cd [Michael Armbrust] fix bad casting
9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
7eff191 [Michael Armbrust] link all the expression names.
83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
01c00c2 [Michael Armbrust] new golden
5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
1a393da [Yin Huai] folded -> foldable
1e964ea [Yin Huai] update
a43d41c [Michael Armbrust] more tests passing!
8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
ab8bbd1 [Michael Armbrust] parsing % operator
c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
367fb9e [Yin Huai] update
0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
61b266f [Michael Armbrust] comment for eliminate subqueries.
d72a5a2 [Michael Armbrust] add long to literal factory object.
b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
e06fd38 [Michael Armbrust] black list map reduce tests.
8e7ce30 [Michael Armbrust] blacklist some env specific tests.
6250cbd [Michael Armbrust] Do not exit on test failure
b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
b6e4899 [Yin Huai] formatting
e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
9e190f5 [Michael Armbrust] drop unneeded ()
68b58c1 [Michael Armbrust] drop a few more tests.
b0aa400 [Michael Armbrust] update whitelist.
c99012c [Michael Armbrust] skip tests with hooks
db00ebf [Michael Armbrust] more types for hive udfs
dbc3678 [Michael Armbrust] update ghpages repo
138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure.
6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel.  usage: "make -j 8 -i"
8d47ed4 [Yin Huai] comment
2795f05 [Yin Huai] comment
e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
0bd1688 [Yin Huai] update
6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
52864da [Reynold Xin] Added executeCollect method to SharkPlan.
f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
38124bd [Yin Huai] formatting
2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
555d839 [Reynold Xin] More cleaning ...
d48d0e1 [Reynold Xin] Code review feedback.
aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if)
da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution.
0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename
e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore
3f9fee1 [Michael Armbrust] rewrite push filter through join optimization.
c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory.
c9777d8 [Reynold Xin] Put all source files in a catalyst directory.
019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files.
80ca4be [Timothy Chen] Address comments
0079392 [Michael Armbrust] support for multiple insert commands in a single query
75b5a01 [Michael Armbrust] remove space.
4283400 [Timothy Chen] Add limited predicate push down
e547e50 [Michael Armbrust] implement First.
e77c9b6 [Michael Armbrust] more work on unique join.
c795e06 [Michael Armbrust] improve star expansion
a26494e [Michael Armbrust] allow aliases to have qualifiers
d078333 [Michael Armbrust] remove extra space
a75c023 [Michael Armbrust] implement Coalesce
3a018b6 [Michael Armbrust] fix up docs.
ab6f67d [Michael Armbrust] import the string "null" as actual null.
5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved.
191ce3e [Michael Armbrust] analyze rewrite test query.
60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved.
2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues.
e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD
c086a35 [Michael Armbrust] docs, spacing
c4060e4 [Michael Armbrust] cleanup
3b85462 [Michael Armbrust] more tests passing
bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data.
c944a95 [Michael Armbrust] First aggregate expression.
1e28311 [Michael Armbrust] make tests execute in alpha order again
a287481 [Michael Armbrust] spelling
8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing.
a6ab6c7 [Michael Armbrust] add !=
4529594 [Michael Armbrust] draft of coalesce
70f253f [Michael Armbrust] more tests passing!
7349e7b [Michael Armbrust] initial support for test thrift table
d3c9305 [Michael Armbrust] fix > 100 char line
93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE"
06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths
6355d0e [Michael Armbrust] match actual return type of count with expected
cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty.
fd4b096 [Michael Armbrust] fix casing of null strings as well.
4632695 [Michael Armbrust] support for megastore bigint
67b88cf [Michael Armbrust] more verbose debugging of evaluation return types
c680e0d [Michael Armbrust] Failed string => number conversion should return null.
2326be1 [Michael Armbrust] make getClauses case insensitive.
dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types.
045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
fb5ddfd [Michael Armbrust] move ViewExamples to examples/
83833e8 [Michael Armbrust] more tests passing!
47c98d6 [Michael Armbrust] add query tests for like and hash.
1724c16 [Michael Armbrust] clear lines that contain last updated times.
cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse.
9b2642b [Michael Armbrust] make the blacklist support regexes
1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs
910e33e [Michael Armbrust] basic support for building an assembly jar.
d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore.
495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf.
65f4e69 [Michael Armbrust] remove incorrect comments
0831a3c [Michael Armbrust] support for parsing some operator udfs.
6c27aa7 [Michael Armbrust] more cast parsing.
43db061 [Michael Armbrust] significant generalization of hive udf functionality.
3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines.
e5690a6 [Michael Armbrust] add BinaryType
adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called.
d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]).
8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage.
21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function.
0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons.
2b70abf [Michael Armbrust] true and false literals.
ef8b0a5 [Michael Armbrust] more tests.
14d070f [Michael Armbrust] add support for correctly extracting partition keys.
0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
69a0bd4 [Michael Armbrust] promote strings in predicates with number too.
3946e31 [Michael Armbrust] don't build strings unless assertion fails.
90c453d [Michael Armbrust] more tests passing!
6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting.
8000504 [Michael Armbrust] Improve type coercion.
9087152 [Michael Armbrust] fix toString of Not.
58b111c [Michael Armbrust] fix bad scaladoc tag.
d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there.
ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness.
1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types.
d873b2b [Yin Huai] Remove numbers associated with test cases.
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown
d1e7b8e [Michael Armbrust] Update README.md
c8b1553 [Michael Armbrust] Update README.md
9307ef9 [Michael Armbrust] update list of passing tests.
934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers.
a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now.
ae0024a [Yin Huai] update
cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions
21976ae [Yin Huai] update
b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions
dedbf0c [Yin Huai] support Boolean literals
eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals
37817b5 [Yin Huai] add a comment to EvaluateLiterals.
468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
b1d1843 [Michael Armbrust] more work on big data benchmark tests.
cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark
7d7fa9f [Michael Armbrust] support for create table as
5f54f03 [Michael Armbrust] parsing for ASC
d42b725 [Michael Armbrust] Sum of strings requires cast
34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.)
81659cb [Michael Armbrust] implement transform operator.
5cd76d6 [Michael Armbrust] break up the file based test case code for reuse
1031b65 [Michael Armbrust] support for case insensitive resolution.
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots)
b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt
d9d18b4 [Michael Armbrust] debug logging implicit.
669089c [Yin Huai] support Boolean literals
ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals
73a05fd [Yin Huai] add a comment to EvaluateLiterals.
191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
80039cc [Yin Huai] Merge pull request #1 from yhuai/master
cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide
5c518e4 [Michael Armbrust] fix bug in test.
b50dd0e [Michael Armbrust] fix return type of overloaded method
05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview.
8c60cc0 [Michael Armbrust] Update README.md
03b9526 [Michael Armbrust] First draft of optimizer tests.
f392755 [Michael Armbrust] Add flatMap to TreeNode
6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings
15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions.
06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression.
4b0a888 [Michael Armbrust] implement string comparison and more casts.
356b321 [Michael Armbrust] Update README.md
3776395 [Michael Armbrust] Update README.md
304d17d [Michael Armbrust] Create README.md
b7d8be0 [Michael Armbrust] more tests passing.
b82481f [Michael Armbrust] add todo comment.
02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist.
cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support.
c43a259 [Michael Armbrust] comments
15ff448 [Michael Armbrust] better error message when a dsl test throws an exception
76ec650 [Michael Armbrust] fix join conditions
e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan.
91573a4 [Michael Armbrust] initial type promotion
e2ef4a5 [Michael Armbrust] logging
e43dc1e [Michael Armbrust] add string => int cast evaluation
f1f7e96 [Michael Armbrust] fix incorrect generation of join keys
2b27230 [Michael Armbrust] add depth based subtree access
0f6279f [Michael Armbrust] broken tests.
389bc0b [Michael Armbrust] support for partitioned columns in output.
12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name.
b67a225 [Michael Armbrust] better errors when types don't match up.
9e74808 [Michael Armbrust] add children resolved.
6d03ce9 [Michael Armbrust] defaults for unresolved relation
2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions
be5ae2c [Michael Armbrust] better resolution logging
cb7b5af [Michael Armbrust] views example
420e05b [Michael Armbrust] more tests passing!
6916c63 [Michael Armbrust] Reading from partitioned hive tables.
a1245f9 [Michael Armbrust] more tests passing
956e760 [Michael Armbrust] extended explain
5f14c35 [Michael Armbrust] more test tables supported
175c43e [Michael Armbrust] better errors for parse exceptions
480ade5 [Michael Armbrust] don't use partial cached results.
8a9d21c [Michael Armbrust] fix evaluation
7aee69c [Michael Armbrust] parsing for joins, boolean logic
7fcf480 [Michael Armbrust] test for and logic
3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines.
6902490 [Michael Armbrust] fix boolean logic evaluation
4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic
8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals
a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work)
60ec19d [Michael Armbrust] initial support for udfs
c45b440 [Michael Armbrust] support for is (not) null and boolean logic
7f4a1dc [Michael Armbrust] add NoRelation logical operator
72e183b [Michael Armbrust] support for null values in tree node args.
ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs
e5c9d1a [Michael Armbrust] use nonEmpty
dcc4fe1 [Michael Armbrust] support for src1 test table.
c78b587 [Michael Armbrust] casting.
75c3f3f [Michael Armbrust] add support for logging with scalalogging.
da2c011 [Michael Armbrust] make it more obvious when results are being truncated.
96b73ba [Michael Armbrust] more docs in TestShark
18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive.
e6d063b [Michael Armbrust] more join tests.
664c1c3 [Michael Armbrust] make parsing of function names case insensitive.
0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome.
1a6db68 [Michael Armbrust] spelling
7638cb4 [Michael Armbrust] simple join execution with dsl tests.  no hive tests yes.
859d4c9 [Michael Armbrust] better argString printing of nested trees.
fc53615 [Michael Armbrust] add same instance comparisons for tree nodes.
a026e6b [Michael Armbrust] move out hive specific operators
fff4d1c [Michael Armbrust] add simple query execution debugging
e2120ab [Michael Armbrust] sorting for strings
da06eb6 [Michael Armbrust] Parsing for sortby and joins
9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId.
8eb2460 [Michael Armbrust] add system property to override whitelist.
88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet
d25b171 [Michael Armbrust] Add AND and OR expressions
67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll
12acf0a [Michael Armbrust] add .DS_Store for macs
f7da6ce [Michael Armbrust] add agg with grouping expr in select test
36805b3 [Michael Armbrust] pull out and improve aggregation
75613e1 [Michael Armbrust] better evaluations failure messages.
4789a35 [Michael Armbrust] weaken type since its hard to create pure references.
e89dd36 [Michael Armbrust] no newline for online trees
d0590d4 [Michael Armbrust] include stack trace for catalyst failures.
081c0d9 [Michael Armbrust] more generic computation of agg functions.
31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser
ecd45b2 [Michael Armbrust] Add more passing tests.
97d5419 [Michael Armbrust] fix alignment.
565cc13 [Michael Armbrust] make the canary query optional.
a95e65c [Michael Armbrust] support for resolving qualified attribute references.
e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails.
4640a0b [Michael Armbrust] handle test tables when database is specified.
bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis.
fec5158 [Michael Armbrust] add hive / idea files to .gitignore
3f97ffe [Michael Armbrust] Rename Hive => HiveQl
656b836 [Michael Armbrust] Support for parsing select clause aliases.
3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs.
3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception.
8cbef8a [Michael Armbrust] spelling
aa8c37c [Michael Armbrust] Better toString for SortOrder
1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions
a2e0327 [Michael Armbrust] add a bunch of tests.
4a3e1ea [Michael Armbrust] docs and use shark for data loading.
339bb8f [Michael Armbrust] better docs, Not support
1d7b2d9 [Michael Armbrust] Add NaN conversions.
46a2534 [Michael Armbrust] only run canary query on failure.
8996066 [Michael Armbrust] remove protected from makeCopy
53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass
04a372a [Michael Armbrust] add a flag for running all tests.
3b2235b [Michael Armbrust] More general implementation of arithmetic.
edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results
da6c577 [Michael Armbrust] add string <==> file utility functions.
3adf5ca [Michael Armbrust] Initial support for groupBy and count.
7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ.
a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product.
d66ba7e [Michael Armbrust] drop printlns.
88f2efd [Michael Armbrust] Add sum / count distinct expressions.
05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark
07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands.
b8a9910 [Michael Armbrust] quote tests passing.
8e5e267 [Michael Armbrust] handle aliased select expressions.
4286a96 [Michael Armbrust] drop debugging println
ac34aeb [Michael Armbrust] proof of concept for hive ast transformations.
2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments
ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general.
74a6337 [Michael Armbrust] use fastEquals when doing transformations.
1184a23 [Michael Armbrust] add native test for escapes.
b972b18 [Michael Armbrust] create BaseRelation class
fa6bce9 [Michael Armbrust] implement union
6391a87 [Michael Armbrust] count aggregate.
d47c317 [Michael Armbrust] add unary minus, more tests passing.
c7114e4 [Michael Armbrust] first draft of star expansion.
044c43d [Michael Armbrust] better support for numeric literal parsing.
1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view.
61503c5 [Michael Armbrust] add cached toRdd
2036883 [Michael Armbrust] skip explain queries when testing.
ebac4b1 [Michael Armbrust] fix bug in sort reference calculation
ca0dee0 [Michael Armbrust] docs.
1ee0471 [Michael Armbrust] string literal parsing.
357278b [Michael Armbrust] add limit support
9b3e479 [Michael Armbrust] creation of string literals.
02efa30 [Michael Armbrust] alias evaluation
cb68b33 [Michael Armbrust] parsing for random sample in hive ql.
126dd36 [Michael Armbrust] include query plans in failure output
bb59ae9 [Michael Armbrust] doc fixes
7e68286 [Michael Armbrust] fix confusing naming
768bb25 [Michael Armbrust] handle errors in shark query toString
829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark.  Make test shark a singleton to avoid weirdness with the hive megastore.
ad02e41 [Michael Armbrust] comment jdo dependency
7bc89fe [Michael Armbrust] add collect to TreeNode.
438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs.
09679ee [Michael Armbrust] fix bug in TreeNode foreach
2930b27 [Michael Armbrust] more specific name for del query tests.
8842549 [Michael Armbrust] docs.
da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL.
a8969b9 [Michael Armbrust] Factor out hive query comparison test framework.
1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations.
a36dd9a [Michael Armbrust] evaluation for other > data types.
cae729b [Michael Armbrust] remove unnecessary lazy vals.
d8e12af [Michael Armbrust] docs
3a60d67 [Michael Armbrust] implement average, placeholder for count
f05c106 [Michael Armbrust] checkAnswer handles single row results.
2730534 [Michael Armbrust] implement inputSet
a9aa79d [Michael Armbrust] debugging for sort exec
8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors.
554b4b2 [Michael Armbrust] BoundAttribute pretty printing.
754f5fa [Michael Armbrust] dsl for setting nullability
a206d7a [Michael Armbrust] clean up query tests.
84ad6ef [Michael Armbrust] better sort implementation and tests.
de24923 [Michael Armbrust] add double type.
9611a2c [Michael Armbrust] literal creation for doubles.
7358313 [Michael Armbrust] sort order returns child type.
b544715 [Michael Armbrust] implement eval for rand, and > for doubles
7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols)
1c1a35e [Michael Armbrust] add simple Rand expression.
3ca51de [Michael Armbrust] add orderBy to dsl
7ae41ab [Michael Armbrust] more literal implicit conversions
b18b675 [Michael Armbrust] First cut at native query tests for shark.
d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark.
5eac895 [Michael Armbrust] better error when descending is specified.
2b16f86 [Michael Armbrust] add todo
e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization
9dde3c8 [Michael Armbrust] add project and filter operations.
ad9037b [Michael Armbrust] Add support for local relations.
6227143 [Michael Armbrust] evaluation of Equals.
7526290 [Michael Armbrust] BoundReference should also be an Attribute.
bd33e26 [Michael Armbrust] more documentation
5de0ea3 [Michael Armbrust] Move all shark specific into a separate package.  Lots of documentation improvements.
0ae292b [Michael Armbrust] implement calculation of sort expressions.
9fd5011 [Michael Armbrust] First cut at expression evaluation.
6259e3a [Michael Armbrust] cleanup
787e5a2 [Michael Armbrust] use fastEquals
f90da36 [Michael Armbrust] better printing of optimization exceptions
b05dd67 [Michael Armbrust] Application of rules to fixed point.
bb2e0db [Michael Armbrust] pretty print for literals.
1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals.
d3a3687 [Michael Armbrust] add fastEquals
2b4935b [Michael Armbrust] set sbt.version explicitly
46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests.
c79f2fd [Michael Armbrust] insert operator should return an empty rdd.
14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input.
ae7b4c3 [Michael Armbrust] remove implicit dependencies.  now compiles without copying things into lib/ manually.
84082f9 [Michael Armbrust] add sbt binaries and scripts
15371a8 [Michael Armbrust] First draft of simple Hive DDL parser.
063bf44 [Michael Armbrust] Periods should end all comments.
e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack.
ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing!
b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust.
e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected.
26f410a [Michael Armbrust] physical traits should extend PhysicalPlan.
dc72469 [Michael Armbrust] beginning of hive compatibility testing framework.
0763490 [Michael Armbrust] support for hive native command pass-through.
d8a924f [Michael Armbrust] scaladoc
29a7163 [Michael Armbrust] Insert into hive table physical operator.
633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy.
59ac444 [Michael Armbrust] add unary expression
3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName'
665f7d0 [Michael Armbrust] add logical nodes for hive data sinks.
64d2923 [Michael Armbrust] Add classes for representing sorts.
f72b7ce [Michael Armbrust] first trivial end to end query execution.
5c7d244 [Michael Armbrust] first draft of references implementation.
7bff274 [Michael Armbrust] point at new shark.
c7cd57f [Michael Armbrust] docs for util function.
910811c [Michael Armbrust] check each item of the sequence
ef21a0b [Michael Armbrust] line up comments.
4b765d5 [Michael Armbrust] docs, drop println
6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution.
a703c49 [Michael Armbrust] this order works better until fixed point is implemented.
ec1d7c0 [Michael Armbrust] Simple attribute resolution.
069df02 [Michael Armbrust] parsing binary predicates
a1cf754 [Michael Armbrust] add joins and equality.
3f5bc98 [Michael Armbrust] add optiq to sbt.
54f3460 [Michael Armbrust] initial optiq parsing.
d9161ce [Michael Armbrust] add join operator
1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs
24ef6fb [Michael Armbrust] toString for alias.
ae7d776 [Michael Armbrust] add nullability changing function
d49dc02 [Michael Armbrust] scaladoc for named exprs
7c45dd7 [Michael Armbrust] pretty printing of trees.
78e34bf [Michael Armbrust] simple git ignore.
7ba19be [Michael Armbrust] First draft of interface to hive metastore.
7e7acf0 [Michael Armbrust] physical placeholder.
1c11136 [Michael Armbrust] first draft of error handling / plans for debugging.
3766a41 [Michael Armbrust] rearrange utility functions.
7fb3d5e [Michael Armbrust] docs and equality improvements.
45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions.
002d4d4 [Michael Armbrust] default to no alias.
be25003 [Michael Armbrust] add repl initialization to sbt.
0608a00 [Michael Armbrust] tighten public interface
a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms.
daa71ca [Michael Armbrust] foreach, maps, and scaladoc
6a158cb [Michael Armbrust] simple transform working.
db0299f [Michael Armbrust] basic analysis of relations minus transform function.
f74c4ee [Michael Armbrust] parsing a simple query.
08e4f57 [Michael Armbrust] upgrade scala include shark.
d3c6404 [Michael Armbrust] initial commit
2014-03-20 18:03:20 -07:00
Reza Zadeh 66a03e5fe0 Principal Component Analysis
# Principal Component Analysis

Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of the coefficients return matrix contains coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm.

## Testing
Tests included:
 * All principal components
 * Only top k principal components
 * Dense SVD tests
 * Dense/sparse matrix tests

The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html

## Documentation
Added to mllib-guide.md

## Example Usage
Added to examples directory under SparkPCA.scala

Author: Reza Zadeh <rizlar@gmail.com>

Closes #88 from rezazadeh/sparkpca and squashes the following commits:

e298700 [Reza Zadeh] reformat using IDE
3f23271 [Reza Zadeh] documentation and cleanup
b025ab2 [Reza Zadeh] documentation
e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals
3787bb4 [Reza Zadeh] stylin
c6ecc1f [Reza Zadeh] docs
aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense
56975b0 [Reza Zadeh] docs
2df9bde [Reza Zadeh] docs update
8fb0015 [Reza Zadeh] rcond documentation
dbf7797 [Reza Zadeh] correct argument number
a9f1f62 [Reza Zadeh] documentation
4ce6caa [Reza Zadeh] style changes
9a56a02 [Reza Zadeh] use rcond relative to larget svalue
120f796 [Reza Zadeh] housekeeping
156ff78 [Reza Zadeh] string comprehension
2e1cf43 [Reza Zadeh] rename rcond
ea223a6 [Reza Zadeh] many style changes
f4002d7 [Reza Zadeh] more docs
bd53c7a [Reza Zadeh] proper accumulator
a8b5ecf [Reza Zadeh] Don't use for loops
0dc7980 [Reza Zadeh] filter zeros in sparse
6115610 [Reza Zadeh] More documentation
36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation
bc4599f [Reza Zadeh] configurable rcond
86f7515 [Reza Zadeh] compute per parition, use while
09726b3 [Reza Zadeh] more style changes
4195e69 [Reza Zadeh] private, accumulator
17002be [Reza Zadeh] style changes
4ba7471 [Reza Zadeh] style change
f4982e6 [Reza Zadeh] Use dense matrix in example
2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops
72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean
f807be9 [Reza Zadeh] fix typo
2d7ccde [Reza Zadeh] Array interface for dense svd and pca
cd290fa [Reza Zadeh] provide RDD[Array[Double]] support
398d123 [Reza Zadeh] style change
55abbfa [Reza Zadeh] docs fix
ef29644 [Reza Zadeh] bad chnage undo
472566e [Reza Zadeh] all files from old pr
555168f [Reza Zadeh] initial files
2014-03-20 10:39:20 -07:00
Sean Owen 97e4459e1e SPARK-1254. Consolidate, order, and harmonize repository declarations in Maven/SBT builds
This suggestion addresses a few minor suboptimalities with how repositories are handled.

1) Use HTTPS consistently to access repos, instead of HTTP

2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.)

3) Update SBT build to match Maven build in this regard

4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one.

Author: Sean Owen <sowen@cloudera.com>

Closes #145 from srowen/SPARK-1254 and squashes the following commits:

42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
2014-03-15 16:44:34 -07:00
jianghan 31a704004f Fix example bug: compile error
Author: jianghan <jianghan@xiaomi.com>

Closes #132 from pooorman/master and squashes the following commits:

54afbe0 [jianghan] Fix example bug: compile error
2014-03-12 19:46:12 -07:00
CodingCat 9032f7c0d5 SPARK-1160: Deprecate toArray in RDD
https://spark-project.atlassian.net/browse/SPARK-1160

reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python."

In this patch, I deprecated the method and changed the source files using it by replacing toArray with collect() directly

Author: CodingCat <zhunansjtu@gmail.com>

Closes #105 from CodingCat/SPARK-1060 and squashes the following commits:

286f163 [CodingCat] deprecate in JavaRDDLike
ee17b4e [CodingCat] add message and since
2ff7319 [CodingCat] deprecate toArray in RDD
2014-03-12 17:43:12 -07:00
Sandy Ryza a99fb3747a SPARK-1193. Fix indentation in pom.xmls
Author: Sandy Ryza <sandy@cloudera.com>

Closes #91 from sryza/sandy-spark-1193 and squashes the following commits:

a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls
2014-03-07 23:10:35 -08:00
anitatailor 9ae919c02f Example for cassandra CQL read/write from spark
Cassandra read/write using CqlPagingInputFormat/CqlOutputFormat

Author: anitatailor <tailor.anita@gmail.com>

Closes #87 from anitatailor/master and squashes the following commits:

3493f81 [anitatailor] Fixed scala style as per review
19480b7 [anitatailor] Example for cassandra CQL read/write from spark
2014-03-06 17:46:43 -08:00
Thomas Graves 7edbea41b4 SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets
resubmit pull request.  was https://github.com/apache/incubator-spark/pull/332.

Author: Thomas Graves <tgraves@apache.org>

Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits:

dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser
05eebed [Thomas Graves] Fix dependency lost in upmerge
d1040ec [Thomas Graves] Fix up various imports
05ff5e0 [Thomas Graves] Fix up imports after upmerging to master
ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase
13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests.
4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets
2f77147 [Thomas Graves] Rework from comments
50dd9f2 [Thomas Graves] fix header in SecurityManager
ecbfb65 [Thomas Graves] Fix spacing and formatting
b514bec [Thomas Graves] Fix reference to config
ed3d1c1 [Thomas Graves] Add security.md
6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments
2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework
5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils
f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets
2014-03-06 18:27:50 -06:00
Prashant Sharma 181ec50307 [java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits:

95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch.
85a954e [Prashant Sharma] Nit. import orderings.
673f7ac [Prashant Sharma] Added support for -java-home as well
80a13e8 [Prashant Sharma] Used fake class tag syntax
26eb3f6 [Prashant Sharma] Patrick's comments on PR.
35d8d79 [Prashant Sharma] Specified java 8 building in the docs
31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag.
4ab87d3 [Prashant Sharma] Review feedback on the pr
c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.
2014-03-03 22:31:30 -08:00
Patrick Wendell c3f5e07533 SPARK-1121: Include avro for yarn-alpha builds
This lets us explicitly include Avro based on a profile for 0.23.X
builds. It makes me sad how convoluted it is to express this logic
in Maven. @tgraves and @sryza curious if this works for you.

I'm also considering just reverting to how it was before. The only
real problem was that Spark advertised a dependency on Avro
even though it only really depends transitively on Avro through
other deps.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #49 from pwendell/avro-build-fix and squashes the following commits:

8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile
2014-03-02 15:18:19 -08:00
Sean Owen fd31adbf27 SPARK-1084.2 (resubmitted)
(Ported from https://github.com/apache/incubator-spark/pull/650 )

This adds one more change though, to fix the scala version warning introduced by json4s recently.

Author: Sean Owen <sowen@cloudera.com>

Closes #32 from srowen/SPARK-1084.2 and squashes the following commits:

9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency
1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway
2014-03-02 14:27:53 -08:00
Patrick Wendell 1fd2bfd3dd Remove remaining references to incubation
This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #51 from pwendell/tlp and squashes the following commits:

d553b1b [Patrick Wendell] Remove remaining references to incubation
2014-03-02 01:00:16 -08:00
Sean Owen c0ef3afa82 SPARK-1071: Tidy logging strategy and use of log4j
Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger.

Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j.

This leaves the same behavior in Spark. It means that downstream users who want to use something except log4j should:

- Exclude dependencies on log4j, slf4j-log4j12 from Spark
- Include dependency on log4j-over-slf4j
- Include dependency on another logger X, and another slf4j-X
- Recreate any log config that Spark does, that is needed, in the other logger's config

That sounds about right.

Here are the key changes:

- Include the jcl-over-slf4j shim everywhere by depending on it in core.
- Exclude dependencies on commons-logging from third-party libraries.
- Include the jul-to-slf4j shim everywhere by depending on it in core.
- Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings
- Added missing slf4j-log4j12 binding to GraphX, Bagel module tests

And minor/incidental changes:

- Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2
- (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
- (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me)

Author: Sean Owen <sowen@cloudera.com>

Closes #570 from srowen/SPARK-1071 and squashes the following commits:

52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core.
77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j
2014-02-23 11:40:55 -08:00
Prashant Sharma 919bd7f669 Merge pull request #567 from ScrapCodes/style2.
SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2

Continuation of PR #557

With this all scala style errors are fixed across the code base !!

The reason for creating a separate PR was to not interrupt an already reviewed and ready to merge PR. Hope this gets reviewed soon and merged too.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #567 and squashes the following commits:

3b1ec30 [Prashant Sharma] scala style fixes
2014-02-09 22:17:52 -08:00
Patrick Wendell b69f8b2a01 Merge pull request #557 from ScrapCodes/style. Closes #557.
SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

== Merge branch commits ==

commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4
Author: Prashant Sharma <scrapcodes@gmail.com>
Date:   Sun Feb 9 17:39:07 2014 +0530

    scala style fixes

commit f91709887a8e0b608c5c2b282db19b8a44d53a43
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Fri Jan 24 11:22:53 2014 -0800

    Adding scalastyle snapshot
2014-02-09 10:09:19 -08:00
Mark Hamstra c2341c92bb Merge pull request #542 from markhamstra/versionBump. Closes #542.
Version number to 1.0.0-SNAPSHOT

Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.

@pwendell

Author: Mark Hamstra <markhamstra@gmail.com>

== Merge branch commits ==

commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
Author: Mark Hamstra <markhamstra@gmail.com>
Date:   Wed Feb 5 09:30:32 2014 -0800

    Version number to 1.0.0-SNAPSHOT
2014-02-08 16:00:43 -08:00
Stevo Slavić f7fd80d9a7 Merge pull request #540 from sslavic/patch-3. Closes #540.
Fix line end character stripping for Windows

LogQuery Spark example would produce unwanted result when run on Windows platform because of different, platform specific trailing line end characters (not only \n but \r too).

This fix makes use of Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform specific stuff.

Author: Stevo Slavić <sslavic@gmail.com>

== Merge branch commits ==

commit 1e43ba0ea773cc005cf0aef78b6c1755f8e88b27
Author: Stevo Slavić <sslavic@gmail.com>
Date:   Wed Feb 5 14:48:29 2014 +0100

    Fix line end character stripping for Windows

    LogQuery Spark example would produce unwanted result when run on Windows platform because of different, platform specific trailing line end characters (not only \n but \r too).

    This fix makes use of Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform specific stuff.
2014-02-05 10:29:45 -08:00
Henry Saputra 0386f42e38 Merge pull request #529 from hsaputra/cleanup_right_arrowop_scala
Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency

Looks like there are some ⇒ Unicode character (maybe from scalariform) in Scala code.
This PR is to change it to => to get some consistency on the Scala code.

If we want to use ⇒ as default we could use sbt plugin scalariform to make sure all Scala code has ⇒ instead of =>

And remove unused imports found in TwitterInputDStream.scala while I was there =)

Author: Henry Saputra <hsaputra@apache.org>

== Merge branch commits ==

commit 29c1771d346dff901b0b778f764e6b4409900234
Author: Henry Saputra <hsaputra@apache.org>
Date:   Sat Feb 1 22:05:16 2014 -0800

    Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency.
2014-02-02 21:51:17 -08:00
Patrick Wendell a1238bb5fc Merge pull request #492 from skicavs/master
fixed job name and usage information for the JavaSparkPi example
2014-01-22 14:32:59 -08:00
Matei Zaharia d009b17d13 Merge pull request #315 from rezazadeh/sparsesvd
Sparse SVD

# Singular Value Decomposition
Given an *m x n* matrix *A*, compute matrices *U, S, V* such that

*A = U * S * V^T*

There is no restriction on m, but we require n^2 doubles to fit in memory.
Further, n should be less than m.

The decomposition is computed by first computing *A^TA = V S^2 V^T*,
computing svd locally on that (since n x n is small),
from which we recover S and V.
Then we compute U via easy matrix multiplication
as *U =  A * V * S^-1*

Only singular vectors associated with the largest k singular values
If there are k such values, then the dimensions of the return will be:

* *S* is *k x k* and diagonal, holding the singular values on diagonal.
* *U* is *m x k* and satisfies U^T*U = eye(k).
* *V* is *n x k* and satisfies V^TV = eye(k).

All input and output is expected in sparse matrix format, 0-indexed
as tuples of the form ((i,j),value) all in RDDs.

# Testing
Tests included. They test:
- Decomposition promise (A = USV^T)
- For small matrices, output is compared to that of jblas
- Rank 1 matrix test included
- Full Rank matrix test included
- Middle-rank matrix forced via k included

# Example Usage

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixyEntry

// Load and parse the data file
val data = sc.textFile("mllib/data/als/test.data").map { line =>
      val parts = line.split(',')
      MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4

// recover top 1 singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)

println("singular values = " + decomposed.S.data.toArray.mkString)

# Documentation
Added to docs/mllib-guide.md
2014-01-22 14:01:30 -08:00
Kevin Mader 36f9a64ec9 fixed job name and usage information for the JavaSparkPi example 2014-01-22 15:58:23 +01:00
Tathagata Das 2e95174c45 Added StreamingContext.awaitTermination to streaming examples. 2014-01-20 20:25:04 -08:00
Reza Zadeh caf97a25a2 Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-17 14:34:03 -08:00
Reza Zadeh 4e96757793 make example 0-indexed 2014-01-17 14:33:03 -08:00
Reza Zadeh d28bf41827 changes from PR 2014-01-17 13:39:40 -08:00
Tathagata Das 11e6534d92 Updated java API docs for streaming, along with very minor changes in the code examples. 2014-01-16 14:44:02 -08:00
Ankur Dave 8ea056d721 Add GraphX dependency to examples/pom.xml 2014-01-14 13:58:48 -08:00
Patrick Wendell 23034798d7 Add missing header files 2014-01-14 01:17:13 -08:00
Tathagata Das f8e239e058 Merge remote-tracking branch 'apache/master' into filestream-fix
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-13 23:57:27 -08:00
Reza Zadeh 845e568fad Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-13 23:52:34 -08:00
Tathagata Das 4e497db8f3 Removed StreamingContext.registerInputStream and registerOutputStream - they were useless as InputDStream has been made to register itself. Also made DStream.register() private[streaming] - not useful to expose the confusing function. Updated a lot of documentation. 2014-01-13 23:23:46 -08:00
Reynold Xin e2d25d2dfe Merge branch 'master' into graphx 2014-01-13 16:21:26 -08:00
Patrick Wendell b93f9d42f2 Merge pull request #400 from tdas/dstream-move
Moved DStream and PairDSream to org.apache.spark.streaming.dstream

Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think its better to keep it consistent with Spark's structure.

Also fixed persistence of windowed DStream. The RDDs generated generated by windowed DStream are essentially unions of underlying RDDs, and persistent these union RDDs would store numerous copies of the underlying data. Instead setting the persistence level on the windowed DStream is made to set the persistence level of the underlying DStream.
2014-01-13 12:18:05 -08:00
Ankur Dave 8ca9773974 Add LiveJournalPageRank example 2014-01-13 12:17:58 -08:00
Tathagata Das 777c181d2f Merge remote-tracking branch 'apache/master' into dstream-move
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-12 21:59:51 -08:00
Patrick Wendell 0ab505a29e Merge pull request #395 from hsaputra/remove_simpleredundantreturn_scala
Remove simple redundant return statements for Scala methods/functions

Remove simple redundant return statements for Scala methods/functions:

-) Only change simple return statements at the end of method
-) Ignore the complex if-else check
-) Ignore the ones inside synchronized
-) Add small changes to making var to val if possible and remove () for simple get

This hopefully makes the review simpler =)

Pass compile and tests.
2014-01-12 21:31:04 -08:00
Patrick Wendell f4d77f8cb8 Rename DStream.foreach to DStream.foreachRDD
`foreachRDD` makes it clear that the granularity of this operator is per-RDD.
As it stands, `foreach` is inconsistent with with `map`, `filter`, and the other
DStream operators which get pushed down to individual records within each RDD.
2014-01-12 17:21:00 -08:00
Henry Saputra 91a563608e Merge branch 'master' into remove_simpleredundantreturn_scala 2014-01-12 10:34:13 -08:00
Henry Saputra 93a65e5fde Remove simple redundant return statement for Scala methods/functions:
-) Only change simple return statements at the end of method
-) Ignore the complex if-else check
-) Ignore the ones inside synchronized
2014-01-12 10:30:04 -08:00
Reza Zadeh f324d53555 Merge remote-tracking branch 'upstream/master' into sparsesvd 2014-01-11 13:27:15 -08:00
Reza Zadeh 1afdeaeb2f add dimension parameters to example 2014-01-10 21:30:54 -08:00
Tathagata Das 4f39e79c23 Merge remote-tracking branch 'apache/master' into driver-test
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
2014-01-10 15:47:01 -08:00
Tathagata Das e4bb845238 Updated docs based on Patrick's comments in PR 383. 2014-01-10 12:17:09 -08:00
Reza Zadeh 21c8a54c08 Merge remote-tracking branch 'upstream/master' into sparsesvd
Conflicts:
	docs/mllib-guide.md
2014-01-09 22:45:32 -08:00
Reza Zadeh cf5bd4ab2e fix example 2014-01-09 22:39:41 -08:00
Patrick Wendell 997c830e0b Merge pull request #363 from pwendell/streaming-logs
Set default logging to WARN for Spark streaming examples.

This programatically sets the log level to WARN by default for streaming
tests. If the user has already specified a log4j.properties file,
the user's file will take precedence over this default.
2014-01-09 22:22:20 -08:00
Patrick Wendell 7b748b83a1 Minor clean-up 2014-01-09 20:42:48 -08:00
Tathagata Das f1d206c6b4 Merge branch 'standalone-driver' into driver-test
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala
	examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2014-01-09 15:06:24 -08:00
Tathagata Das 6f713e2a3e Changed the way StreamingContext finds and reads checkpoint files, and added JavaStreamingContext.getOrCreate. 2014-01-09 13:42:04 -08:00
Ankur Dave 3b2e22e2c3 Revert changes to examples/.../PageRankUtils.scala
Reverts to 04d83fc37f9eef89c20331c85291a0a169f75e6d:examples/src/main/scala/org/apache/spark/examples/bagel/PageRankUtils.scala.
2014-01-09 13:27:40 -08:00
Patrick Wendell 35f80da21a Set default logging to WARN for Spark streaming examples.
This programatically sets the log level to WARN by default for streaming
tests. If the user has already specified a log4j.properties file,
the user's file will take precedence over this default.
2014-01-09 10:42:58 -08:00
Ankur Dave 91227566bc Merge remote-tracking branch 'spark-upstream/master' into HEAD
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
	pom.xml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2014-01-08 21:19:08 -08:00