Fix a bug in the PageRank example where passing only one argument caused an array-out-of-bounds exception.
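A minimal sketch of the kind of guard this fix implies, assuming the example takes an input path followed by an optional iteration count. The object name `PageRankArgs`, the helper `parse`, the usage string, and the default of 10 iterations are illustrative assumptions, not taken from the patch:

```scala
// Hypothetical helper; names and the default value are assumptions for illustration.
object PageRankArgs {
  val DefaultIterations = 10

  /** Returns (inputPath, iterations), or an error message if the path is missing. */
  def parse(args: Array[String]): Either[String, (String, Int)] = {
    if (args.length < 1) {
      Left("Usage: <file> [<iter>]")
    } else {
      // Only read args(1) when it is present; indexing it unconditionally
      // is what throws ArrayIndexOutOfBoundsException for a single argument.
      val iterations = if (args.length > 1) args(1).toInt else DefaultIterations
      Right((args(0), iterations))
    }
  }
}
```

With this shape, a single argument falls back to the default iteration count instead of failing.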
Author: Li Yao <hnkfliyao@gmail.com>
Closes #6455 from lastland/patch-1 and squashes the following commits:
de06128 [Li Yao] Fix the bug that entering only 1 arg will cause array out of bounds exception.
(cherry picked from commit c771589c96)
Signed-off-by: Andrew Or <andrew@databricks.com>
Since `spark-streaming-kafka` is now published for both Scala 2.10 and 2.11, we can move `KafkaWordCount` and `DirectKafkaWordCount` from `examples/scala-2.10/src/` to `examples/src/` so that they appear in `spark-examples-***-jar` for Scala 2.11.
Author: zsxwing <zsxwing@gmail.com>
Closes #6436 from zsxwing/SPARK-7895 and squashes the following commits:
c6052f1 [zsxwing] Update examples/pom.xml
0bcfa87 [zsxwing] Fix the sleep time
b9d1256 [zsxwing] Move Kafka examples from scala-2.10/src to src
(cherry picked from commit 000df2f0d6)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API.
Author: Reynold Xin <rxin@databricks.com>
Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits:
7465c2c [Reynold Xin] Fixed unit test.
118e609 [Reynold Xin] Updated tests.
3441b57 [Reynold Xin] Updated javadoc.
13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
(cherry picked from commit 517eb37a85)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes #6211 from rxin/mllib-reader and squashes the following commits:
79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
(cherry picked from commit 161d0b4a41)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This patch introduces DataFrameWriter and DataFrameReader.
The DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods used to reside in SQLContext. Example usage:
```scala
sqlContext.read.json("...")
sqlContext.read.parquet("...")
```
The DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid a proliferation of options when writing a DataFrame out. It currently implements:
- mode
- format (e.g. "parquet", "json")
- options (generic options passed down into data sources)
- partitionBy (partitioning columns)
Example usage:
```scala
df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
```
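To illustrate why a builder keeps the option count manageable, here is a toy model in plain Scala. It is not Spark's `DataFrameWriter`: the names `ToyWriterDemo` and `ToyWriter`, the field names, the default values, and the string returned by `saveAsTable` are all assumptions made for this sketch.

```scala
// Toy model of a builder-style writer; NOT the real DataFrameWriter.
object ToyWriterDemo {
  final case class ToyWriter(
      saveMode: String = "error",        // arbitrary default for the sketch
      source: String = "parquet",        // arbitrary default for the sketch
      extraOptions: Map[String, String] = Map.empty,
      partitionColumns: Seq[String] = Seq.empty) {

    // Each setter returns a new writer, so calls can be chained in any order.
    def mode(m: String): ToyWriter = copy(saveMode = m)
    def format(f: String): ToyWriter = copy(source = f)
    def option(key: String, value: String): ToyWriter =
      copy(extraOptions = extraOptions + (key -> value))
    def partitionBy(cols: String*): ToyWriter = copy(partitionColumns = cols.toList)

    // Terminal step: a real writer would perform I/O here; the toy only
    // reports what it would do.
    def saveAsTable(table: String): String =
      s"write table=$table format=$source mode=$saveMode " +
        s"partitionBy=${partitionColumns.mkString(",")}"
  }
}
```

Adding a new setting then means adding one setter, rather than another overload of every save method.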
TODO:
- [ ] Documentation update
- [ ] Move JDBC into reader / writer?
- [ ] Deprecate the old interfaces
- [ ] Move the generic load interface into reader.
- [ ] Update example code and documentation
Author: Reynold Xin <rxin@databricks.com>
Closes #6175 from rxin/reader-writer and squashes the following commits:
b146c95 [Reynold Xin] Deprecation of old APIs.
bd8abdf [Reynold Xin] Fixed merge conflict.
26abea2 [Reynold Xin] Added general load methods.
244fbec [Reynold Xin] Added equivalent to example.
4f15d92 [Reynold Xin] Added documentation for partitionBy.
7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
(cherry picked from commit 578bfeeff5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Java and Scala examples for OneVsRest. Fixes the base classifier to be Logistic Regression and accepts the configuration parameters of the base classifier.
Author: Ram Sriharsha <rsriharsha@hw11853.local>
Closes #6115 from harsha2010/SPARK-7575 and squashes the following commits:
87ad3c7 [Ram Sriharsha] extra line
f5d9891 [Ram Sriharsha] Merge branch 'master' into SPARK-7575
7076084 [Ram Sriharsha] cleanup
dfd660c [Ram Sriharsha] cleanup
8703e4f [Ram Sriharsha] update doc
cb23995 [Ram Sriharsha] fix commandline options for JavaOneVsRestExample
69e91f8 [Ram Sriharsha] cleanup
7f4e127 [Ram Sriharsha] cleanup
d4c40d0 [Ram Sriharsha] Code Review fixes
461eb38 [Ram Sriharsha] cleanup
e0106d9 [Ram Sriharsha] Fix typo
935cf56 [Ram Sriharsha] Try to match Java and Scala Example Commandline options
5323ff9 [Ram Sriharsha] cleanup
196a59a [Ram Sriharsha] cleanup
6adfa0c [Ram Sriharsha] Style Fix
8cfc5d5 [Ram Sriharsha] [SPARK-7575] Example code for OneVsRest
(cherry picked from commit cc12a86fb0)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
The difference arises because Spark 1.3 did not fit the intercept. Here, we change the input `String` so that instance 6 is classified as `1.0` without any ambiguity.
With lambda = 0.001 in the current LOR implementation, the predictions are
```
(4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589], prediction=1.0
(5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
(6, spark hadoop spark) --> prob=[0.0692663313297627,0.9307336686702373], prediction=1.0
(7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
```
and the predictions on the training data are
```
(0, a b c d e spark) --> prob=[0.0021342419881406746,0.9978657580118594], prediction=1.0
(1, b d) --> prob=[0.9959176174854043,0.004082382514595685], prediction=0.0
(2, spark f g h) --> prob=[0.0014541569986711233,0.9985458430013289], prediction=1.0
(3, hadoop mapreduce) --> prob=[0.9982978367343561,0.0017021632656438518], prediction=0.0
```
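In every row above, the reported prediction matches the index of the larger entry in `prob`. A minimal sketch that checks this for rows 4 to 7, assuming a plain argmax rule. The object `LorPredictionCheck` and the helper `predict` are hypothetical and are not the library's code:

```scala
object LorPredictionCheck {
  // Hypothetical helper: pick the class index with the highest probability.
  def predict(prob: Array[Double]): Double = prob.indexOf(prob.max).toDouble

  // (id, probabilities, reported prediction), copied from the output above.
  val rows: Seq[(Int, Array[Double], Double)] = Seq(
    (4, Array(0.1596407738787411, 0.8403592261212589), 1.0),
    (5, Array(0.8378325685476612, 0.16216743145233883), 0.0),
    (6, Array(0.0692663313297627, 0.9307336686702373), 1.0),
    (7, Array(0.9821575333444208, 0.01784246665557917), 0.0)
  )

  // True when the argmax rule reproduces every reported prediction.
  def allConsistent: Boolean =
    rows.forall { case (_, p, reported) => predict(p) == reported }
}
```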
Author: DB Tsai <dbt@netflix.com>
Closes #6109 from dbtsai/lor-example and squashes the following commits:
ac63ce4 [DB Tsai] first commit
(cherry picked from commit c1080b6fdd)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
A param instance is strongly attached to a parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, copying the params to the copied instances becomes error-prone. In this PR, a param is identified by its parent's UID and the param name, so it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the size of the diff, and made `parent` a mutable field.
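The idea of identifying a param by its parent's UID plus its name can be sketched in plain Scala as follows. This is a toy model under stated assumptions, not the spark.ml classes: `UidParamSketch`, `Component`, and the `maxIter` param are hypothetical names chosen for illustration.

```scala
object UidParamSketch {
  // A param is identified by its parent's UID plus its own name, so
  // equality does not depend on a reference to the parent object.
  final case class Param(parentUid: String, name: String)

  // Toy component: the uid is kept when copying, so params keep resolving.
  final case class Component(uid: String, values: Map[Param, Any] = Map.empty) {
    val maxIter: Param = Param(uid, "maxIter")

    def set(p: Param, v: Any): Component = {
      // Reject params that belong to a component with a different UID.
      require(p.parentUid == uid, s"Param ${p.name} does not belong to $uid")
      copy(values = values + (p -> v))
    }

    def get(p: Param): Option[Any] = values.get(p)
  }
}
```

Because a copy keeps the same UID, a param obtained from the original instance can still be used to look up values on the copy.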
This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
(cherry picked from commit 1b8625f425)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
As is, to specify this option on the command line, you have to escape the angle brackets.
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
(cherry picked from commit 4f8a155192)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
jira: https://issues.apache.org/jira/browse/SPARK-7475
Add a new argument to specify the algorithm applied to LDA, to exhibit the basic usage of LDAOptimizer.
cc jkbradley
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #6000 from hhbyyh/ldaExample and squashes the following commits:
0a7e2bc [Yuhao Yang] fix according to comments
5810b0f [Yuhao Yang] adjust ldaExample for online LDA
(cherry picked from commit b13162b364)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Otherwise it throws the following exception:
```
Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
at kafka.producer.Producer.send(Producer.scala:77)
at org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
at org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
Author: jerryshao <saisai.shao@intel.com>
Closes #5936 from jerryshao/SPARK-7396 and squashes the following commits:
270bbe2 [jerryshao] Fix Kafka Produce throw Exception issue
(cherry picked from commit 316a5c0423)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
This PR also makes some of the DataFrame to RDD methods private as the RDD class is private in 1.4
cc rxin pwendell
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #5949 from shivaram/sparkr-examples and squashes the following commits:
6c42fdc [Shivaram Venkataraman] Remove SparkR RDD examples, add dataframe examples
(cherry picked from commit 4e930420c1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
The following items are added to Python kmeans:
kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k
Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
(cherry picked from commit 5995ada96b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:
~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}
~~~
Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:
~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~
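The two snippets above can be combined into a small self-contained sketch. It is a toy under stated assumptions, not the spark.ml implementation: `CopyFitSketch` and `ToyEstimator` are hypothetical, params are modeled as a `Map[String, Double]`, and "fitting" is just a shrunken mean.

```scala
object CopyFitSketch {
  type ParamMap = Map[String, Double]

  // Toy estimator; "fitting" shrinks the mean of the data by regParam.
  final class ToyEstimator(val params: ParamMap) {
    // `$` reads an embedded param value, mirroring the alias described above.
    def $(name: String): Double = params(name)

    // Copy with the extra values layered on top of the embedded ones.
    def copy(extra: ParamMap): ToyEstimator = new ToyEstimator(params ++ extra)

    // Core fit: uses only the embedded values.
    def fit(dataset: Seq[Double]): Double = {
      val mean = if (dataset.isEmpty) 0.0 else dataset.sum / dataset.size
      mean * (1.0 - $("regParam"))
    }

    // Default overload: fold the extra values into a copy, then fit.
    def fit(dataset: Seq[Double], extra: ParamMap): Double =
      copy(extra).fit(dataset)
  }
}
```

Only the single-argument `fit` contains real logic; the overload with extra values needs no per-component code.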
Meta-algorithms like `Pipeline` implement their own `copy(extra)`, so the fitted pipeline model stores all copied stages (whether each stage is a transformer or a model).
Other changes:
* `Params$.inheritValues` is moved to `Params.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.
TODOs:
* [x] add tests for newly added methods
* [ ] update documentation
jkbradley dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:
7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
(cherry picked from commit e0833c5958)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Main change: Added isValid field to Param. Modified all usages to use isValid when relevant. Added helper methods in ParamValidate.
Also overrode Params.validate() in:
* CrossValidator + model
* Pipeline + model
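A minimal sketch of a param that carries its own validity check, in plain Scala. This is a toy model, not the spark.ml code: `ParamValidationSketch`, the `Validators` helpers, and the two example params are hypothetical names and ranges chosen for illustration.

```scala
object ParamValidationSketch {
  // A param carries its own validity check as a function field.
  final case class Param[T](name: String, isValid: T => Boolean) {
    // Throws IllegalArgumentException when the value fails the check.
    def validate(value: T): Unit =
      require(isValid(value), s"Invalid value $value for param $name")
  }

  // Small factory helpers for common checks (hypothetical names).
  object Validators {
    def gtEq(lower: Double): Double => Boolean = _ >= lower
    def inRange(lower: Double, upper: Double): Double => Boolean =
      v => v >= lower && v <= upper
  }

  // Example params with assumed valid ranges.
  val regParam: Param[Double] = Param("regParam", Validators.gtEq(0.0))
  val elasticNetParam: Param[Double] =
    Param("elasticNetParam", Validators.inRange(0.0, 1.0))
}
```

Keeping the check on the param itself means every setter can reuse it, instead of each component re-implementing range checks.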
I made a few updates for the elastic net patch:
* I changed "tol" to "convergenceTol"
* I added some documentation
This PR is Scala + Java only. Python will be in a follow-up PR.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #5740 from jkbradley/enforce-validate and squashes the following commits:
ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg. Fixed test failures. Renamed ParamValidate to ParamValidators. Removed explicit type from ParamValidators calls where possible.
bb2665a [Joseph K. Bradley] merged with elastic net pr
ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
6895dfc [Joseph K. Bradley] small cleanups
069ac6d [Joseph K. Bradley] many cleanups
928fb84 [Joseph K. Bradley] Maybe done
a910ac7 [Joseph K. Bradley] still workin
6d60e2e [Joseph K. Bradley] Still workin
b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
dbc9fb2 [Joseph K. Bradley] merged with master. enforcing Params.validate