Commit graph

1096 commits

Author SHA1 Message Date
Mike Dusenberry 1281a35188 [SPARK-7920] [MLLIB] Make MLlib ChiSqSelector Serializable (& Fix Related Documentation Example).
The MLlib ChiSqSelector class is not serializable, and so the example in the ChiSqSelector documentation fails. Also, that example is missing the import of ChiSqSelector.

This PR makes ChiSqSelector extend Serializable in MLlib, and adds the ChiSqSelector import statement to the associated example in the documentation.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6462 from dusenberrymw/Make_ChiSqSelector_Serializable_and_Fix_Related_Docs_Example and squashes the following commits:

9cb2f94 [Mike Dusenberry] Make MLlib ChiSqSelector Serializable.
d9003bf [Mike Dusenberry] Add missing import in MLlib ChiSqSelector Docs Scala example.
2015-05-30 16:50:59 -07:00
Reynold Xin 7716a5a1ec Updated SQL programming guide's Hive connectivity section. 2015-05-30 14:57:23 -07:00
Cheng Lian 6e3f0c7810 [SPARK-7849] [SQL] [Docs] Updates SQL programming guide for 1.4
Author: Cheng Lian <lian@databricks.com>

Closes #6520 from liancheng/spark-7849 and squashes the following commits:

705264b [Cheng Lian] Updates SQL programming guide for 1.4
2015-05-30 12:16:09 -07:00
Taka Shinagawa 3ab71eb9d5 [DOCS] [MINOR] Update for the Hadoop versions table with hadoop-2.6
Updated the doc for the hadoop-2.6 profile, which is new to Spark 1.4

Author: Taka Shinagawa <taka.epsilon@gmail.com>

Closes #6450 from mrt/docfix2 and squashes the following commits:

db1c43b [Taka Shinagawa] Updated the hadoop versions for hadoop-2.6 profile
323710e [Taka Shinagawa] The hadoop-2.6 profile is added to the Hadoop versions table
2015-05-30 08:25:21 -04:00
Sean Owen 8c8de3ed86 [SPARK-7890] [DOCS] Document that Spark 2.11 now supports Kafka
Remove caveat about Kafka / JDBC not being supported for Scala 2.11

Author: Sean Owen <sowen@cloudera.com>

Closes #6470 from srowen/SPARK-7890 and squashes the following commits:

4652634 [Sean Owen] One more rewording
7b7f3c8 [Sean Owen] Restore note about JDBC component
126744d [Sean Owen] Remove caveat about Kafka / JDBC not being supported for Scala 2.11
2015-05-30 07:59:27 -04:00
Octavian Geagla e3a4374833 [SPARK-7459] [MLLIB] ElementwiseProduct Java example
Author: Octavian Geagla <ogeagla@gmail.com>

Closes #6008 from ogeagla/elementwise-prod-doc and squashes the following commits:

72e6dc0 [Octavian Geagla] [SPARK-7459] [MLLIB] Java example import.
cf2afbd [Octavian Geagla] [SPARK-7459] [MLLIB] Update description of example.
b66431b [Octavian Geagla] [SPARK-7459] [MLLIB] Add override annotation to java example, make scala example use same data as java.
6b26b03 [Octavian Geagla] [SPARK-7459] [MLLIB] Fix line which is too long.
79af020 [Octavian Geagla] [SPARK-7459] [MLLIB] Actually don't use Java 8.
9d5b31a [Octavian Geagla] [SPARK-7459] [MLLIB] Don't use Java 8
4f0c92f [Octavian Geagla] [SPARK-7459] [MLLIB] ElementwiseProduct Java example.
2015-05-30 00:00:36 -07:00
Octavian Geagla da2112aef2 [SPARK-7576] [MLLIB] Add spark.ml user guide doc/example for ElementwiseProduct
Author: Octavian Geagla <ogeagla@gmail.com>

Closes #6501 from ogeagla/ml-guide-elemwiseprod and squashes the following commits:

4ad93d5 [Octavian Geagla] [SPARK-7576] [MLLIB] Incorporate code review feedback.
f7be7ad [Octavian Geagla] [SPARK-7576] [MLLIB] Add spark.ml user guide doc/example for ElementwiseProduct.
2015-05-29 23:55:19 -07:00
Taka Shinagawa 3792d25836 [DOCS][Tiny] Added a missing dash(-) in docs/configuration.md
The first line had only two dashes (--) instead of three(---). Because of this missing dash(-), 'jekyll build' command was not converting configuration.md to _site/configuration.html

Author: Taka Shinagawa <taka.epsilon@gmail.com>

Closes #6513 from mrt/docfix3 and squashes the following commits:

c470e2c [Taka Shinagawa] Added a missing dash(-) preventing jekyll from converting configuration.md to html format
2015-05-29 20:35:14 -07:00
Shivaram Venkataraman 5f48e5c33b [SPARK-6806] [SPARKR] [DOCS] Add a new SparkR programming guide
This PR adds a new SparkR programming guide at the top-level. This will be useful for R users as our APIs don't directly match the Scala/Python APIs and as we need to explain SparkR without using RDDs as examples etc.

cc rxin davies pwendell

cc cafreeman -- Would be great if you could also take a look at this !

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6490 from shivaram/sparkr-guide and squashes the following commits:

d5ff360 [Shivaram Venkataraman] Add a section on HiveContext, HQL queries
408dce5 [Shivaram Venkataraman] Fix link
dbb86e3 [Shivaram Venkataraman] Fix minor typo
9aff5e0 [Shivaram Venkataraman] Address comments, use dplyr-like syntax in example
d09703c [Shivaram Venkataraman] Fix default argument in read.df
ea816a1 [Shivaram Venkataraman] Add a new SparkR programming guide Also update write.df, read.df to handle defaults better
2015-05-29 14:11:58 -07:00
WangTaoTheTonic a51b133de3 [SPARK-7524] [SPARK-7846] add configs for keytab and principal, pass these two configs with different way in different modes
* As spark now supports long running service by updating tokens for namenode, but only accept parameters passed with "--k=v" format which is not very convinient. This patch add spark.* configs in properties file and system property.

*  --principal and --keytabl options are passed to client but when we started thrift server or spark-shell these two are also passed into the Main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main).
In these two main class, arguments passed in will be processed with some 3rd libraries, which will lead to some error: "Invalid option: --principal" or "Unrecgnised option: --principal".
We should pass these command args in different forms, say system properties.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6051 from WangTaoTheTonic/SPARK-7524 and squashes the following commits:

e65699a [WangTaoTheTonic] change logic to loadEnvironments
ebd9ea0 [WangTaoTheTonic] merge master
ecfe43a [WangTaoTheTonic] pass keytab and principal seperately in different mode
33a7f40 [WangTaoTheTonic] expand the use of the current configs
08bb4e8 [WangTaoTheTonic] fix wrong cite
73afa64 [WangTaoTheTonic] add configs for keytab and principal, move originals to internal
2015-05-29 11:06:11 -05:00
Xusen Yin 1bd63e82fd [SPARK-7577] [ML] [DOC] add bucketizer doc
CC jkbradley

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6451 from yinxusen/SPARK-7577 and squashes the following commits:

e2dc32e [Xusen Yin] rename colums
e350e49 [Xusen Yin] add all demos
006ddf1 [Xusen Yin] add java test
3238481 [Xusen Yin] add bucketizer
2015-05-28 17:30:12 -07:00
Mike Dusenberry 3e312a5ed0 [DOCS] Fixing broken "IDE setup" link in the Building Spark documentation.
The location of the IDE setup information has changed, so this just updates the link on the Building Spark page.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6467 from dusenberrymw/Fix_Broken_Link_On_Building_Spark_Doc and squashes the following commits:

75c533a [Mike Dusenberry] Fixing broken "IDE setup" link in the Building Spark documentation by pointing to new location.
2015-05-28 17:15:10 -04:00
Matt Wise 35410614de [DOCS] Fix typo in documentation for Java UDF registration
This contribution is my original work and I license the work to the project under the project's open source license

Author: Matt Wise <mwise@quixey.com>

Closes #6447 from wisematthew/fix-typo-in-java-udf-registration-doc and squashes the following commits:

e7ef5f7 [Matt Wise] Fix typo in documentation for Java UDF registration
2015-05-27 22:39:19 -07:00
Cheolsoo Park 6dd645870d [SPARK-7850][BUILD] Hive 0.12.0 profile in POM should be removed
I grep'ed hive-0.12.0 in the source code and removed all the profiles and doc references.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #6393 from piaozhexiu/SPARK-7850 and squashes the following commits:

fb429ce [Cheolsoo Park] Remove hive-0.13.1 profile
82bf09a [Cheolsoo Park] Remove hive 0.12.0 shim code
f3722da [Cheolsoo Park] Remove hive-0.12.0 profile and references from POM and build docs
2015-05-27 00:18:42 -07:00
Mike Dusenberry 0463428b6e [SPARK-7883] [DOCS] [MLLIB] Fixing broken trainImplicit Scala example in MLlib Collaborative Filtering documentation.
Fixing broken trainImplicit Scala example in MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6422 from dusenberrymw/Fix_MLlib_Collab_Filtering_trainImplicit_Example and squashes the following commits:

36492f4 [Mike Dusenberry] Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures.
2015-05-26 18:08:57 -07:00
Mike Dusenberry e5a63a0e39 [DOCS] [MLLIB] Fixing misformatted links in v1.4 MLlib Naive Bayes documentation by removing space and newline characters.
A couple of links in the MLlib Naive Bayes documentation for v1.4 were broken due to the addition of either space or newline characters between the link title and link URL in the markdown doc.  (Interestingly enough, they are rendered correctly in the GitHub viewer, but not when compiled to HTML by Jekyll.)

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6412 from dusenberrymw/Fix_Broken_Links_In_MLlib_Naive_Bayes_Docs and squashes the following commits:

91a4028 [Mike Dusenberry] Fixing misformatted links by removing space and newline characters.
2015-05-26 17:05:58 +01:00
Calvin Jia ce0051d6f7 [SPARK-6391][DOCS] Document Tachyon compatibility.
Adds a section in the RDD persistence section of the programming-guide docs detailing Spark-Tachyon version compatibility as discussed in [[SPARK-6391]](https://issues.apache.org/jira/browse/SPARK-6391).

Author: Calvin Jia <jia.calvin@gmail.com>

Closes #6382 from calvinjia/spark-6391 and squashes the following commits:

113e863 [Calvin Jia] Move compatibility info to the offheap storage level section.
7942dc5 [Calvin Jia] Add a section in the programming-guide docs for Tachyon compatibility.
2015-05-25 16:50:43 -07:00
Davies Liu 7af3818c6b [SPARK-6806] [SPARKR] [DOCS] Fill in SparkR examples in programming guide
sqlCtx -> sqlContext

You can check the docs by:

```
$ cd docs
$ SKIP_SCALADOC=1 jekyll serve
```
cc shivaram

Author: Davies Liu <davies@databricks.com>

Closes #5442 from davies/r_docs and squashes the following commits:

7a12ec6 [Davies Liu] remove rdd in R docs
8496b26 [Davies Liu] remove the docs related to RDD
e23b9d6 [Davies Liu] delete R docs for RDD API
222e4ff [Davies Liu] Merge branch 'master' into r_docs
89684ce [Davies Liu] Merge branch 'r_docs' of github.com:davies/spark into r_docs
f0a10e1 [Davies Liu] address comments from @shivaram
f61de71 [Davies Liu] Update pairRDD.R
3ef7cf3 [Davies Liu] use + instead of function(a,b) a+b
2f10a77 [Davies Liu] address comments from @cafreeman
9c2a062 [Davies Liu] mention R api together with Python API
23f751a [Davies Liu] Fill in SparkR examples in programming guide
2015-05-23 00:01:40 -07:00
Mike Dusenberry 63a5ce75ea [SPARK-7830] [DOCS] [MLLIB] Adding logistic regression to the list of Multiclass Classification Supported Methods documentation
Added logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6357 from dusenberrymw/Add_LR_To_List_Of_Multiclass_Classification_Methods and squashes the following commits:

7918650 [Mike Dusenberry] Updating broken link due to the "Binary Classification" section on the Linear Methods page being renamed to "Classification".
3005dc2 [Mike Dusenberry] Adding logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
2015-05-22 18:03:12 -07:00
Andrew Or 3d8760d76e [SPARK-7771] [SPARK-7779] Dynamic allocation: lower default timeouts further
The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages.

Author: Andrew Or <andrew@databricks.com>

Closes #6301 from andrewor14/da-minor and squashes the following commits:

6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
2015-05-22 17:37:38 -07:00
Ram Sriharsha 509d55ab41 [SPARK-7574] [ML] [DOC] User guide for OneVsRest
Including Iris Dataset (after shuffling and relabeling 3 -> 0 to confirm to 0 -> numClasses-1 labeling). Could not find an existing dataset in data/mllib for multiclass classification.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6296 from harsha2010/SPARK-7574 and squashes the following commits:

645427c [Ram Sriharsha] cleanup
46c41b1 [Ram Sriharsha] cleanup
2f76295 [Ram Sriharsha] Code Review Fixes
ebdf103 [Ram Sriharsha] Java Example
c026613 [Ram Sriharsha] Code Review fixes
4b7d1a6 [Ram Sriharsha] minor cleanup
13bed9c [Ram Sriharsha] add wikipedia link
bb9dbfa [Ram Sriharsha] Clean up naming
6f90db1 [Ram Sriharsha] [SPARK-7574][ml][doc] User guide for OneVsRest
2015-05-22 13:18:08 -07:00
Joseph K. Bradley 2728c3df66 [SPARK-7578] [ML] [DOC] User guide for spark.ml Normalizer, IDF, StandardScaler
Added user guide sections with code examples.
Also added small Java unit tests to test Java example in guide.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:

cd47f4b [Joseph K. Bradley] Updated based on code review
f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
2015-05-21 22:59:45 -07:00
Mike Dusenberry e4136ea6c4 [DOCS] [MLLIB] Fixing broken link in MLlib Linear Methods documentation.
Just a small change: fixed a broken link in the MLlib Linear Methods documentation by removing a newline character between the link title and link address.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6340 from dusenberrymw/Fix_MLlib_Linear_Methods_link and squashes the following commits:

0a57818 [Mike Dusenberry] Fixing broken link in MLlib Linear Methods documentation.
2015-05-21 19:05:04 -07:00
Joseph K. Bradley 6d75ed7e5c [SPARK-7585] [ML] [DOC] VectorIndexer user guide section
Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:

dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
2015-05-21 13:05:48 -07:00
Xiangrui Meng 13348e21b6 [SPARK-7752] [MLLIB] Use lowercase letters for NaiveBayes.modelType
to be consistent with other string names in MLlib. This PR also updates the implementation to use vals instead of hardcoded strings. jkbradley leahmcguire

Author: Xiangrui Meng <meng@databricks.com>

Closes #6277 from mengxr/SPARK-7752 and squashes the following commits:

f38b662 [Xiangrui Meng] add another case _ back in test
ae5c66a [Xiangrui Meng] model type -> modelType
711d1c6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7752
40ae53e [Xiangrui Meng] fix Java test suite
264a814 [Xiangrui Meng] add case _ back
3c456a8 [Xiangrui Meng] update NB user guide
17bba53 [Xiangrui Meng] update naive Bayes to use lowercase model type strings
2015-05-21 10:30:08 -07:00
Hari Shreedharan a70bf06b79 [SPARK-7750] [WEBUI] Rename endpoints from json to api to allow fu…
…rther extension to non-json outputs too.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6273 from harishreedharan/json-to-api and squashes the following commits:

e14b73b [Hari Shreedharan] Rename `getJsonServlet` to `getServletHandler` i
42f8acb [Hari Shreedharan] Import order fixes.
2ef852f [Hari Shreedharan] [SPARK-7750][WebUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too.
2015-05-20 21:13:10 -05:00
Sandy Ryza 829f1d95ba [SPARK-7579] [ML] [DOC] User guide update for OneHotEncoder
Author: Sandy Ryza <sandy@cloudera.com>

Closes #6126 from sryza/sandy-spark-7579 and squashes the following commits:

5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder
2015-05-20 13:10:30 -07:00
ehnalis 3ddf051ee7 [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
Added faster RM-heartbeats on pending container allocations with multiplicative back-off.
Also updated related documentations.

Author: ehnalis <zoltan.zvara@gmail.com>

Closes #6082 from ehnalis/yarn and squashes the following commits:

a1d2101 [ehnalis] MIss-spell fixed.
90f8ba4 [ehnalis] Changed default HB values.
6120295 [ehnalis] Removed the bug, when allocation heartbeat would not start from initial value.
08bac63 [ehnalis] Refined style, grammar, removed duplicated code.
073d283 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
d4408c9 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
2015-05-20 08:27:39 -05:00
Mike Dusenberry 3860520633 [SPARK-7744] [DOCS] [MLLIB] Distributed matrix" section in MLlib "Data Types" documentation should be reordered.
The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the later three types, and RowMatrix is considered the "basic" distributed matrix.  This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6270 from dusenberrymw/Reorder_MLlib_Data_Types_Distributed_matrix_docs and squashes the following commits:

6313bab [Mike Dusenberry] The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the later three types, and RowMatrix is considered the "basic" distributed matrix.  This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.
2015-05-19 17:18:08 -07:00
Xusen Yin 68fb2a46ed [SPARK-7586] [ML] [DOC] Add docs of Word2Vec in ml package
CC jkbradley.

JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:

77014c5 [Xusen Yin] comment fix
57a4c07 [Xusen Yin] small fix for docs
1178c8f [Xusen Yin] remove the correctness check in java suite
1c3f389 [Xusen Yin] delete sbt commit
1af152b [Xusen Yin] check python example code
1b5369e [Xusen Yin] add docs of word2vec
2015-05-19 13:43:48 -07:00
Dice 32fa611b19 [SPARK-7704] Updating Programming Guides per SPARK-4397
The change per SPARK-4397 makes implicit objects in SparkContext to be found by the compiler automatically. So that we don't need to import the o.a.s.SparkContext._ explicitly any more and can remove some statements around the "implicit conversions" from the latest Programming Guides (1.3.0 and higher)

Author: Dice <poleon.kd@gmail.com>

Closes #6234 from daisukebe/patch-1 and squashes the following commits:

b77ecd9 [Dice] fix a typo
45dfcd3 [Dice] rewording per Sean's advice
a094bcf [Dice] Adding a note for users on any previous releases
a29be5f [Dice] Updating Programming Guides per SPARK-4397
2015-05-19 18:13:09 +01:00
Saleem Ansari df34793ad4 [SPARK-7723] Fix string interpolation in pipeline examples
https://issues.apache.org/jira/browse/SPARK-7723

Author: Saleem Ansari <tuxdna@gmail.com>

Closes #6258 from tuxdna/master and squashes the following commits:

2bb5a42 [Saleem Ansari] Merge branch 'master' into mllib-pipeline
e39db9c [Saleem Ansari] Fix string interpolation in pipeline examples
2015-05-19 10:31:11 +01:00
Mike Dusenberry 61f164d3fd Fixing a few basic typos in the Programming Guide.
Just a few minor fixes in the guide, so a new JIRA issue was not created per the guidelines.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6240 from dusenberrymw/Fix_Programming_Guide_Typos and squashes the following commits:

ffa76eb [Mike Dusenberry] Fixing a few basic typos in the Programming Guide.
2015-05-19 08:59:45 +01:00
Xusen Yin 6008ec14ed [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion
JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).

CC jkbradley

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:

1a7d80d [Xusen Yin] merge with master
892a8e9 [Xusen Yin] fix python 3 compatibility
ec935bf [Xusen Yin] small fix
3e9fa1d [Xusen Yin] delete note
69fcf85 [Xusen Yin] simplify and add python example
81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
40babfb [Xusen Yin] add java test suite for PolynomialExpansion
2015-05-19 00:06:33 -07:00
Vincenzo Selvaggio 814b3dabdf [SPARK-7272] [MLLIB] User guide for PMML model export
https://issues.apache.org/jira/browse/SPARK-7272

Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>

Closes #6219 from selvinsource/mllib_pmml_model_export_SPARK-7272 and squashes the following commits:

c866fb8 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
1beda98 [Vincenzo Selvaggio] [SPARK-7272] Initial user guide for pmml export
d670662 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
2731375 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
680dc33 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
2e298b5 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
a932f51 [Vincenzo Selvaggio] Create mllib-pmml-model-export.md
2015-05-18 08:46:33 -07:00
Sean Owen 1fd33815f4 [SPARK-4556] [BUILD] binary distribution assembly can't run in local mode
Add note on building a runnable distribution with make-distribution.sh

Author: Sean Owen <sowen@cloudera.com>

Closes #6186 from srowen/SPARK-4556 and squashes the following commits:

4002966 [Sean Owen] Add pointer to --help flag
9fa7883 [Sean Owen] Add note on building a runnable distribution with make-distribution.sh
2015-05-16 08:18:41 +01:00
FavioVazquez d41ae4344c [SPARK-7671] Fix wrong URLs in MLlib Data Types Documentation
There is a mistake in the URL of Matrices in the MLlib Data Types documentation (Local matrix scala section), the URL points to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices which is a mistake, since Matrices is an object that implements factory methods for Matrix that does not have a companion class. The correct link should point to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$

There is another mistake, in the Local Vector section in Scala, Java and Python

In the Scala section the URL of Vectors points to the trait Vector (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and not to the factory methods implemented in Vectors.

The correct link should be: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$

In the Java section the URL of Vectors points to the Interface Vector (https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html) and not to the Class Vectors

The correct link should be:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vectors.html

In the Python section the URL of Vectors points to the class Vector (https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vector) and not the Class Vectors

The correct link should be:
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors

Author: FavioVazquez <favio.vazquezp@gmail.com>

Closes #6196 from FavioVazquez/fix-typo-matrices-mllib-datatypes and squashes the following commits:

3e9efd5 [FavioVazquez] - Fixed wrong URLs in the MLlib Data Types Documentation
9af7074 [FavioVazquez] Merge remote-tracking branch 'upstream/master'
edab1ef [FavioVazquez] Merge remote-tracking branch 'upstream/master'
b2e2f8c [FavioVazquez] Merge remote-tracking branch 'upstream/master'
2015-05-16 08:07:03 +01:00
Liang-Chi Hsieh c8696337e2 [SPARK-7556] [ML] [DOC] Add user guide for spark.ml Binarizer, including Scala, Java and Python examples
JIRA: https://issues.apache.org/jira/browse/SPARK-7556

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6116 from viirya/binarizer_doc and squashes the following commits:

40cb677 [Liang-Chi Hsieh] Better print out.
5b7ef1d [Liang-Chi Hsieh] Make examples more clear.
1bf9c09 [Liang-Chi Hsieh] For comments.
6cf8cba [Liang-Chi Hsieh] Add user guide for Binarizer.
2015-05-15 15:05:04 -07:00
FavioVazquez 7fb715de6d [SPARK-7249] Updated Hadoop dependencies due to inconsistency in the versions
Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons.

Changes proposed by vanzin resulting from previous pull-request https://github.com/apache/spark/pull/5783 that did not fixed the problem correctly.

Please let me know if this is the correct way of doing this, the comments of vanzin are in the pull-request mentioned.

Author: FavioVazquez <favio.vazquezp@gmail.com>

Closes #5786 from FavioVazquez/update-hadoop-dependencies and squashes the following commits:

11670e5 [FavioVazquez] - Added missing instance of -Phadoop-2.2 in create-release.sh
379f50d [FavioVazquez] - Added instances of -Phadoop-2.2 in create-release.sh, run-tests, scalastyle and building-spark.md - Reconstructed docs to not ask users to rely on default behavior
3f9249d [FavioVazquez] Merge branch 'master' of https://github.com/apache/spark into update-hadoop-dependencies
31bdafa [FavioVazquez] - Added missing instances in -Phadoop-1 in create-release.sh, run-tests and in the building-spark documentation
cbb93e8 [FavioVazquez] - Added comment related to SPARK-3710 about  hadoop-yarn-server-tests in Hadoop 2.2 that fails to pull some needed dependencies
83dc332 [FavioVazquez] - Cleaned up the main POM concerning the yarn profile - Erased hadoop-2.2 profile from yarn/pom.xml and its content was integrated into yarn/pom.xml
93f7624 [FavioVazquez] - Deleted unnecessary comments and <activation> tag on the YARN profile in the main POM
668d126 [FavioVazquez] - Moved <dependencies> <activation> and <properties> sections of the hadoop-2.2 profile in the YARN POM to the YARN profile in the root POM - Erased unnecessary hadoop-2.2 profile from the YARN POM
fda6a51 [FavioVazquez] - Updated hadoop1 releases in create-release.sh  due to changes in the default hadoop version set - Erased unnecessary instance of -Dyarn.version=2.2.0 in create-release.sh - Prettify comment in yarn/pom.xml
0470587 [FavioVazquez] - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in create-release.sh - Updated how the releases are made in the create-release.sh no that the default hadoop version is the 2.2.0 - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in scalastyle - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in run-tests - Better example given in the hadoop-third-party-distributions.md now that the default hadoop version is 2.2.0
a650779 [FavioVazquez] - Default value of avro.mapred.classifier has been set to hadoop2 in pom.xml - Cleaned up hadoop-2.3 and 2.4 profiles due to change in the default set in avro.mapred.classifier in pom.xml
199f40b [FavioVazquez] - Erased unnecessary CDH5-specific note in docs/building-spark.md - Remove example of instance -Phadoop-2.2 -Dhadoop.version=2.2.0 in docs/building-spark.md - Enabled hadoop-2.2 profile when the Hadoop version is 2.2.0, which is now the default .Added comment in the yarn/pom.xml to specify that.
88a8b88 [FavioVazquez] - Simplified Hadoop profiles due to new setting of global properties in the pom.xml file - Added comment to specify that the hadoop-2.2 profile is now the default hadoop profile in the pom.xml file - Erased hadoop-2.2 from related hadoop profiles now that is a no-op in the make-distribution.sh file
70b8344 [FavioVazquez] - Fixed typo in the make-distribution.sh file and added hadoop-1 in the Related profiles
287fa2f [FavioVazquez] - Updated documentation about specifying the hadoop version in building-spark. Now is clear that Spark will build against Hadoop 2.2.0 by default. - Added Cloudera CDH 5.3.3 without MapReduce example in the building-spark doc.
1354292 [FavioVazquez] - Fixed hadoop-1 version to match jenkins build profile in hadoop1.0 tests and documentation
6b4bfaf [FavioVazquez] - Cleanup in hadoop-2.x profiles since they contained mostly redundant stuff.
7e9955d [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
660decc [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
ec91ce3 [FavioVazquez] - Updated protobuf-java version of com.google.protobuf dependancy to fix blocking error when connecting to HDFS via the Hadoop Cloudera HDFS CDH5 (fix for 2.5.0-cdh5.3.3 version)
2015-05-14 15:22:58 +01:00
Joseph K. Bradley f0c1bc3472 [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer
Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.

I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:

d5d213f [Joseph K. Bradley] small fix
dd6e91a [Joseph K. Bradley] fixes from code review of user guide
33c3ff9 [Joseph K. Bradley] small fix
bc6058c [Joseph K. Bradley] fix link
361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide
2015-05-12 16:39:56 -07:00
Yuhao Yang 1d703660d4 [SPARK-7496] [MLLIB] Update Programming guide with Online LDA
jira: https://issues.apache.org/jira/browse/SPARK-7496

Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6046 from hhbyyh/ldaDocument and squashes the following commits:

4b6fbfa [Yuhao Yang] add online paper and some comparison
fd4c983 [Yuhao Yang] update lda document for optimizers
2015-05-12 15:12:29 -07:00
vidmantas zemleris 640f63b959 [SPARK-6994][SQL] Update docs for fetching Row fields by name
add docs for https://issues.apache.org/jira/browse/SPARK-6994

Author: vidmantas zemleris <vidmantas@vinted.com>

Closes #6030 from vidma/docs/row-with-named-fields and squashes the following commits:

241b401 [vidmantas zemleris] [SPARK-6994][SQL] Update docs for fetching Row fields by name
2015-05-11 22:29:24 -07:00
Reynold Xin 3a9b6997df [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
Author: Reynold Xin <rxin@databricks.com>

Closes #6062 from rxin/agg-retain-doc and squashes the following commits:

43e511e [Reynold Xin] [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
2015-05-11 18:07:12 -07:00
gchen 8e674331d9 [SPARK-7516] [Minor] [DOC] Replace depreciated inferSchema() with createDataFrame()
JIRA: https://issues.apache.org/jira/browse/SPARK-7516

In sql-programming-guide, deprecated python data frame api inferSchema() should be replaced by createDataFrame():

schemaPeople = sqlContext.inferSchema(people) ->
schemaPeople = sqlContext.createDataFrame(people)

Author: gchen <chenguancheng@gmail.com>

Closes #6041 from gchen/python-docs and squashes the following commits:

c27eb7c [gchen] replace inferSchema() with createDataFrame()
2015-05-11 14:37:18 -07:00
Kousuke Saruta 6e9910c21a [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
Now PySpark on YARN with cluster mode is supported so let's update doc.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:

ad9f88c [Kousuke Saruta] Brushed up sentences
469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode
2015-05-11 14:19:11 -07:00
Sandy Ryza 82fee9d9aa [SPARK-6470] [YARN] Add support for YARN node labels.
This is difficult to write a test for because it relies on the latest version of YARN, but I verified manually that the patch does pass along the label expression on this version and containers are successfully launched.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5242 from sryza/sandy-spark-6470 and squashes the following commits:

6af87b9 [Sandy Ryza] Change info to warning
6e22d99 [Sandy Ryza] [YARN] SPARK-6470.  Add support for YARN node labels.
2015-05-11 12:09:39 -07:00
Kirill A. Korinskiy 8c07c75c98 [SPARK-5521] PCA wrapper for easy transform vectors
I implement a simple PCA wrapper for easy transform of vectors by PCA for example LabeledPoint or another complicated structure.

Example of usage:
```
  import org.apache.spark.mllib.regression.LinearRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.feature.PCA

  val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }.cache()

  val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
  val training = splits(0).cache()
  val test = splits(1)

  val pca = PCA.create(training.first().features.size/2, data.map(_.features))
  val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
  val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

  val numIterations = 100
  val model = LinearRegressionWithSGD.train(training, numIterations)
  val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

  val valuesAndPreds = test.map { point =>
    val score = model.predict(point.features)
    (score, point.label)
  }

  val valuesAndPreds_pca = test_pca.map { point =>
    val score = model_pca.predict(point.features)
    (score, point.label)
  }

  val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
  val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()

  println("Mean Squared Error = " + MSE)
  println("PCA Mean Squared Error = " + MSE_pca)
```

Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4304 from catap/pca and squashes the following commits:

501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
2015-05-10 13:34:00 -07:00
dobashim 7d0f17208c [STREAMING] [DOCS] Fix wrong url about API docs of StreamingListener
A little fix about wrong url of the API document. (org.apache.spark.streaming.scheduler.StreamingListener)

Author: dobashim <dobashim@oss.nttdata.co.jp>

Closes #6024 from dobashim/master and squashes the following commits:

ac9a955 [dobashim] [STREAMING][DOCS] Fix wrong url about API docs of StreamingListener
2015-05-09 10:14:46 +01:00
Imran Rashid c796be70f3 [SPARK-3454] separate json endpoints for data in the UI
Exposes data available in the UI as json over http.  Key points:

* new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.

Author: Imran Rashid <irashid@cloudera.com>

Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:

1a72ed6 [Imran Rashid] rats
85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
4e12013 [Imran Rashid] just use test case name for expectation file name
863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php
2015-05-08 16:54:32 +01:00
Octavian Geagla 658a478d3f [SPARK-5726] [MLLIB] Elementwise (Hadamard) Vector Product Transformer
See https://issues.apache.org/jira/browse/SPARK-5726

Author: Octavian Geagla <ogeagla@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits:

fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc.
90f7e39 [Joseph K. Bradley] small cleanups
4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case.
ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback.
1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF'
cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct
4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer
2015-05-07 14:49:55 -07:00