Commit graph

1125 commits

Author SHA1 Message Date
Sandy Ryza 2d87a415f2 SPARK-3642. Document the nuances of shared variables.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #2490 from sryza/sandy-spark-3642 and squashes the following commits:

aae3340 [Sandy Ryza] SPARK-3642. Document the nuances of broadcast variables
2015-03-11 13:22:05 +00:00
Ilya Ganelin 548643a9e4 [SPARK-4423] Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
Hi all - I've added a writeup on how closures work within Spark to help clarify the general case for this problem and similar problems. I hope this addresses the issue and would love any feedback.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4696 from ilganeli/SPARK-4423 and squashes the following commits:

c5dc498 [Ilya Ganelin] Fixed typo
07b78e8 [Ilya Ganelin] Updated to fix capitalization
48c1983 [Ilya Ganelin] Updated to fix capitalization and clarify wording
2fd2a07 [Ilya Ganelin] Incorporated a few more minor fixes. Fixed a bug in python code. Added semicolons for java
4772f99 [Ilya Ganelin] Incorporated latest feedback
448bd79 [Ilya Ganelin] Updated some verbiage and added section links
5dbbda5 [Ilya Ganelin] Improved some wording
d374d3a [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4423
2600668 [Ilya Ganelin] Minor edits
c768ab2 [Ilya Ganelin] Updated documentation to add a section on closures. This helps explain the confusing behavior of the foreach and map functions when attempting to modify variables outside the scope of an RDD action or transformation
2015-03-11 13:20:15 +00:00
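For illustration, a minimal sketch of the closure pitfall this writeup documents (variable names are hypothetical; assumes an active SparkContext `sc`). In cluster mode each executor mutates its own copy of the closed-over variable, so the driver-side value never changes; an accumulator is the supported alternative:

```scala
val data = sc.parallelize(1 to 100)

// Broken in cluster mode: each executor increments its own copy of `counter`.
var counter = 0
data.foreach(x => counter += x)
println(counter) // still 0 on the driver

// Correct: accumulators aggregate updates back to the driver (Spark 1.x API).
val acc = sc.accumulator(0)
data.foreach(x => acc += x)
println(acc.value) // 5050
```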
Sean Owen 35b25640a4 [MINOR] [DOCS] Fix map -> mapToPair in Streaming Java example
Fix map -> mapToPair in Java example. (And zap some unneeded "throws Exception" while here)

Author: Sean Owen <sowen@cloudera.com>

Closes #4967 from srowen/MapToPairFix and squashes the following commits:

ded2bc0 [Sean Owen] Fix map -> mapToPair in Java example. (And zap some unneeded "throws Exception" while here)
2015-03-11 12:16:32 +00:00
Marcelo Vanzin 517975d89d [SPARK-4924] Add a library for launching Spark jobs programmatically.
This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other applications.

The overall goal of this change is twofold, as described in the bug:

- Provide a public API for launching Spark processes. This is a common request
  from users and currently there's no good answer for it.

- Remove a lot of the duplicated code and other coupling that exists in the
  different parts of Spark that deal with launching processes.

A lot of the duplication was due to different code needed to build an
application's classpath (and the bootstrapper needed to run the driver in
certain situations), and also different code needed to parse spark-submit
command line options in different contexts. The change centralizes those
as much as possible so that all code paths can rely on the library for
handling those appropriately.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3916 from vanzin/SPARK-4924 and squashes the following commits:

18c7e4d [Marcelo Vanzin] Fix make-distribution.sh.
2ce741f [Marcelo Vanzin] Add lots of quotes.
3b28a75 [Marcelo Vanzin] Update new pom.
a1b8af1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
897141f [Marcelo Vanzin] Review feedback.
e2367d2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
28cd35e [Marcelo Vanzin] Remove stale comment.
b1d86b0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
00505f9 [Marcelo Vanzin] Add blurb about new API in the programming guide.
5f4ddcc [Marcelo Vanzin] Better usage messages.
92a9cfb [Marcelo Vanzin] Fix Win32 launcher, usage.
6184c07 [Marcelo Vanzin] Rename field.
4c19196 [Marcelo Vanzin] Update comment.
7e66c18 [Marcelo Vanzin] Fix pyspark tests.
0031a8e [Marcelo Vanzin] Review feedback.
c12d84b [Marcelo Vanzin] Review feedback. And fix spark-submit on Windows.
e2d4d71 [Marcelo Vanzin] Simplify some code used to launch pyspark.
43008a7 [Marcelo Vanzin] Don't make builder extend SparkLauncher.
b4d6912 [Marcelo Vanzin] Use spark-submit script in SparkLauncher.
28b1434 [Marcelo Vanzin] Add a comment.
304333a [Marcelo Vanzin] Fix propagation of properties file arg.
bb67b93 [Marcelo Vanzin] Remove unrelated Yarn change (that is also wrong).
8ec0243 [Marcelo Vanzin] Add missing newline.
95ddfa8 [Marcelo Vanzin] Fix handling of --help for spark-class command builder.
72da7ec [Marcelo Vanzin] Rename SparkClassLauncher.
62978e4 [Marcelo Vanzin] Minor cleanup of Windows code path.
9cd5b44 [Marcelo Vanzin] Make all non-public APIs package-private.
e4c80b6 [Marcelo Vanzin] Reorganize the code so that only SparkLauncher is public.
e50dc5e [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
de81da2 [Marcelo Vanzin] Fix CommandUtils.
86a87bf [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
2061967 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
46d46da [Marcelo Vanzin] Clean up a test and make it more future-proof.
b93692a [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
ad03c48 [Marcelo Vanzin] Revert "Fix a thread-safety issue in "local" mode."
0b509d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
23aa2a9 [Marcelo Vanzin] Read java-opts from conf dir, not spark home.
7cff919 [Marcelo Vanzin] Javadoc updates.
eae4d8e [Marcelo Vanzin] Fix new unit tests on Windows.
e570fb5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
44cd5f7 [Marcelo Vanzin] Add package-info.java, clean up javadocs.
f7cacff [Marcelo Vanzin] Remove "launch Spark in new thread" feature.
7ed8859 [Marcelo Vanzin] Some more feedback.
54cd4fd [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
61919df [Marcelo Vanzin] Clean leftover debug statement.
aae5897 [Marcelo Vanzin] Use launcher classes instead of jars in non-release mode.
e584fc3 [Marcelo Vanzin] Rework command building a little bit.
525ef5b [Marcelo Vanzin] Rework Unix spark-class to handle argument with newlines.
8ac4e92 [Marcelo Vanzin] Minor test cleanup.
e946a99 [Marcelo Vanzin] Merge PySparkLauncher into SparkSubmitCliLauncher.
c617539 [Marcelo Vanzin] Review feedback round 1.
fc6a3e2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
f26556b [Marcelo Vanzin] Fix a thread-safety issue in "local" mode.
2f4e8b4 [Marcelo Vanzin] Changes needed to make this work with SPARK-4048.
799fc20 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
bb5d324 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
53faef1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
a7936ef [Marcelo Vanzin] Fix pyspark tests.
656374e [Marcelo Vanzin] Mima fixes.
4d511e7 [Marcelo Vanzin] Fix tools search code.
7a01e4a [Marcelo Vanzin] Fix pyspark on Yarn.
1b3f6e9 [Marcelo Vanzin] Call SparkSubmit from spark-class launcher for unknown classes.
25c5ae6 [Marcelo Vanzin] Centralize SparkSubmit command line parsing.
27be98a [Marcelo Vanzin] Modify Spark to use launcher lib.
6f70eea [Marcelo Vanzin] [SPARK-4924] Add a library for launching Spark jobs programatically.
2015-03-11 01:03:01 -07:00
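A minimal sketch of launching an application through the new library (the jar path and class names are hypothetical). `SparkLauncher` is the single public entry point, and `launch()` starts a child spark-submit process:

```scala
import org.apache.spark.launcher.SparkLauncher

val spark: Process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")      // hypothetical application jar
  .setMainClass("com.example.MyApp")          // hypothetical main class
  .setMaster("local[2]")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .launch()      // starts a child spark-submit process
spark.waitFor()  // block until the application exits
```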
Michael Armbrust 2672374110 [SPARK-5183][SQL] Update SQL Docs with JDBC and Migration Guide
Author: Michael Armbrust <michael@databricks.com>

Closes #4958 from marmbrus/sqlDocs and squashes the following commits:

9351dbc [Michael Armbrust] fix parquet example
6877e13 [Michael Armbrust] add sql examples
d81b7e7 [Michael Armbrust] rxins comments
e393528 [Michael Armbrust] fix order
19c2735 [Michael Armbrust] more on data source load/store
00d5914 [Michael Armbrust] Update SQL Docs with JDBC and Migration Guide
2015-03-10 18:13:09 -07:00
Reynold Xin 3cac1991a1 [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
Author: Reynold Xin <rxin@databricks.com>

Closes #4954 from rxin/df-docs and squashes the following commits:

c592c70 [Reynold Xin] [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
2015-03-09 16:16:16 -07:00
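For context, a minimal sketch of the DataFrame API the updated guide describes (Spark 1.3 style; the JSON input file is hypothetical and `sc` is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Hypothetical input: one JSON record per line with name/age fields.
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.printSchema()
df.select("name").show()
df.filter(df("age") > 21).show()
df.groupBy("age").count().show()
```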
RobertZK 48a723c986 Fix python typo (+ Scala, Java typos)
Author: RobertZK <technoguyrob@gmail.com>
Author: Robert Krzyzanowski <technoguyrob@gmail.com>

Closes #4840 from robertzk/patch-1 and squashes the following commits:

d286215 [RobertZK] lambda fix per @laserson
5937989 [Robert Krzyzanowski] Fix python typo
2015-03-07 00:39:24 +00:00
tedyu 8d3e2414d4 SPARK-6085 Increase default value for memory overhead
Author: tedyu <yuzhihong@gmail.com>

Closes #4836 from tedyu/master and squashes the following commits:

d65b495 [tedyu] SPARK-6085 Increase default value for memory overhead
1fdd4df [tedyu] SPARK-6085 Increase default value for memory overhead
2015-03-04 11:00:52 +00:00
Sean Owen e750a6bfdd SPARK-1911 [DOCS] Warn users if their assembly jars are not built with Java 6
Add warning about building with Java 7+ and running the JAR on early Java 6.

CC andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4874 from srowen/SPARK-1911 and squashes the following commits:

79fa2f6 [Sean Owen] Add warning about building with Java 7+ and running the JAR on early Java 6.
2015-03-03 13:40:11 -08:00
DB Tsai b196056190 [SPARK-5537][MLlib][Docs] Add user guide for multinomial logistic regression
Adding more description on top of #4861.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #4866 from dbtsai/doc and squashes the following commits:

37e9d07 [DB Tsai] doc
2015-03-02 22:37:12 -08:00
Xiangrui Meng 7e53a79c30 [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlib
Similar to `MatrixFactorizationModel`, we only need wrappers to support save/load for tree models in Python.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:

4586a4d [Xiangrui Meng] fix more typos
8ebcac2 [Xiangrui Meng] fix python style
91172d8 [Xiangrui Meng] fix typos
201b3b9 [Xiangrui Meng] update user guide
b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
2015-03-02 22:27:01 -08:00
Xiangrui Meng 9d6c5aeebd [SPARK-5537] Add user guide for multinomial logistic regression
This is based on #4801 from dbtsai. The linear method guide is re-organized a little bit for this change.

Closes #4801

Author: Xiangrui Meng <meng@databricks.com>
Author: DB Tsai <dbtsai@alpinenow.com>

Closes #4861 from mengxr/SPARK-5537 and squashes the following commits:

47af0ac [Xiangrui Meng] update user guide for multinomial logistic regression
cdc2e15 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into AlpineNow-mlor-doc
096d0ca [DB Tsai] first commit
2015-03-02 18:10:50 -08:00
Andrew Or 258d154c9f [SPARK-6048] SparkConf should not translate deprecated configs on set
There are multiple issues with translating on set outlined in the JIRA.

This PR reverts the translation logic added to `SparkConf`. In the future, after the 1.3.0 release we will figure out a way to reorganize the internal structure more elegantly. For now, let's preserve the existing semantics of `SparkConf` since it's a public interface. Unfortunately this means duplicating some code for now, but this is all internal and we can always clean it up later.

Author: Andrew Or <andrew@databricks.com>

Closes #4799 from andrewor14/conf-set-translate and squashes the following commits:

11c525b [Andrew Or] Move warning to driver
10e77b5 [Andrew Or] Add documentation for deprecation precedence
a369cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into conf-set-translate
c26a9e3 [Andrew Or] Revert all translate logic in SparkConf
fef6c9c [Andrew Or] Restore deprecation logic for spark.executor.userClassPathFirst
94b4dfa [Andrew Or] Translate on get, not set
2015-03-02 16:36:42 -08:00
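A sketch of the intended semantics, assuming the "translate on get, not set" behavior described in the squash list (key names taken from the commits; this is an illustration, not the authoritative implementation):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// The deprecated key is stored verbatim; nothing is rewritten on set.
conf.set("spark.files.userClassPathFirst", "true")
// Assumed: reads of the new key fall back to the deprecated alias on get.
println(conf.get("spark.executor.userClassPathFirst", "false")) // "true"
```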
Sean Owen 0b472f60cd SPARK-5390 [DOCS] Encourage users to post on Stack Overflow in Community Docs
Point "Community" to main Spark Community page; mention SO tag apache-spark.

Separately, the Apache site can be updated to mention, under Mailing Lists:
"StackOverflow also has an apache-spark tag for Spark Q&A." or similar.

Author: Sean Owen <sowen@cloudera.com>

Closes #4843 from srowen/SPARK-5390 and squashes the following commits:

3508ac6 [Sean Owen] Point "Community" to main Spark Community page; mention SO tag apache-spark
2015-03-02 21:10:08 +00:00
DEBORAH SIEGEL e7d8ae444f aggregateMessages example in graphX doc
The examples illustrating the difference between legacy mapReduceTriplets usage and aggregateMessages usage had type issues in the reduce for both operators.

Since this is just an example, I changed it to reduce the message String by concatenation, although that is non-optimal for performance.

Author: DEBORAH SIEGEL <deborahsiegel@DEBORAHs-MacBook-Pro.local>

Closes #4853 from d3borah/master and squashes the following commits:

db54173 [DEBORAH SIEGEL] fixed aggregateMessages example in graphX doc
2015-03-02 10:15:32 -08:00
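A minimal sketch of the concatenation-based reduce the fixed example uses (graph contents are hypothetical; assumes an active SparkContext `sc`):

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
val graph = Graph(vertices, edges)

// Collect source names at each destination, reducing by String concatenation
// (fine for an example, though non-optimal for performance).
val names: VertexRDD[String] = graph.aggregateMessages[String](
  triplet => triplet.sendToDst(triplet.srcAttr),
  (a, b) => a + " " + b)
names.collect().foreach(println) // e.g. (3, "alice bob")
```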
MechCoder 3f00bb3ef1 [SPARK-6083] [MLLib] [DOC] Make Python API example consistent in NaiveBayes
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4834 from MechCoder/spark-6083 and squashes the following commits:

1cdd7b5 [MechCoder] Add parse function
65bbbe9 [MechCoder] [SPARK-6083] Make Python API example consistent in NaiveBayes
2015-03-01 16:28:15 -08:00
Xiangrui Meng aedbbaa3dd [SPARK-6053][MLLIB] support save/load in PySpark's ALS
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:

f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
2015-03-01 16:26:57 -08:00
Joseph K. Bradley d17cb2ba33 [SPARK-4587] [mllib] [docs] Fixed save,load calls in ML guide examples
Should pass spark context to save/load

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4816 from jkbradley/ml-io-doc-fix and squashes the following commits:

83d369d [Joseph K. Bradley] added comment to save,load parts of ML guide examples
2841170 [Joseph K. Bradley] Fixed save,load calls in ML guide examples
2015-02-27 13:00:36 -08:00
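The corrected call shape, sketched in Scala (the ratings and the path are hypothetical; assumes an active SparkContext `sc`): both save and load take the SparkContext explicitly.

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
val model = ALS.train(ratings, 2, 5) // rank = 2, iterations = 5

// Pass the SparkContext to save/load, as the fixed examples now do.
model.save(sc, "target/tmp/myALSModel")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myALSModel")
```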
许鹏 0375a413b8 fix spark-6033, clarify the spark.worker.cleanup behavior in standalone mode
jira case spark-6033 https://issues.apache.org/jira/browse/SPARK-6033

In standalone deploy mode, the cleanup will only remove the stopped application's directories.

The original description about the cleanup behavior is incorrect.

Author: 许鹏 <peng.xu@fraudmetrix.cn>

Closes #4803 from hseagle/spark-6033 and squashes the following commits:

927a6a0 [许鹏] fix the incorrect description about the spark.worker.cleanup in standalone mode
2015-02-26 23:06:34 -08:00
moussa taifi c871e2dae0 Add a note for context termination for History server on Yarn
The history server on YARN only shows completed jobs. This adds a note concerning the explicit context termination needed at the end of a Spark job, which is a best practice anyway.
Related to SPARK-2972 and SPARK-3458

Author: moussa taifi <moutai10@gmail.com>

Closes #4721 from moutai/add-history-server-note-for-closing-the-spark-context and squashes the following commits:

9f5b6c3 [moussa taifi] Fix upper case typo for YARN
3ad3db4 [moussa taifi] Add context termination for History server on Yarn
2015-02-26 14:20:30 -08:00
xukun 00228947 8942b522d8 [SPARK-3562]Periodic cleanup event logs
Author: xukun 00228947 <xukun.xu@huawei.com>

Closes #4214 from viper-kun/cleaneventlog and squashes the following commits:

7a5b9c5 [xukun 00228947] fix issue
31674ee [xukun 00228947] fix issue
6e3d06b [xukun 00228947] fix issue
373f3b9 [xukun 00228947] fix issue
71782b5 [xukun 00228947] fix issue
5b45035 [xukun 00228947] fix issue
70c28d6 [xukun 00228947] fix issues
adcfe86 [xukun 00228947] Periodic cleanup event logs
2015-02-26 13:24:00 -08:00
Li Zhihui 10094a523e Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs.
The configuration is not currently supported in Mesos mode.
See https://github.com/apache/spark/pull/1462

Author: Li Zhihui <zhihui.li@intel.com>

Closes #4781 from li-zhihui/fixdocconf and squashes the following commits:

63e7a44 [Li Zhihui] Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs.
2015-02-26 13:07:49 -08:00
Joseph K. Bradley d20559b157 [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT
* Add GradientBoostedTrees Python examples to ML guide
  * I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:

c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
2015-02-25 16:13:17 -08:00
Brennon York 46a044a36a [SPARK-1182][Docs] Sort the configuration parameters in configuration.md
Sorts all configuration options present on the `configuration.md` page to ease readability.

Author: Brennon York <brennon.york@capitalone.com>

Closes #3863 from brennonyork/SPARK-1182 and squashes the following commits:

5696f21 [Brennon York] fixed merge conflict with port comments
81a7b10 [Brennon York] capitalized A in Allocation
e240486 [Brennon York] moved all spark.mesos properties into the running-on-mesos doc
7de5f75 [Brennon York] moved serialization from application to compression and serialization section
a16fec0 [Brennon York] moved shuffle settings from network to shuffle
f8fa286 [Brennon York] sorted encryption category
1023f15 [Brennon York] moved initialExecutors
e9d62aa [Brennon York] fixed akka.heartbeat.interval
25e6f6f [Brennon York] moved spark.executer.user*
4625ade [Brennon York] added spark.executor.extra* items
4ee5648 [Brennon York] fixed merge conflicts
1b49234 [Brennon York] sorting mishap
2b5758b [Brennon York] sorting mishap
6fbdf42 [Brennon York] sorting mishap
55dc6f8 [Brennon York] sorted security
ec34294 [Brennon York] sorted dynamic allocation
2a7c4a3 [Brennon York] sorted scheduling
aa9acdc [Brennon York] sorted networking
a4380b8 [Brennon York] sorted execution behavior
27f3919 [Brennon York] sorted compression and serialization
80a5bbb [Brennon York] sorted spark ui
3f32e5b [Brennon York] sorted shuffle behavior
6c51b38 [Brennon York] sorted runtime environment
efe9d6f [Brennon York] sorted application properties
2015-02-25 16:12:56 -08:00
Sean Owen 7d8e6a2e44 SPARK-5930 [DOCS] Documented default of spark.shuffle.io.retryWait is confusing
Clarify default max wait in spark.shuffle.io.retryWait docs

CC andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4769 from srowen/SPARK-5930 and squashes the following commits:

ae2792b [Sean Owen] Clarify default max wait in spark.shuffle.io.retryWait docs
2015-02-25 12:20:44 -08:00
Benedikt Linse 5b8480e035 [GraphX] fixing 3 typos in the graphx programming guide
Corrected 3 typos in the GraphX programming guide. I hope this is the correct way to contribute.

Author: Benedikt Linse <benedikt.linse@gmail.com>

Closes #4766 from 1123/master and squashes the following commits:

8a63812 [Benedikt Linse] fixing 3 typos in the graphx programming guide
2015-02-25 14:46:17 +00:00
Davies Liu d641fbb39c [SPARK-5994] [SQL] Python DataFrame documentation fixes
- select empty should NOT be the same as select; make sure selectExpr behaves the same
- join param documentation
- link to source doesn't work in the jekyll-generated file
- cross-reference of columns (i.e. enabling linking)
- show(): move the df example before df.show()
- move tests in SQLContext out of the docstring, otherwise the doc is too long
- Column.desc and .asc don't have any documentation
- in documentation, sort functions.*

Author: Davies Liu <davies@databricks.com>

Closes #4756 from davies/df_docs and squashes the following commits:

f30502c [Davies Liu] fix doc
32f0d46 [Davies Liu] fix DataFrame docs
2015-02-24 20:51:55 -08:00
MechCoder 2a0fe34891 [SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation
One can stop early if the decrease in error rate is less than a certain tolerance, or if the error increases because the training data is overfit.

This introduces a new method, runWithValidation, which takes in a pair of RDDs: one for the training data and the other for the validation data.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4677 from MechCoder/spark-5436 and squashes the following commits:

1bb21d4 [MechCoder] Combine regression and classification tests into a single one
e4d799b [MechCoder] Addresses indentation and doc comments
b48a70f [MechCoder] COSMIT
b928a19 [MechCoder] Move validation while training section under usage tips
fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
55e5c3b [MechCoder] One liner for prevValidateError
3e74372 [MechCoder] TST: Add test for classification
77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
2015-02-24 15:13:22 -08:00
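A minimal sketch of the new entry point (the tiny dataset is hypothetical; assumes an active SparkContext `sc`):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0)), LabeledPoint(1.0, Vectors.dense(1.0)),
  LabeledPoint(0.0, Vectors.dense(0.1)), LabeledPoint(1.0, Vectors.dense(0.9))))
val Array(training, validation) = data.randomSplit(Array(0.75, 0.25))

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 50

// Boosting stops early once the validation error stops improving.
val model = new GradientBoostedTrees(boostingStrategy)
  .runWithValidation(training, validation)
```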
Judy c5ba975ee8 [Spark-5708] Add Slf4jSink to Spark Metrics
Add Slf4jSink to Spark Metrics using Coda Hale's Slf4jReporter.
This sends metrics to log4j, allowing Spark users to reuse the log4j pipeline for metrics collection.

Reviewed existing unit tests and didn't see any sink-related tests. Please advise on whether tests should be added.

Author: Judy <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>

Closes #4644 from judynash/master and squashes the following commits:

57ef214 [judynash] doc clarification and indent fixes
a751a66 [Judy] Spark-5708: Add Slf4jSink to Spark Metrics
2015-02-24 20:50:16 +00:00
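A hedged sketch of wiring the new sink up in conf/metrics.properties (the period and unit values are illustrative, following the pattern of the other bundled sinks):

```
# Route all instances' metrics into the SLF4J/log4j pipeline.
*.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink
*.sink.slf4j.period=1
*.sink.slf4j.unit=minutes
```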
Xiangrui Meng 105791e35c [MLLIB] Change x_i to y_i in Variance's user guide
Variance is calculated on labels/responses.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4740 from mengxr/patch-1 and squashes the following commits:

673317b [Xiangrui Meng] [MLLIB] Change x_i to y_i in Variance's user guide
2015-02-24 11:38:59 -08:00
Xiangrui Meng cf2e41653d [SPARK-5958][MLLIB][DOC] update block matrix user guide
* Removed SVD code from examples.
* Corrected Java API doc link.
* Updated variable names: `AtransposeA` -> `ata`.
* Minor changes.

brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #4737 from mengxr/update-block-matrix-user-guide and squashes the following commits:

70f53ac [Xiangrui Meng] update block matrix user guide
2015-02-23 22:08:44 -08:00
Joseph K. Bradley 59536cc87e [SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs
Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity

CC: avulanov  I noticed a typo post-hoc and ended up making a few small edits.  Do the changes look OK?

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4732 from jkbradley/chisqselector-docs and squashes the following commits:

9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs
2015-02-23 16:15:57 -08:00
Alexander Ulanov 28ccf5ee76 [MLLIB] SPARK-5912 Programming guide for feature selection
Added description of ChiSqSelector and few words about feature selection in general. I could add a code example, however it would not look reasonable in the absence of feature discretizer or a dataset in the `data` folder that has redundant features.

Author: Alexander Ulanov <nashb@yandex.ru>

Closes #4709 from avulanov/SPARK-5912 and squashes the following commits:

19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
eb6b9fe [Alexander Ulanov] Typo
2921a1d [Alexander Ulanov] ChiSqSelector example of use
c845350 [Alexander Ulanov] ChiSqSelector docs
2015-02-23 12:09:40 -08:00
CodingCat 242d49584c [SPARK-5724] fix the misconfiguration in AkkaUtils
https://issues.apache.org/jira/browse/SPARK-5724

In AkkaUtils, we set several failure-detector-related parameters as follows:

```
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
      .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
      s"""
      |akka.daemonic = on
      |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
      |akka.stdout-loglevel = "ERROR"
      |akka.jvm-exit-on-fatal-error = off
      |akka.remote.require-cookie = "$requireCookie"
      |akka.remote.secure-cookie = "$secureCookie"
      |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
      |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
      |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
      |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
      |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
      |akka.remote.netty.tcp.hostname = "$host"
      |akka.remote.netty.tcp.port = $port
      |akka.remote.netty.tcp.tcp-nodelay = on
      |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
      |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
      |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
      |akka.actor.default-dispatcher.throughput = $akkaBatchSize
      |akka.log-config-on-start = $logAkkaConfig
      |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
      |akka.log-dead-letters = $lifecycleEvents
      |akka.log-dead-letters-during-shutdown = $lifecycleEvents
      """.stripMargin))

```

Actually, there is no parameter named "akka.remote.transport-failure-detector.threshold"
(see http://doc.akka.io/docs/akka/2.3.4/general/configuration.html);
what we have is "akka.remote.watch-failure-detector.threshold"

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits:

bafe56e [CodingCat] fix the grammar in configuration doc
338296e [CodingCat] remove failure-detector related info
8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils
2015-02-23 11:29:25 +00:00
Alexander a7f9039025 [DOCS] Fix typo in API for custom InputFormats based on the “new” MapReduce API
This looks like a simple typo: ```SparkContext.newHadoopRDD``` instead of ```SparkContext.newAPIHadoopRDD```, per the actual API at http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.SparkContext

Author: Alexander <abezzubov@nflabs.com>

Closes #4718 from bzz/hadoop-InputFormats-doc-fix and squashes the following commits:

680a4c4 [Alexander] Fix typo in docs on custom Hadoop InputFormats
2015-02-22 08:53:05 +00:00
Joseph K. Bradley 4a17eedb16 [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.

For SPARK-5892:
* Fix Python docs
* Various other cleanups

BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]

CC: mengxr  (ML),  davies  (Python docs)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:

f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
2015-02-20 02:31:32 -08:00
Xiangrui Meng 0cfd2cebde [SPARK-5900][MLLIB] make PIC and FPGrowth Java-friendly
In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Int)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.

Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`, it doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.

I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:

865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
2015-02-19 18:06:16 -08:00
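A minimal sketch of the new PIC result type from Scala (the similarity triples are hypothetical; assumes an active SparkContext `sc`). Assignments are now objects with named fields rather than raw tuples:

```scala
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Hypothetical pairwise similarities: (srcId, dstId, similarity).
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)))

val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(10)
  .run(similarities)

model.assignments.collect().foreach { a =>
  println(s"vertex ${a.id} -> cluster ${a.cluster}")
}
```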
Ilya Ganelin 6bddc40353 SPARK-5570: No docs stating that `new SparkConf().set("spark.driver.memory", ...)` will not work
I've updated documentation to reflect true behavior of this setting in client vs. cluster mode.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4665 from ilganeli/SPARK-5570 and squashes the following commits:

5d1c8dd [Ilya Ganelin] Added example configuration code
a51700a [Ilya Ganelin] Getting rid of extra spaces
85f7a08 [Ilya Ganelin] Reworded note
5889d43 [Ilya Ganelin] Formatting adjustment
f149ba1 [Ilya Ganelin] Minor updates
1fec7a5 [Ilya Ganelin] Updated to add clarification for other driver properties
db47595 [Ilya Ganelin] Slight formatting update
c899564 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5570
17b751d [Ilya Ganelin] Updated documentation for driver-memory to reflect its true behavior in client vs cluster mode
2015-02-19 15:53:20 -08:00
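The documented workarounds, sketched briefly (class and jar names are hypothetical): in client mode the driver JVM is already running when `SparkConf` is read, so driver memory must be set outside the application code.

```
# Either on the command line:
./bin/spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar

# or in conf/spark-defaults.conf:
spark.driver.memory 4g
```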
Xiangrui Meng d12d2ad76e [SPARK-5879][MLLIB] update PIC user guide and add a Java example
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:

847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
2015-02-18 16:29:32 -08:00
Burak Yavuz a8eb92dcb9 [SPARK-5507] Added documentation for BlockMatrix
Docs for BlockMatrix. mengxr

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4664 from brkyvz/SPARK-5507PR and squashes the following commits:

4db30b0 [Burak Yavuz] [SPARK-5507] Added documentation for BlockMatrix
2015-02-18 10:11:08 -08:00
Xiangrui Meng 85e9d091d5 [SPARK-5519][MLLIB] add user guide with example code for fp-growth
The API is still not very Java-friendly because `Array[Item]` in `freqItemsets` is recognized as `Object` in Java. We might want to define a case class to wrap the return pair to make it Java friendly.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4661 from mengxr/SPARK-5519 and squashes the following commits:

58ccc25 [Xiangrui Meng] add user guide with example code for fp-growth
2015-02-18 10:09:56 -08:00
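A minimal sketch matching the shape of the new guide's example (the transactions and support threshold are illustrative; the `FreqItemset` wrapper shown here comes from the later SPARK-5900 change; assumes an active SparkContext `sc`):

```scala
import org.apache.spark.mllib.fpm.FPGrowth

// Each transaction is an array of items.
val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"), Array("a", "b"), Array("a", "c")))

val model = new FPGrowth()
  .setMinSupport(0.5)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
```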
MechCoder e79a7a626d SPARK-4610 addendum: [Minor] [MLlib] Minor doc fix in GBT classification example
numClassesForClassification has been renamed to numClasses.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4672 from MechCoder/minor-doc and squashes the following commits:

d2ddb7f [MechCoder] Minor doc fix in GBT classification example
2015-02-18 10:13:40 +00:00
Burak Yavuz ae6cfb3acd [SPARK-5811] Added documentation for maven coordinates and added Spark Packages support
Documentation for maven coordinates + Spark Package support. Added pyspark tests for `--packages`

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #4662 from brkyvz/SPARK-5811 and squashes the following commits:

56ccccd [Burak Yavuz] fixed broken test
64cb8ee [Burak Yavuz] passed pep8 on local
c07b81e [Burak Yavuz] fixed pep8
a8bd6b7 [Burak Yavuz] submit PR
4ef4046 [Burak Yavuz] ready for PR
8fb02e5 [Burak Yavuz] merged master
25c9b9f [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into python-jar
560d13b [Burak Yavuz] before PR
17d3f76 [Davies Liu] support .jar as python package
a3eb717 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5811
c60156d [Burak Yavuz] [SPARK-5811] Added documentation for maven coordinates
2015-02-17 17:23:22 -08:00
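A brief illustration of the new flag (the coordinate below is hypothetical): dependencies are given as groupId:artifactId:version and resolved at submit time.

```
./bin/spark-shell --packages com.example:my-spark-lib_2.10:0.1.0
./bin/spark-submit --packages com.example:my-spark-lib_2.10:0.1.0 \
  --class com.example.MyApp my-app.jar
```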
CodingCat 31efb39c1d [Minor] fix typo in SQL document
Author: CodingCat <zhunansjtu@gmail.com>

Closes #4656 from CodingCat/fix_typo and squashes the following commits:

b41d15c [CodingCat] recover
689fe46 [CodingCat] fix typo
2015-02-17 12:16:52 -08:00
Patrick Wendell a51d51ffac SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream
Author: Patrick Wendell <patrick@databricks.com>

Closes #4638 from pwendell/SPARK-5850 and squashes the following commits:

386126f [Patrick Wendell] SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream.
2015-02-16 20:33:33 -08:00
Patrick Wendell 04b401da81 HOTFIX: Break in Jekyll build from #4589
That patch had a line break in the middle of a {{ }} expression, which is not allowed.
2015-02-16 15:44:01 -08:00
martinzapletal 61eb12674b [MLLIB][SPARK-5502] User guide for isotonic regression
User guide for isotonic regression added to docs/mllib-regression.md including code examples for Scala and Java.

Author: martinzapletal <zapletal-martin@email.cz>

Closes #4536 from zapletal-martin/SPARK-5502 and squashes the following commits:

67fe773 [martinzapletal] SPARK-5502 reworded model prediction rules to use more general language rather than the code/implementation specific terms
80bd4c3 [martinzapletal] SPARK-5502 created docs page for isotonic regression, added links to the page, updated data and examples
7d8136e [martinzapletal] SPARK-5502 Added documentation for Isotonic regression including examples for Scala and Java
504b5c3 [martinzapletal] SPARK-5502 Added documentation for Isotonic regression including examples for Scala and Java
2015-02-15 09:10:03 -08:00
gasparms f80e2629bb [SPARK-5800] Streaming Docs. Change linked files according the selected language
Currently, the Spark Streaming Programming Guide, after the updateStateByKey explanation, links to the file stateful_network_wordcount.py and the note "For the complete Scala code ..." no matter which language tab is selected. This is an inconsistency.

I've changed the guide to link the example file pertinent to each language. The JavaStatefulNetworkWordCount.java example had not been created, so I added it in this commit.

Author: gasparms <gmunoz@stratio.com>

Closes #4589 from gasparms/feature/streaming-guide and squashes the following commits:

7f37f89 [gasparms] More style changes
ec202b0 [gasparms] Follow spark style guide
f527328 [gasparms] Improve example to look like scala example
4d8785c [gasparms] Remove throw exception
e92e6b8 [gasparms] Fix incoherence
92db405 [gasparms] Fix Streaming Programming Guide. Change files according the selected language
2015-02-14 20:10:29 +00:00
Xiangrui Meng cc56c8729a [SPARK-5806] re-organize sections in mllib-clustering.md
Put example code close to the algorithm description.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4598 from mengxr/SPARK-5806 and squashes the following commits:

a137872 [Xiangrui Meng] re-organize sections in mllib-clustering.md
2015-02-13 15:09:27 -08:00
Emre Sevinç 9f31db0610 SPARK-5805 Fixed the type error in documentation.
Fixes SPARK-5805 : Fix the type error in the final example given in MLlib - Clustering documentation.

Author: Emre Sevinç <emre.sevinc@gmail.com>

Closes #4596 from emres/SPARK-5805 and squashes the following commits:

1029f66 [Emre Sevinç] SPARK-5805 Fixed the type error in documentation.
2015-02-13 12:31:27 -08:00
Antonio Navarro Perez 6a1be026cf [SQL][DOCS] Update sql documentation
Updated the examples to use the new API and added the DataFrame concept.

Author: Antonio Navarro Perez <ajnavarro@users.noreply.github.com>

Closes #4560 from ajnavarro/ajnavarro-doc-sql-update and squashes the following commits:

82ebcf3 [Antonio Navarro Perez] Changed a missing JavaSQLContext to SQLContext.
8d5376a [Antonio Navarro Perez] fixed typo
8196b6b [Antonio Navarro Perez] [SQL][DOCS] Update sql documentation
2015-02-12 12:46:17 -08:00
Sean Owen 9a3ea49f74 SPARK-5727 [BUILD] Remove Debian packaging
(for master / 1.4 only)

Author: Sean Owen <sowen@cloudera.com>

Closes #4526 from srowen/SPARK-5727.2 and squashes the following commits:

83ba49c [Sean Owen] Remove Debian packaging
2015-02-12 12:36:26 +00:00
Daniel Darabos 03bf704bf4 Remove outdated remark about take(n).
Looking at the code, I believe this remark about `take(n)` computing partitions on the driver is no longer correct. Apologies if I'm wrong.

This came up in http://stackoverflow.com/q/28436559/3318517.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #4533 from darabos/patch-2 and squashes the following commits:

cc80f3a [Daniel Darabos] Remove outdated remark about take(n).
2015-02-11 20:24:17 +00:00
Sean Owen bd0d6e0cc3 SPARK-5727 [BUILD] Deprecate Debian packaging
This just adds a deprecation message. It's intended for backporting to branch 1.3 but can go in master too, to be followed by another PR that removes it for 1.4.

Author: Sean Owen <sowen@cloudera.com>

Closes #4516 from srowen/SPARK-5727.1 and squashes the following commits:

d48989f [Sean Owen] Refer to Spark 1.4
6c1c8b3 [Sean Owen] Deprecate Debian packaging
2015-02-11 08:30:16 +00:00
Davies Liu ea60284095 [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns
Deprecate inferSchema() and applySchema(); use createDataFrame() instead, which can take an optional `schema` to create a DataFrame from an RDD. The `schema` can be a StructType or a list of column names.

Author: Davies Liu <davies@databricks.com>

Closes #4498 from davies/create and squashes the following commits:

08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
2015-02-10 19:40:12 -08:00
Marcelo Vanzin 20a6013106 [SPARK-2996] Implement userClassPathFirst for driver, yarn.
Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
`spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
modifies the system classpath, instead of restricting the changes to the user's class
loader. So this change implements the behavior of the latter for Yarn, and deprecates
the more dangerous choice.

To be able to achieve feature-parity, I also implemented the option for drivers (the existing
option only applies to executors). So now there are two options, each controlling whether
to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
aliased to the new one (`spark.executor.userClassPathFirst`).

The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
was also doing some things that ended up causing JVM errors depending on how things
were being called.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:

9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
a8c69f1 [Marcelo Vanzin] Review feedback.
cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaninful.
fe970a7 [Marcelo Vanzin] Review feedback.
25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a10f379 [Marcelo Vanzin] Some feedback.
3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
7b57cba [Marcelo Vanzin] Remove now outdated message.
5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
fa1aafa [Marcelo Vanzin] Remove write check on user jars.
89d8072 [Marcelo Vanzin] Cleanups.
a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
7d14397 [Marcelo Vanzin] Register user jars in executor up front.
7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
2015-02-09 21:17:28 -08:00
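The resulting pair of options, sketched as a submit-time configuration (class and jar names are hypothetical):

```
./bin/spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MyApp my-app.jar
```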
Xiangrui Meng 855d12ac0a [SPARK-5539][MLLIB] LDA guide
This is the LDA user guide from jkbradley with Java and Scala code example.

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4465 from mengxr/lda-guide and squashes the following commits:

6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
2015-02-08 23:40:36 -08:00
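A minimal sketch of the Scala usage the guide adds (the tiny corpus is hypothetical; assumes an active SparkContext `sc`):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Corpus of (document id, term-count vector) pairs.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0))))

val ldaModel = new LDA().setK(2).run(corpus)
// Each column of topicsMatrix is one topic's distribution over the vocabulary.
println(ldaModel.topicsMatrix)
```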
Sam Halliday 56aff4bd6c SPARK-5665 [DOCS] Update netlib-java documentation
I am the author of netlib-java and I found this documentation to be out of date. Some main points:

1. Breeze has not depended on jBLAS for some time
2. netlib-java provides a pure JVM implementation as the fallback (the original docs did not appear to be aware of this, claiming that gfortran was necessary)
3. The licensing issue is not just about LGPL: optimised natives have proprietary licenses. Building with the LGPL flag turned on really doesn't help you get past this.
4. I really think it's best to direct people to my detailed setup guide instead of trying to compress it into one sentence. It is different for each architecture, each OS, and for each backend.

I hope this helps to clear things up 😄

Author: Sam Halliday <sam.halliday@Gmail.com>
Author: Sam Halliday <sam.halliday@gmail.com>

Closes #4448 from fommil/patch-1 and squashes the following commits:

18cda11 [Sam Halliday] remove link to skillsmatters at request of @mengxr
a35e4a9 [Sam Halliday] reword netlib-java/breeze docs
2015-02-08 16:34:26 -08:00
WangTaoTheTonic d34f79c8db [SPARK-2945][YARN][Doc]add doc for spark.executor.instances
https://issues.apache.org/jira/browse/SPARK-2945

spark.executor.instances works. As this JIRA recommended, we should add docs for this common config.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #4350 from WangTaoTheTonic/SPARK-2945 and squashes the following commits:

4c3913a [WangTaoTheTonic] not compatible with dynamic allocation
5fa9c46 [WangTaoTheTonic] add doc for spark.executor.instances
2015-02-06 11:58:22 -08:00
Andrew Or fe3740c4c8 [SPARK-5636] Ramp up faster in dynamic allocation
A recent patch #4051 made the initial number default to 0. With this change, any Spark application using dynamic allocation's default settings will ramp up very slowly. Since we never request more executors than needed to saturate the pending tasks, it is safe to ramp up quickly. The current default of 60 may be too slow.

Author: Andrew Or <andrew@databricks.com>

Closes #4409 from andrewor14/dynamic-allocation-interval and squashes the following commits:

d3cc485 [Andrew Or] Lower request interval
2015-02-06 10:55:13 -08:00
Travis Galoppo 9ad56ad2a2 [SPARK-5013] [MLlib] Added documentation and sample data file for GaussianMixture
Simple description and code samples (and sample data) for GaussianMixture

Author: Travis Galoppo <tjg2107@columbia.edu>

Closes #4401 from tgaloppo/spark-5013 and squashes the following commits:

c9ff9a5 [Travis Galoppo] Fixed link in mllib-clustering.md Added Gaussian mixture and power iteration as available clustering techniques in mllib-guide
2368690 [Travis Galoppo] Minor fixes
3eb41fa [Travis Galoppo] [SPARK-5013] Added documentation and sample data file for GaussianMixture
2015-02-06 10:26:51 -08:00
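A minimal sketch of the documented usage (the two-dimensional points are illustrative; assumes an active SparkContext `sc`):

```scala
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.1, 0.0),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.1, 9.0)))

val gmm = new GaussianMixture().setK(2).run(points)
for (i <- 0 until gmm.k) {
  println(s"weight=${gmm.weights(i)} mu=${gmm.gaussians(i).mu} sigma=${gmm.gaussians(i).sigma}")
}
```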
Miguel Peralvo f827ef4d7e Update ec2-scripts.md
Change spark-version from 1.1.0 to 1.2.0 in the example for spark-ec2/Launch Cluster.

Author: Miguel Peralvo <miguel.peralvo@gmail.com>

Closes #4300 from MiguelPeralvo/patch-1 and squashes the following commits:

38adf0b [Miguel Peralvo] Update ec2-scripts.md
1850869 [Miguel Peralvo] Update ec2-scripts.md
2015-02-06 11:04:48 +00:00
Daoyuan Wang 6fa4ac1b00 [Branch-1.3] [DOC] doc fix for date
Trivial fix.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4400 from adrian-wang/docdate and squashes the following commits:

31bbe40 [Daoyuan Wang] doc fix for date
2015-02-05 12:42:27 -08:00
Matei Zaharia 4d74f0601a [SPARK-5608] Improve SEO of Spark documentation pages
- Add meta description tags on some of the most important doc pages
- Shorten the titles of some pages to have more relevant keywords; for
  example there's no reason to have "Spark SQL Programming Guide - Spark
  1.2.0 documentation", we can just say "Spark SQL - Spark 1.2.0
  documentation".

Author: Matei Zaharia <matei@databricks.com>

Closes #4381 from mateiz/docs-seo and squashes the following commits:

4940563 [Matei Zaharia] [SPARK-5608] Improve SEO of Spark documentation pages
2015-02-05 11:12:50 -08:00
Josh Rosen 9a7ce70eab [SPARK-5411] Allow SparkListeners to be specified in SparkConf and loaded when creating SparkContext
This patch introduces a new configuration option, `spark.extraListeners`, that allows SparkListeners to be specified in SparkConf and registered before the SparkContext is initialized.  From the configuration documentation:

> A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception.

The motivation for this patch is to allow monitoring code to be easily injected into existing Spark programs without having to modify those programs' code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4111 from JoshRosen/SPARK-5190-register-sparklistener-in-sc-constructor and squashes the following commits:

8370839 [Josh Rosen] Two minor fixes after merging with master
6e0122c [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5190-register-sparklistener-in-sc-constructor
1a5b9a0 [Josh Rosen] Remove SPARK_EXTRA_LISTENERS environment variable.
2daff9b [Josh Rosen] Add a couple of explanatory comments for SPARK_EXTRA_LISTENERS.
b9973da [Josh Rosen] Add test to ensure that conf and env var settings are merged, not overridden.
d6f3113 [Josh Rosen] Use getConstructors() instead of try-catch to find right constructor.
d0d276d [Josh Rosen] Move code into setupAndStartListenerBus() method
b22b379 [Josh Rosen] Instantiate SparkListeners from classes listed in configurations.
9c0d8f1 [Josh Rosen] Revert "[SPARK-5190] Allow SparkListeners to be registered before SparkContext starts."
217ecc0 [Josh Rosen] Revert "Add addSparkListener to JavaSparkContext"
25988f3 [Josh Rosen] Add addSparkListener to JavaSparkContext
163ba19 [Josh Rosen] [SPARK-5190] Allow SparkListeners to be registered before SparkContext starts.
2015-02-04 17:18:03 -08:00
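A minimal sketch of a listener wired in through the new option (the class name is hypothetical); it follows the documented lookup rules by exposing a single-argument SparkConf constructor:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

class JobEndLogger(conf: SparkConf) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

// Registered without touching application code, e.g. via
// spark.extraListeners=com.example.JobEndLogger
```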
Daoyuan Wang 0c20ce69fb [SPARK-4987] [SQL] parquet timestamp type support
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits:

b1e2a0d [Daoyuan Wang] fix for nanos
4dadef1 [Daoyuan Wang] fix wrong read
93f438d [Daoyuan Wang] parquet timestamp support
2015-02-03 12:06:06 -08:00
Cheng Lian 60f67e7a14 [Doc] Minor: Fixes several formatting issues
Fixes several minor formatting issues in the [Continuous Compilation] [1] section.

[1]: http://spark.apache.org/docs/latest/building-spark.html#continuous-compilation


Author: Cheng Lian <lian@databricks.com>

Closes #4316 from liancheng/fix-build-instruction-docs and squashes the following commits:

0a92e01 [Cheng Lian] Fixes several formatting issues
2015-02-02 21:14:21 -08:00
Jacek Lewandowski cfea30037f Spark 3883: SSL support for HttpServer and Akka
SPARK-3883: SSL support for Akka connections and Jetty based file servers.

This story introduced the following changes:
- Introduced SSLOptions object which holds the SSL configuration and can build the appropriate configuration for Akka or Jetty. SSLOptions can be created by parsing SparkConf entries at a specified namespace.
- SSLOptions is created and kept by SecurityManager
- All Akka actor address creation snippets based on interpolated strings were replaced by dedicated methods from AkkaUtils. Those methods select the proper Akka protocol - whether akka.tcp or akka.ssl.tcp
- Added test cases for AkkaUtils, FileServer, SSLOptions and SecurityManager
- Added a way to use node local SSL configuration by executors and driver in standalone mode. It can be done by specifying spark.ssl.useNodeLocalConf in SparkConf.
- Made CoarseGrainedExecutorBackend not overwrite the settings that are executor startup configuration - they are passed from the Worker anyway

Refer to https://github.com/apache/spark/pull/3571 for discussion and details

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Author: Jacek Lewandowski <jacek.lewandowski@datastax.com>

Closes #3571 from jacek-lewandowski/SPARK-3883-master and squashes the following commits:

9ef4ed1 [Jacek Lewandowski] Merge pull request #2 from jacek-lewandowski/SPARK-3883-docs2
fb31b49 [Jacek Lewandowski] SPARK-3883: Added SSL setup documentation
2532668 [Jacek Lewandowski] SPARK-3883: Refactored AkkaUtils.protocol method to not use Try
90a8762 [Jacek Lewandowski] SPARK-3883: Refactored methods to resolve Akka address and made it possible to easily configure multiple communication layers for SSL
72b2541 [Jacek Lewandowski] SPARK-3883: A reference to the fallback SSLOptions can be provided when constructing SSLOptions
93050f4 [Jacek Lewandowski] SPARK-3883: SSL support for HttpServer and Akka
2015-02-02 17:27:26 -08:00
Sandy Ryza b2047b55c5 SPARK-4585. Spark dynamic executor allocation should use minExecutors as...
... initial number

Author: Sandy Ryza <sandy@cloudera.com>

Closes #4051 from sryza/sandy-spark-4585 and squashes the following commits:

d1dd039 [Sandy Ryza] Add spark.dynamicAllocation.initialNumExecutors and make min and max not required
b7c59dc [Sandy Ryza] SPARK-4585. Spark dynamic executor allocation should use minExecutors as initial number
2015-02-02 12:27:08 -08:00
Octavian Geagla bdb0680d37 [SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use
This seems complete. The duplication of tests for provided means/variances might be overkill; I would appreciate some feedback.

Author: Octavian Geagla <ogeagla@gmail.com>

Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:

fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
2015-02-01 09:21:14 -08:00
sboeschhuawei f377431a57 [SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
Add single pseudo-eigenvector PIC
Including documentation and an updated pom.xml with the following code:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala

Author: sboeschhuawei <stephen.boesch@huawei.com>
Author: Fan Jiang <fanjiang.sc@huawei.com>
Author: Jiang Fan <fjiang6@gmail.com>
Author: Stephen Boesch <stephen.boesch@huawei.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4254 from fjiang6/PIC and squashes the following commits:

4550850 [sboeschhuawei] Removed pic test data
f292f31 [Stephen Boesch] Merge pull request #44 from mengxr/SPARK-4259
4b78aaf [Xiangrui Meng] refactor PIC
24fbf52 [sboeschhuawei] Updated API to be similar to KMeans plus other changes requested by Xiangrui on the PR
c12dfc8 [sboeschhuawei] Removed examples files and added pic_data.txt. Revamped testcases yet to come
92d4752 [sboeschhuawei] Move the Gaussian/affinity matrix calcs out of PIC. Presently in the test suite
7ebd149 [sboeschhuawei] Incorporate Xiangrui's first set of PR comments except restructure PIC.run to take Graph but do not remove Gaussian
121e4d5 [sboeschhuawei] Remove unused testing data files
1c3a62e [sboeschhuawei] removed matplot.py and reordered all private methods to bottom of PIC
218a49d [sboeschhuawei] Applied Xiangrui's comments - especially removing RDD/PICLinalg classes and making noncritical methods private
43ab10b [sboeschhuawei] Change last two println's to log4j logger
88aacc8 [sboeschhuawei] Add assert to testcase on cluster sizes
24f438e [sboeschhuawei] fixed incorrect markdown in clustering doc
060e6bf [sboeschhuawei] Added link to PIC doc from the main clustering md doc
be659e3 [sboeschhuawei] Added mllib specific log4j
90e7fa4 [sboeschhuawei] Converted from custom Linalg routines to Breeze: added JavaDoc comments; added Markdown documentation
bea48ea [sboeschhuawei] Converted custom Linear Algebra datatypes/routines to use Breeze.
b29c0db [Fan Jiang] Update PIClustering.scala
ace9749 [Fan Jiang] Update PIClustering.scala
a112f38 [sboeschhuawei] Added graphx main and test jars as dependencies to mllib/pom.xml
f656c34 [sboeschhuawei] Added iris dataset
b7dbcbe [sboeschhuawei] Added axes and combined into single plot for matplotlib
a2b1e57 [sboeschhuawei] Revert inadvertent update to KMeans
9294263 [sboeschhuawei] Added visualization/plotting of input/output data
e5df2b8 [sboeschhuawei] First end to end working PIC
0700335 [sboeschhuawei] First end to end working version: but has bad performance issue
32a90dc [sboeschhuawei] Update circles test data values
0ef163f [sboeschhuawei] Added ConcentricCircles data generation and KMeans clustering
3fd5bc8 [sboeschhuawei] PIClustering is running in new branch (up to the pseudo-eigenvector convergence step)
d5aae20 [Jiang Fan] Adding Power Iteration Clustering and Suite test
a3c5fbe [Jiang Fan] Adding Power Iteration Clustering
2015-01-30 14:09:49 -08:00
Yandu Oppacher 3bead67d59 [SPARK-4387][PySpark] Refactoring python profiling code to make it extensible
This PR is based on #3255, fixing conflicts and code style.

Closes #3255.

Author: Yandu Oppacher <yandu.oppacher@jadedpixel.com>
Author: Davies Liu <davies@databricks.com>

Closes #3901 from davies/refactor-python-profile-code and squashes the following commits:

b4a9306 [Davies Liu] fix tests
4b79ce8 [Davies Liu] add docstring for profiler_cls
2700e47 [Davies Liu] use BasicProfiler as default
349e341 [Davies Liu] more refactor
6a5d4df [Davies Liu] refactor and fix tests
31bf6b6 [Davies Liu] fix code style
0864b5d [Yandu Oppacher] Remove unused method
76a6c37 [Yandu Oppacher] Added a profile collector to accumulate the profilers per stage
9eefc36 [Yandu Oppacher] Fix doc
9ace076 [Yandu Oppacher] Refactor of profiler, and moved tests around
8739aff [Yandu Oppacher] Code review fixes
9bda3ec [Yandu Oppacher] Refactor profiler code
2015-01-28 13:48:06 -08:00
Sandy Ryza 406f6d3070 SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
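
For context, aggregateByKey asks for less ceremony than combineByKey; a minimal sketch with hypothetical data (assumes spark-shell's `sc`):

```Scala
// Sum values per key: a zero value, a within-partition add,
// and a cross-partition merge; no createCombiner function needed.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.aggregateByKey(0)(_ + _, _ + _)
sums.collect() // Array(("a", 4), ("b", 2)), ordering may vary
```
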
Author: Sandy Ryza <sandy@cloudera.com>

Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits:

460827a [Sandy Ryza] Python too
d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
2015-01-28 12:41:23 -08:00
Davies Liu fdaad4eb03 [MLlib] fix python example of ALS in guide
Fix the Python example of ALS in the guide: use Rating instead of np.array.
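
The same pattern in Scala, for reference: ratings are Rating objects, not raw arrays (input path and hyperparameters are hypothetical):

```Scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,product,rating" lines into Rating objects.
val ratings = sc.textFile("data/ratings.csv").map(_.split(',') match {
  case Array(user, product, rating) => Rating(user.toInt, product.toInt, rating.toDouble)
})
val model = ALS.train(ratings, 10, 10, 0.01) // rank, iterations, lambda
```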

Author: Davies Liu <davies@databricks.com>

Closes #4226 from davies/fix_als_guide and squashes the following commits:

1433d76 [Davies Liu] fix python example of als in guide
2015-01-27 15:33:01 -08:00
Sean Owen 9f6435763d SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone works in cluster mode
This is a trivial addendum to SPARK-4506, which was already resolved, as noted by Asim Jalis in SPARK-4506.

Author: Sean Owen <sowen@cloudera.com>

Closes #4160 from srowen/SPARK-4506 and squashes the following commits:

5f5f7df [Sean Owen] Update more docs to reflect that standalone works in cluster mode
2015-01-25 15:25:05 -08:00
Sean Owen c586b45dd2 SPARK-3852 [DOCS] Document spark.driver.extra* configs
As per the JIRA. I copied the `spark.executor.extra*` text, but removed info that appears to be specific to the `executor` config and not `driver`.
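
A sketch of the driver-side keys being documented (values purely illustrative):

```Scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+PrintGCDetails") // illustrative JVM flag
  .set("spark.driver.extraClassPath", "/opt/extra-libs/*")     // hypothetical path
```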

Author: Sean Owen <sowen@cloudera.com>

Closes #4185 from srowen/SPARK-3852 and squashes the following commits:

f60a8a1 [Sean Owen] Document spark.driver.extra* configs
2015-01-25 15:08:35 -08:00
Jongyoul Lee 09e09c548e [SPARK-5058] Part 2. Typos and broken URL
- Also fixed java link

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #4172 from jongyoul/SPARK-FIXDOC and squashes the following commits:

6be03e5 [Jongyoul Lee] [SPARK-5058] Part 2. Typos and broken URL - Also fixed java link
2015-01-23 23:34:11 -08:00
scwf 1a200a3eea [SQL][Minor] Update sql doc according to data type APIs changes
Follow up of #3925
/cc rxin

Author: scwf <wangfei1@huawei.com>

Closes #4095 from scwf/sql-doc and squashes the following commits:

97e311b [scwf] update sql doc since now expose only one version of the data type APIs
2015-01-18 11:03:13 -08:00
Ilya Ganelin fd3a8a1d15 [SPARK-733] Add documentation on use of accumulators in lazy transformation
I've added documentation addressing the particular point of confusion highlighted in the relevant JIRA. I've also added code examples for this issue to clarify the explanation.
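
The pitfall in miniature (assumes spark-shell's `sc`):

```Scala
val acc = sc.accumulator(0)
val mapped = sc.parallelize(1 to 100).map { x => acc += x; x }
// Transformations are lazy, so nothing has run yet:
println(acc.value) // 0
mapped.count()     // an action forces evaluation...
println(acc.value) // ...and only now does the accumulator hold 5050
```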

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4022 from ilganeli/SPARK-733 and squashes the following commits:

587def5 [Ilya Ganelin] Updated to clarify verbiage
df3afd7 [Ilya Ganelin] Revert "Partially updated task metrics to make some vars private"
3f6c512 [Ilya Ganelin] Revert "Completed refactoring to make vars in TaskMetrics class private"
58034fb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
4dc2cdb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
3a38db1 [Ilya Ganelin] Verified documentation update by building via jekyll
33b5a2d [Ilya Ganelin] Added code examples for java and python
1fd59b2 [Ilya Ganelin] Updated documentation for accumulators to highlight lazy evaluation issue
5525c20 [Ilya Ganelin] Completed refactoring to make vars in TaskMetrics class private
c64da4f [Ilya Ganelin] Partially updated task metrics to make some vars private
2015-01-16 13:25:17 -08:00
Sean Owen f6b852aade [DOCS] Fix typo in return type of cogroup
This fixes a simple typo in the cogroup docs noted in http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAMAsSdJ8_24evMAMg7fOZCQjwimisbYWa9v8BN6Rc3JCauja6wmail.gmail.com%3E

I didn't bother with a JIRA
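
The corrected return type, sketched: each key maps to a pair of Iterables, one per input RDD:

```Scala
val left  = sc.parallelize(Seq((1, "a"), (1, "b")))
val right = sc.parallelize(Seq((1, 1.0)))
// Inferred type: RDD[(Int, (Iterable[String], Iterable[Double]))]
val grouped = left.cogroup(right)
```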

Author: Sean Owen <sowen@cloudera.com>

Closes #4072 from srowen/CogroupDocFix and squashes the following commits:

43c850b [Sean Owen] Fix typo in return type of cogroup
2015-01-16 09:28:44 -08:00
WangTaoTheTonic 2be82b1e66 [SPARK-1507][YARN]specify # cores for ApplicationMaster
Based on top of changes in https://github.com/apache/spark/pull/3806.

https://issues.apache.org/jira/browse/SPARK-1507

`--driver-cores` and `spark.driver.cores` for all cluster modes and `spark.yarn.am.cores` for yarn client mode.
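
A sketch of the two settings side by side (values hypothetical):

```Scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.cores", "2")  // cluster modes: cores for the driver
  .set("spark.yarn.am.cores", "2") // yarn-client mode: cores for the AM
```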

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #4018 from WangTaoTheTonic/SPARK-1507 and squashes the following commits:

01419d3 [WangTaoTheTonic] amend the args name
b255795 [WangTaoTheTonic] indet thing
d86557c [WangTaoTheTonic] some comments amend
43c9392 [WangTao] fix compile error
b39a100 [WangTao] specify # cores for ApplicationMaster
2015-01-16 09:16:56 -08:00
Xiangrui Meng 6abc45e340 [SPARK-5254][MLLIB] remove developers section from spark.ml guide
Forgot to remove this section in #4052.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4053 from mengxr/SPARK-5254-update and squashes the following commits:

f295bde [Xiangrui Meng] remove developers section from spark.ml guide
2015-01-14 18:54:17 -08:00
Xiangrui Meng 13d2406781 [SPARK-5254][MLLIB] Update the user guide to position spark.ml better
The current statement in the user guide may deliver confusing messages to users. spark.ml contains high-level APIs for building ML pipelines. But it doesn't mean that spark.mllib is being deprecated.

First of all, the pipeline API is in its alpha stage and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use the spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable using spark.mllib features and can expect more to come. The user guide needs to be updated to make this message clear.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4052 from mengxr/SPARK-5254 and squashes the following commits:

6d5f1d3 [Xiangrui Meng] typo
0cc935b [Xiangrui Meng] update user guide to position spark.ml better
2015-01-14 17:50:33 -08:00
uncleGen 39e333ec43 [SPARK-5131][Streaming][DOC]: There is a discrepancy in WAL implementation and configuration doc.

Author: uncleGen <hustyugm@gmail.com>

Closes #3930 from uncleGen/master-clean-doc and squashes the following commits:

3a4245f [uncleGen] doc typo
8e407d3 [uncleGen] doc typo
2015-01-13 10:07:19 -08:00
lewuathe 1656aae2b4 [SPARK-5073] spark.storage.memoryMapThreshold has two default values
Because the typical OS page size is about 4KB, the default value of spark.storage.memoryMapThreshold is unified to 2 * 4096.

Author: lewuathe <lewuathe@me.com>

Closes #3900 from Lewuathe/integrate-memoryMapThreshold and squashes the following commits:

e417acd [lewuathe] [SPARK-5073] Update docs/configuration
834aba4 [lewuathe] [SPARK-5073] Fix style
adcea33 [lewuathe] [SPARK-5073] Integrate memory map threshold to 2MB
fcce2e5 [lewuathe] [SPARK-5073] spark.storage.memoryMapThreshold have two default value
2015-01-11 13:50:42 -08:00
Kousuke Saruta ae628725ab [DOC] Fixed Mesos version in doc from 0.18.1 to 0.21.0
#3934 upgraded the Mesos version, so we should also fix the docs, right?

This issue is really minor, so I didn't file a JIRA.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3982 from sarutak/fix-mesos-version and squashes the following commits:

9a86ee3 [Kousuke Saruta] Fixed mesos version from 0.18.1 to 0.21.0
2015-01-09 14:40:45 -08:00
WangTaoTheTonic e966452060 [SPARK-1953][YARN] yarn client mode Application Master memory size is same as driver memory size

Ways to set Application Master's memory on yarn-client mode:
1.  `spark.yarn.am.memory` in SparkConf or System Properties
2.  default value 512m

Note: this argument is only available in yarn-client mode.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #3607 from WangTaoTheTonic/SPARK4181 and squashes the following commits:

d5ceb1b [WangTaoTheTonic] spark.driver.memory is used in both modes
6c1b264 [WangTaoTheTonic] rebase
b8410c0 [WangTaoTheTonic] minor optimization
ddcd592 [WangTaoTheTonic] fix the bug produced in rebase and some improvements
3bf70cc [WangTaoTheTonic] rebase and give proper hint
987b99d [WangTaoTheTonic] disable --driver-memory in client mode
2b27928 [WangTaoTheTonic] inaccurate description
b7acbb2 [WangTaoTheTonic] incorrect method invoked
2557c5e [WangTaoTheTonic] missing a single blank
42075b0 [WangTaoTheTonic] arrange the args and warn logging
69c7dba [WangTaoTheTonic] rebase
1960d16 [WangTaoTheTonic] fix wrong comment
7fa9e2e [WangTaoTheTonic] log a warning
f6bee0e [WangTaoTheTonic] docs issue
d619996 [WangTaoTheTonic] Merge branch 'master' into SPARK4181
b09c309 [WangTaoTheTonic] use code format
ab16bb5 [WangTaoTheTonic] fix bug and add comments
44e48c2 [WangTaoTheTonic] minor fix
6fd13e1 [WangTaoTheTonic] add overhead mem and remove some configs
0566bb8 [WangTaoTheTonic] yarn client mode Application Master memory size is same as driver memory size
2015-01-09 13:23:13 -08:00
Sean Owen 547df97715 SPARK-5136 [DOCS] Improve documentation around setting up Spark IntelliJ project
This PR simply points to the IntelliJ wiki page instead of also including IntelliJ notes in the docs. The intent, however, is to also update the wiki page with updated tips. This is the text I propose for the IntelliJ section on the wiki. I realize it omits some of the existing instructions on the wiki about enabling Hive, but I think those are actually optional.

------

IntelliJ supports both Maven- and SBT-based projects. It is recommended, however, to import Spark as a Maven project. Choose "Import Project..." from the File menu, and select the `pom.xml` file in the Spark root directory.

It is fine to leave all settings at their default values in the Maven import wizard, with two caveats. First, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will automatically update the IntelliJ project.

Second, note the step that prompts you to choose active Maven build profiles. As documented above, some build configurations require specific profiles to be enabled. The same profiles that are enabled with `-P[profile name]` above may be enabled on this screen. For example, if developing for Hadoop 2.4 with YARN support, enable profiles `yarn` and `hadoop-2.4`.

These selections can be changed later by accessing the "Maven Projects" tool window from the View menu, and expanding the Profiles section.

"Rebuild Project" can fail the first time the project is compiled, because generate source files are not automatically generated. Try clicking the  "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources.

Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. Compilation will then work, although the option will return when the project reimports.

Author: Sean Owen <sowen@cloudera.com>

Closes #3952 from srowen/SPARK-5136 and squashes the following commits:

f3baa66 [Sean Owen] Point to new IJ / Eclipse wiki link
016b7df [Sean Owen] Point to IntelliJ wiki page instead of also including IntelliJ notes in the docs
2015-01-09 09:35:46 -08:00
WangTaoTheTonic 8fdd48959c [SPARK-2165][YARN] add support for setting maxAppAttempts in the ApplicationSubmissionContext

https://issues.apache.org/jira/browse/SPARK-2165

I still have 2 questions:
* If this config is not set, should we use YARN's corresponding value or a default value (like 2) on the Spark side?
* Is the config name best? Or "spark.yarn.am.maxAttempts"?

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #3878 from WangTaoTheTonic/SPARK-2165 and squashes the following commits:

1416c83 [WangTaoTheTonic] use the name spark.yarn.maxAppAttempts
202ac85 [WangTaoTheTonic] rephrase some
afdfc99 [WangTaoTheTonic] more detailed description
91562c6 [WangTaoTheTonic] add support for setting maxAppAttempts in the ApplicationSubmissionContext
2015-01-07 08:14:39 -06:00
Reynold Xin bbcba3a943 [SPARK-5093] Set spark.network.timeout to 120s consistently.
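
A sketch of the umbrella setting; the more specific transport timeouts fall back to it when unset:

```Scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.network.timeout", "120") // seconds
```
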
Author: Reynold Xin <rxin@databricks.com>

Closes #3903 from rxin/timeout-120 and squashes the following commits:

7c2138e [Reynold Xin] [SPARK-5093] Set spark.network.timeout to 120s consistently.
2015-01-05 15:19:53 -08:00
Varun Saxena d3f07fd23c [SPARK-4688] Have a single shared network timeout in Spark

Author: Varun Saxena <vsaxena.varun@gmail.com>
Author: varunsaxena <vsaxena.varun@gmail.com>

Closes #3562 from varunsaxena/SPARK-4688 and squashes the following commits:

6e97f72 [Varun Saxena] [SPARK-4688] Single shared network timeout
cd783a2 [Varun Saxena] SPARK-4688
d6f8c29 [Varun Saxena] SCALA-4688
9562b15 [Varun Saxena] SPARK-4688
a75f014 [varunsaxena] SPARK-4688
594226c [varunsaxena] SPARK-4688
2015-01-05 10:32:37 -08:00
Josh Rosen 939ba1f8f6 [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs
This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.

Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists.  SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.

In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times.  In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions.  When output spec. validation is enabled, the second calls to these actions will fail due to existing output.

This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler.  This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
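
The `DynamicVariable` pattern in miniature; a generic sketch of the technique, not Spark's actual internals:

```Scala
import scala.util.DynamicVariable

val bypassValidation = new DynamicVariable[Boolean](false)

def submitJob(): Unit = {
  // Everything on this call stack sees the overridden value.
  if (bypassValidation.value) println("skipping output spec validation")
}

bypassValidation.withValue(true) { submitJob() } // validation bypassed
submitJob()                                      // back to the default
```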

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:

36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
2015-01-04 20:26:18 -08:00
sigmoidanalytics 342612b65f [SPARK-5058] Updated broken links
Updated the broken link pointing to the KafkaWordCount example to the correct one.

Author: sigmoidanalytics <mayur@sigmoidanalytics.com>

Closes #3877 from sigmoidanalytics/patch-1 and squashes the following commits:

3e19b31 [sigmoidanalytics] Updated broken links
2015-01-03 19:46:08 -08:00
Akhil Das cdccc263b2 Fixed typos in streaming-kafka-integration.md
Changed projrect to project :)

Author: Akhil Das <akhld@darktech.ca>

Closes #3876 from akhld/patch-1 and squashes the following commits:

e0cf9ef [Akhil Das] Fixed typos in streaming-kafka-integration.md
2015-01-02 15:12:27 -08:00
luogankun 2deac748b4 [SPARK-4930][SQL][DOCS]Update SQL programming guide, CACHE TABLE is eager
`CACHE TABLE tbl` is now __eager__ by default, not __lazy__.
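
The behavior in a nutshell (table name hypothetical; assumes a SQLContext `sqlContext`):

```Scala
sqlContext.sql("CACHE TABLE logs")      // eager: materializes the cache immediately
sqlContext.sql("CACHE LAZY TABLE logs") // opts back in to the old lazy behavior
```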

Author: luogankun <luogankun@gmail.com>

Closes #3773 from luogankun/SPARK-4930 and squashes the following commits:

cc17b7d [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, add CACHE [LAZY] TABLE [AS SELECT] ...
bffe0e8 [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, CACHE TABLE tbl is eager
2014-12-30 12:18:55 -08:00
luogankun f7a41a0e79 [SPARK-4916][SQL][DOCS]Update SQL programming guide about cache section
`SchemaRDD.cache()` now uses in-memory columnar storage.

Author: luogankun <luogankun@gmail.com>

Closes #3759 from luogankun/SPARK-4916 and squashes the following commits:

7b39864 [luogankun] [SPARK-4916]Update SQL programming guide
6018122 [luogankun] Merge branch 'master' of https://github.com/apache/spark into SPARK-4916
0b93785 [luogankun] [SPARK-4916]Update SQL programming guide
99b2336 [luogankun] [SPARK-4916]Update SQL programming guide
2014-12-30 12:17:49 -08:00
wangxiaojing 6645e52580 [SPARK-4982][DOC] spark.ui.retainedJobs description is wrong in Spark UI configuration guide
Author: wangxiaojing <u9jing@gmail.com>

Closes #3818 from wangxiaojing/SPARK-4982 and squashes the following commits:

fe2ad5f [wangxiaojing] change stages to jobs
2014-12-29 10:45:26 -08:00
Brennon York a3e51cc990 [SPARK-4501][Core] - Create build/mvn to automatically download maven/zinc/scalac
Creates a top-level script (as `build/mvn`) to automatically download zinc and the specific version of Scala needed to build Spark. This will also download and install Maven if the user doesn't already have it, and all packages are hosted under the `build/` directory. Tested on both Linux and OS X, and both work. All commands pass through to the Maven binary, so it acts exactly as a traditional Maven call would.

Author: Brennon York <brennon.york@capitalone.com>

Closes #3707 from brennonyork/SPARK-4501 and squashes the following commits:

0e5a0e4 [Brennon York] minor incorrect doc verbiage (with -> this)
9b79e38 [Brennon York] fixed merge conflicts with dev/run-tests, properly quoted args in sbt/sbt, fixed bug where relative paths would fail if passed in from build/mvn
d2d41b6 [Brennon York] added blurb about leveraging zinc with build/mvn
b979c58 [Brennon York] updated the merge conflict
c5634de [Brennon York] updated documentation to overview build/mvn, updated all points where sbt/sbt was referenced with build/sbt
b8437ba [Brennon York] set progress bars for curl and wget when not run on jenkins, no progress bar when run on jenkins, moved sbt script to build/sbt, wrote stub and warning under sbt/sbt which calls build/sbt, modified build/sbt to use the correct directory, fixed bug in build/sbt-launch-lib.bash to correctly pull the sbt version
be11317 [Brennon York] added switch to silence download progress only if AMPLAB_JENKINS is set
28d0a99 [Brennon York] updated to remove the python dependency, uses grep instead
7e785a6 [Brennon York] added silent and quiet flags to curl and wget respectively, added single echo output to denote start of a download if download is needed
14a5da0 [Brennon York] removed unnecessary zinc output on startup
1af4a94 [Brennon York] fixed bug with uppercase vs lowercase variable
3e8b9b3 [Brennon York] updated to properly only restart zinc if it was freshly installed
a680d12 [Brennon York] Added comments to functions and tested various mvn calls
bb8cc9d [Brennon York] removed package files
ef017e6 [Brennon York] removed OS complexities, setup generic install_app call, removed extra file complexities, removed help, removed forced install (defaults now), removed double-dash from cli
07bf018 [Brennon York] Updated to specifically handle pulling down the correct scala version
f914dea [Brennon York] Beginning final portions of localized scala home
69c4e44 [Brennon York] working linux and osx installers for purely local mvn build
4a1609c [Brennon York] finalizing working linux install for maven to local ./build/apache-maven folder
cbfcc68 [Brennon York] Changed the default sbt/sbt to build/sbt and added a build/mvn which will automatically download, install, and execute maven with zinc for easier build capability
2014-12-27 13:26:38 -08:00
zsxwing f9ed2b6641 [SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience
There is only one implicit function `toPairDStreamFunctions` in `StreamingContext`. This PR did similar reorganization like [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397).

Compiled the following codes with Spark Streaming 1.1.0 and ran it with this PR. Everything is fine.
```Scala
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object StreamingApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("/some/path")
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Author: zsxwing <zsxwing@gmail.com>

Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits:

aa6d44a [zsxwing] Fix a copy-paste error
f74c190 [zsxwing] Merge branch 'master' into SPARK-4608
e6f9cc9 [zsxwing] Update the docs
27833bb [zsxwing] Remove `import StreamingContext._`
c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience
2014-12-25 19:46:05 -08:00
Kousuke Saruta 11dd99317b [SPARK-4953][Doc] Fix the description of building Spark with YARN
At the section "Specifying the Hadoop Version" In building-spark.md, there is description about building with YARN with Hadoop 0.23.
Spark 1.3.0 will not support Hadoop 0.23 so we should fix the description.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3787 from sarutak/SPARK-4953 and squashes the following commits:

ee9c355 [Kousuke Saruta] Removed description related to a specific vendor
9ab0c24 [Kousuke Saruta] Fix the description about building SPARK with YARN
2014-12-25 07:05:43 -08:00
zsxwing 2d215aebaa [SPARK-4931][Yarn][Docs] Fix the format of running-on-yarn.md
Currently, the formatting of the log4j material in running-on-yarn.md is a bit messy.

![running-on-yarn](https://cloud.githubusercontent.com/assets/1000778/5535248/204c4b64-8ab4-11e4-83c3-b4722ea0ad9d.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #3774 from zsxwing/SPARK-4931 and squashes the following commits:

4a5f853 [zsxwing] Fix the format of running-on-yarn.md
2014-12-23 11:18:06 -08:00
Nicholas Chammas 0e532ccb2b [Docs] Minor typo fixes
Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #3772 from nchammas/patch-1 and squashes the following commits:

b7d9083 [Nicholas Chammas] [Docs] Minor typo fixes
2014-12-22 22:54:32 -08:00
Aaron Davidson fbca6b6ce2 [SPARK-4864] Add documentation to Netty-based configs
Author: Aaron Davidson <aaron@databricks.com>

Closes #3713 from aarondav/netty-configs and squashes the following commits:

8a8b373 [Aaron Davidson] Address Patrick's comments
3b1f84e [Aaron Davidson] [SPARK-4864] Add documentation to Netty-based configs
2014-12-22 13:09:22 -08:00
Tsuyoshi Ozawa 96606f69b7 [SPARK-4915][YARN] Fix classname to be specified for external shuffle service.
Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@lab.ntt.co.jp>

Closes #3757 from oza/SPARK-4915 and squashes the following commits:

3b0d6d6 [Tsuyoshi Ozawa] Fix classname to be specified for external shuffle service.
2014-12-22 11:28:05 -08:00
Andrew Or 15c03e1e0e [SPARK-4140] Document dynamic allocation
Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if the whole dynamic allocation section should be moved to its own page; I personally think the organization might be cleaner that way.

This patch builds on top of oza's work in #3689.

aarondav pwendell

Author: Andrew Or <andrew@databricks.com>
Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@gmail.com>

Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits:

1281447 [Andrew Or] Address a few comments
b9843f2 [Andrew Or] Document the configs as well
246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation
8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs)
6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled.
53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation.
2014-12-19 19:36:20 -08:00
Eran Medan c25c669d95 change signature of example to match released code
The signature of registerKryoClasses is actually Array[Class[_]], not Seq.
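
The corrected signature in use (the classes are hypothetical):

```Scala
class MyClassA
class MyClassB

val conf = new org.apache.spark.SparkConf()
conf.registerKryoClasses(Array(classOf[MyClassA], classOf[MyClassB])) // Array[Class[_]], not Seq
```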

Author: Eran Medan <ehrann.mehdan@gmail.com>

Closes #3747 from eranation/patch-1 and squashes the following commits:

ee9885d [Eran Medan] change signature of example to match released code
2014-12-19 18:30:09 -08:00
Timothy Chen d9956f86ad Add mesos specific configurations into doc
Author: Timothy Chen <tnachen@gmail.com>

Closes #3349 from tnachen/mesos_doc and squashes the following commits:

737ef49 [Timothy Chen] Add TOC
5ca546a [Timothy Chen] Update description around cores requested.
26283a5 [Timothy Chen] Add mesos specific configurations into doc
2014-12-18 12:15:53 -08:00
Sandy Ryza 253b72b56f SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3471 from sryza/sandy-spark-3779 and squashes the following commits:

20b9887 [Sandy Ryza] Deprecate old property
42b5df7 [Sandy Ryza] Review feedback
9a959a1 [Sandy Ryza] SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
2014-12-18 12:19:07 -06:00
Zhan Zhang 3b764699ff [SPARK-4461][YARN] pass extra java options to yarn application master
Currently, there is no way to pass YARN AM-specific Java options. This causes potential issues when reading the classpath from a Hadoop configuration file: Hadoop replaces variables in its properties with the system properties passed in via Java options, and how to specify their values depends on the Hadoop distribution.

The new options are SPARK_YARN_JAVA_OPTS or spark.yarn.extraJavaOptions. I make it Spark-global because, once it is set up in spark-defaults.conf, we typically don't want the user to have to specify it on the command line each time they submit a Spark job.

In addition, enabling these extra options to be passed to the AM provides more flexibility.

For example, the following valid mapred-site.xml file has a classpath that specifies values using a system property. Hadoop can handle it correctly because it has the Java options passed in.

This is the example; currently Spark will break because hadoop.version is not passed in.
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/etc/hadoop/${hadoop.version}/mapreduce/*</value>
  </property>

In the meantime, we cannot rely on mapreduce.admin.map.child.java.opts in mapred-site.xml, because it has its own extra Java options specified, which do not apply to Spark.
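
A sketch of the new option feeding the system property that the classpath above expands (the version value is hypothetical):

```Scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.am.extraJavaOptions", "-Dhadoop.version=2.4.0")
```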

Author: Zhan Zhang <zhazhan@gmail.com>

Closes #3409 from zhzhan/Spark-4461 and squashes the following commits:

daec3d0 [Zhan Zhang] solve review comments
08f44a7 [Zhan Zhang] add warning in driver mode if spark.yarn.am.extraJavaOptions is configured
5a505d3 [Zhan Zhang] solve review comments
4ed43ad [Zhan Zhang] solve review comments
ad777ed [Zhan Zhang] Merge branch 'master' into Spark-4461
3e9e574 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
e3f9abe [Zhan Zhang] solve review comments
8963552 [Zhan Zhang] rebase
f8f6700 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
dea1692 [Zhan Zhang] change the option key name to client mode specific
90d5dff [Zhan Zhang] rebase
8ac9254 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
092a25f [Zhan Zhang] solve review comments
bc5a9ae [Zhan Zhang] solve review comments
782b014 [Zhan Zhang] add new configuration to docs/running-on-yarn.md and remove it from spark-defaults.conf.template
6faaa97 [Zhan Zhang] solve review comments
369863f [Zhan Zhang] clean up unnecessary var
733de9c [Zhan Zhang] Merge branch 'master' into Spark-4461
a68e7f0 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
864505a [Zhan Zhang] Add extra java options to be passed to Yarn application master
15830fc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
685d911 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
03ebad3 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
46d9e3d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
ebb213a [Zhan Zhang] revert
b983ef3 [Zhan Zhang] test
c4efb9b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
779d67b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4daae6d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
12e1be5 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
ce0ca7b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
93f3081 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3764505 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
a9d372b [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
a00f60f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f6a8a40 [Zhan Zhang] revert
ba14f28 [Zhan Zhang] test
2014-12-18 10:01:46 -06:00
Peter Vandenabeele 1a9e35e57a [DOCS][SQL] Add a Note on jsonFile having separate JSON objects per line
* This commit hopes to avoid the confusion I faced when trying
  to submit a regular, valid multi-line JSON file; also see

  http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html
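
The constraint, sketched: one complete JSON object per line, never one object split across lines (path hypothetical; assumes a SQLContext `sqlContext`):

```Scala
// people.json, where every line is a self-contained JSON object:
//   {"name":"Ann","age":31}
//   {"name":"Bob","age":25}
val people = sqlContext.jsonFile("people.json") // parses one record per line
// A single pretty-printed object spanning several lines would fail per-record parsing.
```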

Author: Peter Vandenabeele <peter@vandenabeele.com>

Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:

1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line
2014-12-16 13:58:01 -08:00
Judy Nash 17688d1429 [SQL] SPARK-4700: Add HTTP protocol spark thrift server
Adds HTTP protocol support and test cases to the Spark Thrift server, so users can deploy the Thrift server in both TCP and HTTP modes.

Author: Judy Nash <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>

Closes #3672 from judynash/master and squashes the following commits:

526315d [Judy Nash] correct spacing on startThriftServer method
31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
2b1d312 [Judy Nash] updated http thrift server support based on feedback
377532c [judynash] add HTTP protocol spark thrift server
2014-12-16 12:37:26 -08:00
Mike Jennings d12c0711fa [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
Based on this gist:
https://gist.github.com/amar-analytx/0b62543621e1f246c0a2

We use security group ids instead of security group names to get around this issue:
https://github.com/boto/boto/issues/350

Author: Mike Jennings <mvj101@gmail.com>
Author: Mike Jennings <mvj@google.com>

Closes #2872 from mvj101/SPARK-3405 and squashes the following commits:

be9cb43 [Mike Jennings] `pep8 spark_ec2.py` runs cleanly.
4dc6756 [Mike Jennings] Remove duplicate comment
731d94c [Mike Jennings] Update for code review.
ad90a36 [Mike Jennings] Merge branch 'master' of https://github.com/apache/spark into SPARK-3405
1ebffa1 [Mike Jennings] Merge branch 'master' into SPARK-3405
52aaeec [Mike Jennings] [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
2014-12-16 12:13:21 -08:00
Ryan Williams 8176b7a02e [SPARK-4668] Fix some documentation typos.
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #3523 from ryan-williams/tweaks and squashes the following commits:

d2eddaa [Ryan Williams] code review feedback
ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
c6cfad9 [Ryan Williams] remove unnecessary if statement
b74ea35 [Ryan Williams] comment fix
b0221f0 [Ryan Williams] fix a gendered pronoun
c71ffed [Ryan Williams] use names on a few boolean parameters
89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
83e8358 [Ryan Williams] fix pom.xml typo
dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
2014-12-15 14:52:17 -08:00
Tathagata Das b004150adb [SPARK-4806] Streaming doc update for 1.2
Important updates to the streaming programming guide
- Make the fault-tolerance properties easier to understand, with information about write ahead logs
- Update the information about deploying the spark streaming app with information about Driver HA
- Update Receiver guide to discuss reliable vs unreliable receivers.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>

Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:

f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Python API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
2014-12-11 06:21:23 -08:00
Andrew Ash 652b781a9b SPARK-3526 Add section about data locality to the tuning guide
cc kayousterhout

I have a few outstanding questions from compiling this documentation:
- What's the difference between NO_PREF and ANY?  I understand the implications of the ordering but don't know what an example of each would be
- Why is NO_PREF ahead of RACK_LOCAL?  I would think it'd be better to schedule rack-local tasks ahead of no preference if you could only do one or the other.  Is the idea to wait longer and hope for the rack-local tasks to turn into node-local or better?
- Will there be a datacenter-local locality level in the future?  Apache Cassandra for example has this level

Author: Andrew Ash <andrew@andrewash.com>

Closes #2519 from ash211/SPARK-3526 and squashes the following commits:

44cff28 [Andrew Ash] Link to spark.locality parameters rather than copying the list
6d5d966 [Andrew Ash] Stay focused on Spark, no astronaut architecture mumbo-jumbo
20e0e31 [Andrew Ash] SPARK-3526 Add section about data locality to the tuning guide
2014-12-10 15:01:15 -08:00
Andrew Or 56212831c6 [SPARK-4771][Docs] Document standalone cluster supervise mode
tdas looks like streaming already refers to the supervise mode. The link from there is broken though.

Author: Andrew Or <andrew@databricks.com>

Closes #3627 from andrewor14/document-supervise and squashes the following commits:

9ca0908 [Andrew Or] Wording changes
2b55ed2 [Andrew Or] Document standalone cluster supervise mode
2014-12-10 12:41:36 -08:00
Sandy Ryza 912563aa35 SPARK-4338. [YARN] Ditch yarn-alpha.
Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:

1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
2014-12-09 11:02:43 -08:00
Sandy Ryza cda94d15ea SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:

bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
2014-12-08 16:28:36 -08:00
CrazyJvm 6eb1b6f620 Streaming doc: do you mean inadvertently?
Author: CrazyJvm <crazyjvm@gmail.com>

Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:

b72886b [CrazyJvm] do you mean inadvertently?
2014-12-05 13:42:13 -08:00
Andrew Or fd8525334c Revert "SPARK-2624 add datanucleus jars to the container in yarn-cluster"
This reverts commit a975dc3279.
2014-12-04 21:53:49 -08:00
Masayoshi TSUZUKI ca379039f7 [SPARK-4464] Description about configuration options need to be modified in docs.
Added description about -h and -host.
Modified description about -i and -ip which are now deprecated.
Added description about --properties-file.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:

6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
2014-12-04 19:33:02 -08:00
Andy Konwinski 15cf3b0125 Fix typo in Spark SQL docs.
Author: Andy Konwinski <andykonwinski@gmail.com>

Closes #3611 from andyk/patch-3 and squashes the following commits:

7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
2014-12-04 18:27:02 -08:00
Masayoshi TSUZUKI ddfc09c363 [SPARK-4421] Wrong link in spark-standalone.html
Modified the link of building Spark.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:

56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
2014-12-04 18:14:36 -08:00
lewuathe ab8177da2d [SPARK-4652][DOCS] Add docs about spark-git-repo option
There might be cases when a work-in-progress Spark version needs to be run
on an EC2 cluster. In order to set up this type of cluster more easily,
add a description of the --spark-git-repo option to the EC2 documentation.

Author: lewuathe <lewuathe@me.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:

6dae8ee [lewuathe] Wrap consistent with other descriptions
cfaf9be [lewuathe] Add docs about spark-git-repo option

(Editing / cleanup by Josh Rosen)
2014-12-04 15:24:36 -08:00
Xiangrui Meng 7e758d7092 [FIX][DOC] Fix broken links in ml-guide.md
and some minor changes in ScalaDoc.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
2014-12-04 20:16:35 +08:00
Joseph K. Bradley 469a6e5f3b [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram  etrain  Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
2014-12-04 17:00:06 +08:00
Joseph K. Bradley 529439bd50 [docs] Fix outdated comment in tuning guide
When you use the SPARK_JAVA_OPTS env variable, Spark complains:

```
SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
```

This updates the docs to redirect the user to the relevant part of the configuration docs.

CC: mengxr  but please CC someone else as needed

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3592 from jkbradley/tuning-doc and squashes the following commits:

0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
2014-12-04 00:59:32 -08:00
Joseph K. Bradley 657a88835d [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification); a short sketch follows these change lists

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
 * Use train/test split, and compute test error instead of training error.
 * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
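
A sketch of the standardized parameter in use, numClasses rather than numClassesForClassification (data path and hyperparameters hypothetical; assumes spark-shell's `sc`):

```Scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val model = RandomForest.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
  featureSubsetStrategy = "auto", impurity = "gini",
  maxDepth = 4, maxBins = 32, seed = 12345)
```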

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
2014-12-04 09:57:50 +08:00
Joseph K. Bradley 27ab0b8a03 [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer
I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3569 from jkbradley/lr-doc and squashes the following commits:

654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
5035ad0 [Joseph K. Bradley] updated based on review
94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
2014-12-04 08:58:03 +08:00
Masayoshi TSUZUKI 692f49378f [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document.
Added descriptions about these parameters.
- spark.yarn.queue

Modified the description of the default value of this parameter.
- spark.yarn.submit.file.replication

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:

ce99655 [Masayoshi TSUZUKI] better grammatically.
21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update
2014-12-03 13:16:24 -08:00
Jim Lim a975dc3279 SPARK-2624 add datanucleus jars to the container in yarn-cluster
If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add them to the container.

This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container.

Author: Jim Lim <jim@quixey.com>

Closes #3238 from jimjh/SPARK-2624 and squashes the following commits:

3633071 [Jim Lim] SPARK-2624 update documentation and comments
fe95125 [Jim Lim] SPARK-2624 keep java imports together
6c31fe0 [Jim Lim] SPARK-2624 update documentation
6690fbf [Jim Lim] SPARK-2624 add tests
d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option
84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
2014-12-03 11:16:29 -08:00
Kay Ousterhout d9a148ba6a [SPARK-4686] Link to allowed master URLs is broken
The link points to the old scala programming guide; it should point to the submitting applications page.

This should be backported to 1.1.2 (it's been broken as of 1.0).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits:

a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken
2014-12-02 09:06:02 -08:00
Daoyuan Wang 5edbcbfb61 [SQL][DOC] Date type in SQL programming guide
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3535 from adrian-wang/datedoc and squashes the following commits:

18ff1ed [Daoyuan Wang] [DOC] Date type
2014-12-01 14:04:07 -08:00
wangfei 7b79957879 [SQL] Minor fix for doc and comment
Author: wangfei <wangfei1@huawei.com>

Closes #3533 from scwf/sql-doc1 and squashes the following commits:

962910b [wangfei] doc and comment fix
2014-12-01 14:02:02 -08:00
Cheng Lian 5db8dcaf49 [SPARK-4258][SQL][DOC] Documents spark.sql.parquet.filterPushdown
Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.
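
Turning it on where the caveats don't apply, a one-liner sketch:

```Scala
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
```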

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3440)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:

2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown
2014-12-01 13:09:51 -08:00
Madhu Siddalingaiah 2b233f5fc4 Documentation: add description for repartitionAndSortWithinPartitions
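
The operation being documented, in miniature (data hypothetical; assumes spark-shell's `sc`):

```Scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
// One shuffle that both repartitions and sorts by key; cheaper than
// repartition(...) followed by sorting inside each partition.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
```
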
Author: Madhu Siddalingaiah <madhu@madhu.com>

Closes #3390 from msiddalingaiah/master and squashes the following commits:

cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions
2014-12-01 08:45:34 -08:00
Cheng Lian 2a4d389f70 [DOC] Fixes formatting typo in SQL programming guide
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3498)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:

865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide
2014-11-30 19:04:07 -08:00
lewuathe a217ec5fd5 [SPARK-4656][Doc] Typo in Programming Guide markdown
Grammatical error in Programming Guide document

Author: lewuathe <lewuathe@me.com>

Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits:

a3e2f00 [lewuathe] Typo in Programming Guide markdown
2014-11-30 17:18:50 -08:00
Takuya UESHIN 0fcd24cc54 [DOCS][BUILD] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before running Maven; otherwise, inter-module dependencies are broken.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:

1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
2014-11-30 00:10:31 -05:00
Sean Owen 5d7fe178b3 SPARK-4170 [CORE] Closure problems when running Scala app that "extends App"
Warn against subclassing scala.App, and remove one instance of this in examples
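
The shape of the fix, sketched: an explicit main method instead of subclassing scala.App:

```Scala
// Fields of a scala.App subclass live in a delayed-init block, which can
// interact badly with closure serialization. Prefer an explicit main:
object MyApp {
  def main(args: Array[String]): Unit = {
    // build the SparkConf/SparkContext and run the job here
  }
}
```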

Author: Sean Owen <sowen@cloudera.com>

Closes #3497 from srowen/SPARK-4170 and squashes the following commits:

4a6131f [Sean Owen] Restore multiline string formatting
a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
2014-11-27 09:03:17 -08:00
CodingCat 5af53ada65 [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accmulator
https://issues.apache.org/jira/browse/SPARK-3628

In the current implementation, the accumulator will be updated for every successfully finished task, even if the task is from a resubmitted stage, which makes the accumulator counter-intuitive

In this patch, I changed the way the DAGScheduler updates accumulators.

DAGScheduler maintains a hash table mapping each stage id to the received <accumulator_id, value> pairs. Only when the stage becomes independent (no job needs it any more) do we accumulate the values of those pairs. When a task finishes, we check whether the hash table already contains that stage id, and we save the <accumulator_id, value> pair only when the task is the first finished task of a new stage or the stage is running for the first attempt.

Author: CodingCat <zhunansjtu@gmail.com>

Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:

701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback  some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable  the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
2014-11-26 16:52:04 -08:00
Xiangrui Meng 7eba0fbe45 [Spark-4509] Revert EC2 tag-based cluster membership patch
This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted:

SPARK-2333: 94053a7b76
SPARK-3213: 7faf755ae4
SPARK-3608: 78d4220fa0.

I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1:

https://github.com/apache/spark/pull/2225/files

JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3453 from mengxr/SPARK-4509 and squashes the following commits:

f0b708b [Xiangrui Meng] revert 94053a7b76
4298ea5 [Xiangrui Meng] revert 7faf755ae4
35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds"
2014-11-25 16:07:09 -08:00
Andrew Or 9afcbe494a [SPARK-4546] Improve HistoryServer first time user experience
The documentation points the user to run the following
```
sbin/start-history-server.sh
```
The first thing this does is throw an exception complaining that a log directory is not specified. The exception message itself does not say anything about what to set. Instead, we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`.
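
With this change the server starts against `file:/tmp/spark-events` out of the box; pointing it elsewhere stays a one-liner (a sketch: the property name follows the monitoring docs, and the HDFS path is a placeholder):

```
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/shared/spark-logs" \
  sbin/start-history-server.sh
```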

This is what it looks like as of this PR:

![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)

Author: Andrew Or <andrew@databricks.com>

Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits:

f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist
fc4c17a [Andrew Or] Improve HistoryServer UX
2014-11-25 15:48:02 -08:00
arahuja d240760191 [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first
The documentation for the two parameters is the same, with a pointer from the standalone parameter to the YARN parameter

Author: arahuja <aahuja11@gmail.com>

Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits:

51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
2014-11-25 08:23:41 -06:00
wangfei 0fe54cff19 [DOC][Build] Wrong cmd for build spark with apache hadoop 2.4.X and hive 12
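
The corrected command adds `-Phive` (a sketch following the 1.x building docs; adjust profiles to your environment):

```
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
```
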
Author: wangfei <wangfei1@huawei.com>

Closes #3335 from scwf/patch-10 and squashes the following commits:

d343113 [wangfei] add '-Phive'
60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support
2014-11-24 22:32:39 -08:00
Sandy Ryza 29372b6318 SPARK-4457. Document how to build for Hadoop versions greater than 2.4
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits:

5e72b77 [Sandy Ryza] Feedback
0cf05c1 [Sandy Ryza] Caveat
be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4
2014-11-24 13:28:48 -06:00
Reynold Xin 28fdc6f682 [Doc][GraphX] Remove unused png files. 2014-11-21 00:30:58 -08:00
Reynold Xin b97070ec78 [Doc][GraphX] Remove Motivation section and did some minor update. 2014-11-21 00:29:02 -08:00
Davies Liu 8cd6eea629 add Sphinx as a dependency of building docs
Author: Davies Liu <davies@databricks.com>

Closes #3388 from davies/doc_readme and squashes the following commits:

daa1482 [Davies Liu] add Sphinx dependency
2014-11-20 19:13:16 -08:00
Joseph E. Gonzalez 377b068209 Updating GraphX programming guide and documentation
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:

4421964 [Joseph E. Gonzalez] updating documentation for graphx
2014-11-19 16:53:33 -08:00
Marcelo Vanzin 397d3aae5b Bumping version to 1.3.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
2014-11-18 21:24:18 -08:00
Josh Rosen 0f3ceb56c7 [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts
This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).

**The solution implemented here is only a partial fix.**  A complete fix would have the following properties:

1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.

This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.

### The correct solution:

I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:

```scala
class SparkContext private (deps: SparkContextDependencies) {
  def this(conf: SparkConf) {
    this(SparkContext.getDeps(conf))
  }
}

object SparkContext {
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw new SparkException(...) }
    var dagScheduler: DAGScheduler = null
    try {
      dagScheduler = new DAGScheduler(...)
      [...]
    } catch {
      case e: Exception =>
        Option(dagScheduler).foreach(_.stop())
        [...]
    }
    SparkContextDependencies(dagScheduler, ....)
  }
}
```

This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.

This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no public constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.

### Alternative solutions:

As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.

The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.

### This PR's solution:

- At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
- If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
- At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.

This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.

This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`.  The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts.  I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.
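
Concretely, after this patch the second construction below fails fast (a minimal sketch; the exception message is paraphrased):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("demo").setMaster("local[2]")
val sc1 = new SparkContext(conf)
// The second construction now throws, unless the escape hatch is set:
//   conf.set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf) // throws: another SparkContext is active
```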

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:

23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
d38251b [Josh Rosen] Address latest round of feedback.
c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
85a424a [Josh Rosen] Incorporate more review feedback.
372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
f5bb78c [Josh Rosen] Update mvn build, too.
d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
79a7e6f [Josh Rosen] Fix commented out test
a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
2014-11-17 12:48:18 -08:00
Andy Konwinski cec1116b4b [DOCS][SQL] Fix broken link to Row class scaladoc
Author: Andy Konwinski <andykonwinski@gmail.com>

Closes #3323 from andyk/patch-2 and squashes the following commits:

4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc
2014-11-17 11:52:47 -08:00
zsxwing 861223ee5b [SPARK-4363][Doc] Update the Broadcast example
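
For context, the docs example in question is the canonical broadcast snippet (a sketch assuming an existing SparkContext `sc`):

```scala
val broadcastVar = sc.broadcast(Array(1, 2, 3))
// The array is shipped to each executor once and read via .value.
broadcastVar.value // Array(1, 2, 3)
```
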
Author: zsxwing <zsxwing@gmail.com>

Closes #3226 from zsxwing/SPARK-4363 and squashes the following commits:

8109914 [zsxwing] Update the Broadcast example
2014-11-14 22:28:48 -08:00
Sandy Ryza f5f757e4ed SPARK-4375. no longer require -Pscala-2.10
It seems like the winds might have moved away from this approach, but I wanted to post the PR anyway because I got it working and to show what it would look like.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits:

0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
cd42d94 [Sandy Ryza] Update doc
f6644c3 [Sandy Ryza] SPARK-4375 take 2
2014-11-14 14:21:57 -08:00
WangTao e421072da0 [SPARK-3722][Docs]minor improvement and fix in docs
https://issues.apache.org/jira/browse/SPARK-3722

Author: WangTao <barneystinson@aliyun.com>

Closes #2579 from WangTaoTheTonic/docsWork and squashes the following commits:

6f91cec [WangTao] use more wording express
29d22fa [WangTao] delete the specified version link
34cb4ea [WangTao] Update running-on-yarn.md
4ee1a26 [WangTao] minor improvement and fix in docs
2014-11-14 08:09:42 -06:00
Prashant Sharma daaca14c16 Support cross building for Scala 2.11
Let's give this another go using a version of Hive that shades its JLine dependency.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:

e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
2014-11-11 21:36:48 -08:00
Kousuke Saruta 3c07b8f082 [SPARK-4330][Doc] Link to proper URL for YARN overview
In running-on-yarn.md, there is a link to the YARN overview, but the URL points to the YARN alpha docs. It should point to the stable docs.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3196 from sarutak/SPARK-4330 and squashes the following commits:

30baa21 [Kousuke Saruta] Fixed running-on-yarn.md to point proper URL for YARN
2014-11-10 22:18:00 -08:00
Sandy Ryza c6f4e70421 SPARK-4230. Doc for spark.default.parallelism is incorrect
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits:

37a1d19 [Sandy Ryza] Clear up a couple things
34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect
2014-11-10 12:40:41 -08:00
Sean Owen 8c99a47a4f SPARK-971 [DOCS] Link to Confluence wiki from project website / documentation
This is a trivial change to add links to the wiki from `README.md` and the main docs page. It is already linked to from spark.apache.org.

Author: Sean Owen <sowen@cloudera.com>

Closes #3169 from srowen/SPARK-971 and squashes the following commits:

dcb84d0 [Sean Owen] Add link to wiki from README, docs home page
2014-11-09 17:40:48 -08:00
wangfei 636d7bcc96 [SQL][DOC][Minor] Spark SQL Hive now support dynamic partitioning
Author: wangfei <wangfei1@huawei.com>

Closes #3127 from scwf/patch-9 and squashes the following commits:

e39a560 [wangfei] now support dynamic partitioning
2014-11-07 11:43:35 -08:00
jay@apache.org 868cd4c3ca SPARK-4040. Update documentation to exemplify use of local (n) value, fo...
This is a minor docs update which helps to clarify the way local[n] is used for streaming apps.

Author: jay@apache.org <jayunit100>

Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits:

35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value.
2014-11-05 15:45:34 -08:00
Davies Liu c8abddc516 [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API
```
pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or against the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is a matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.
```
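
For comparison, the pre-existing Scala entry point (a minimal goodness-of-fit sketch; the counts are made up):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Goodness-of-fit test of observed counts against the uniform distribution.
val observed = Vectors.dense(4.0, 6.0, 5.0)
val result = Statistics.chiSqTest(observed)
println(result.pValue)
```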

Author: Davies Liu <davies@databricks.com>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
2014-11-04 21:35:52 -08:00
Aaron Davidson 5e73138a01 [SPARK-2938] Support SASL authentication in NettyBlockTransferService
Also lays the groundwork for supporting it inside the external shuffle service.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3087 from aarondav/sasl and squashes the following commits:

3481718 [Aaron Davidson] Delete rogue println
44f8410 [Aaron Davidson] Delete documentation - muahaha!
eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level
a6b95f1 [Aaron Davidson] Address comments
785bbde [Aaron Davidson] Cleanup
79973cb [Aaron Davidson] Remove unused file
151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
7b42adb [Aaron Davidson] Add unit tests
8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
2014-11-04 16:15:38 -08:00
Dariusz Kobylarz bcecd73fdd fixed MLlib Naive-Bayes java example bug
the filter tests Double objects by reference, whereas it should test their values
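
A minimal Scala illustration of the underlying pitfall (in Java, `==` on two boxed Double objects compares references, which is what the example was doing):

```scala
// Boxed java.lang.Double values must be compared by value, not reference.
val a = java.lang.Double.valueOf(0.5)
val b = java.lang.Double.valueOf(0.5)
println(a eq b)                          // false: distinct boxed objects
println(a.doubleValue == b.doubleValue)  // true: compares the values
```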

Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>

Closes #3081 from dkobylarz/master and squashes the following commits:

5d43a39 [Dariusz Kobylarz] naive bayes example update
a304b93 [Dariusz Kobylarz] fixed MLlib Naive-Bayes java example bug
2014-11-04 09:53:43 -08:00
Michael Armbrust 25bef7e695 [SQL] More aggressive defaults
- Turns on compression for in-memory cached data by default.
- Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory).
- Ups the batch size to 10,000 rows.
- Increases the broadcast threshold to 10mb.
- Uses our parquet implementation instead of the hive one by default.
- Caches parquet metadata by default.
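
These map to SQLConf settings; a hedged sketch of overriding a couple of them (key names as used in the 1.2-era docs, assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Revert Parquet output to Snappy if gzip is too slow for your workload.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
// In-memory columnar compression is now on by default; shown explicitly here.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
```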

Author: Michael Armbrust <michael@databricks.com>

Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:

97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults
2014-11-03 14:08:27 -08:00
wangfei 001acc4463 [SPARK-4177][Doc]update build doc since JDBC/CLI support hive 13 now
Author: wangfei <wangfei1@huawei.com>

Closes #3042 from scwf/patch-9 and squashes the following commits:

3784ed1 [wangfei] remove 'TODO'
1891553 [wangfei] update build doc since JDBC/CLI support hive 13
2014-11-02 22:02:05 -08:00
Aaron Davidson 1ae51f6dc7 [SPARK-4183] Enable NettyBlockTransferService by default
Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
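
If major issues do turn up, opting back out is a one-line config change (key name per the 1.2 config docs):

```scala
import org.apache.spark.SparkConf

// Opt back out to the old NIO implementation if problems appear.
val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")
```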

Author: Aaron Davidson <aaron@databricks.com>

Closes #3049 from aarondav/enable-netty and squashes the following commits:

bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
2014-11-02 18:14:57 -08:00
Davies Liu 6181577e99 [SPARK-3466] Limit size of results that a driver collects for each action
Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data.

This PR introduces spark.driver.maxResultSize; once it is set, the driver will abort a job if its serialized result is bigger than the limit.

By default, it's 1g (for backward compatibility in most cases).

In local mode, the driver and executors share the same JVM, so the default setting cannot protect the JVM from OOM.
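
A minimal sketch of raising the limit (per the config docs, "0" disables the check entirely):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("collect-demo")
  .set("spark.driver.maxResultSize", "2g") // default 1g; "0" means unlimited
val sc = new SparkContext(conf)
```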

cc mateiz

Author: Davies Liu <davies@databricks.com>

Closes #3003 from davies/collect and squashes the following commits:

248ed5e [Davies Liu] fix compile
272522e [Davies Liu] address comments
2c35773 [Davies Liu] add sizes in message of abort()
5d62303 [Davies Liu] address comments
bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
11f97c5 [Davies Liu] address comments
47b144f [Davies Liu] check the size of result before send and fetch
3d81af2 [Davies Liu] address comments
ca8267d [Davies Liu] limit the size of data by collect
2014-11-02 00:03:51 -07:00
Patrick Wendell 7894de276b Revert "[SPARK-4183] Enable NettyBlockTransferService by default"
This reverts commit 59e626c701.
2014-11-01 15:18:58 -07:00
Aaron Davidson 59e626c701 [SPARK-4183] Enable NettyBlockTransferService by default
Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3049 from aarondav/enable-netty and squashes the following commits:

bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
2014-11-01 13:15:24 -07:00
freeman 98c556ebbc Streaming KMeans [MLLIB][SPARK-3254]
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.

The PR includes:
- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors
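
A hedged usage sketch (method names per this PR's docs additions; `trainingData` and `testData` are assumed DStreams defined elsewhere):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

// trainingData: DStream[Vector] and testData: DStream[LabeledPoint] are
// assumed to be defined elsewhere (e.g. parsed from ssc.textFileStream).
val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)       // 1.0 = remember all past data
  .setRandomCenters(2, 0.0)  // dimension 2, zero initial weight
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
```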

tdas mengxr rezazadeh

Author: freeman <the.freeman.lab@gmail.com>
Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits:

b2e5b4a [freeman] Fixes to docs / examples
078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
0411bf5 [freeman] Change decay parameterization
9f7aea9 [freeman] Style fixes
374a706 [freeman] Formatting
ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
77dbd3f [freeman] Make initialization check an assertion
9cfc301 [freeman] Make random seed an argument
44050a9 [freeman] Simpler constructor
c7050d5 [freeman] Fix spacing
2899623 [freeman] Use pattern matching for clarity
a4a316b [freeman] Use collect
1472ec5 [freeman] Doc formatting
ea22ec8 [freeman] Fix imports
2086bdc [freeman] Log cluster center updates
ea9877c [freeman] More documentation
9facbe3 [freeman] Bug fix
5db7074 [freeman] Example usage for StreamingKMeans
f33684b [freeman] Add explanation and example to docs
b5b5f8d [freeman] Add better documentation
a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
b93350f [freeman] Streaming KMeans with decay
2014-10-31 22:30:12 -07:00
Anant e07fb6a41e [SPARK-3838][examples][mllib][python] Word2Vec example in python
This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838

Python example for word2vec
mengxr

Author: Anant <anant.asty@gmail.com>

Closes #2952 from anantasty/SPARK-3838 and squashes the following commits:

87bd723 [Anant] remove stop line
4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs.
3d3c9ee [Anant] Added empty line after python imports
0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
ee4f5f6 [Anant] Fixes from code review comments
c637bcf [Anant] Added word2vec python example to docs
269f31f [Anant] added example in docs
c015b14 [Anant] Added python example for word2vec
2014-10-31 18:33:19 -07:00
Kousuke Saruta 4d52cec21d [SPARK-4089][Doc][Minor] The version number of Spark in _config.yaml is wrong.
The version number of Spark in docs/_config.yaml for master branch should be 1.2.0 for now.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2943 from sarutak/SPARK-4089 and squashes the following commits:

aba7fb4 [Kousuke Saruta] Fixed the version number of Spark in _config.yaml
2014-10-28 12:44:12 -07:00
Davies Liu fae095bc7c [SPARK-3961] [MLlib] [PySpark] Python API for mllib.feature
Added completed Python API for MLlib.feature

Normalizer
StandardScalerModel
StandardScaler
HashTF
IDFModel
IDF

cc mengxr

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2819 from davies/feature and squashes the following commits:

4f48f48 [Davies Liu] add a note for HashingTF
67f6d21 [Davies Liu] address comments
b628693 [Davies Liu] rollback changes in Word2Vec
efb4f4f [Davies Liu] Merge branch 'master' into feature
806c7c2 [Davies Liu] address comments
3abb8c2 [Davies Liu] address comments
59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature
a405ae7 [Davies Liu] fix tests
7a1891a [Davies Liu] fix tests
486795f [Davies Liu] update programming guide, HashTF -> HashingTF
8a50584 [Davies Liu] Python API for mllib.feature
2014-10-28 03:50:22 -07:00
Prashant Sharma c9e05ca27c [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2878 from ScrapCodes/SPARK-4032/deprecate-yarn-alpha and squashes the following commits:

17e9857 [Prashant Sharma] added deperecated comment to Client and ExecutorRunnable.
3a34b1e [Prashant Sharma] Updated docs...
4608dea [Prashant Sharma] [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
2014-10-27 10:02:48 -07:00
Kousuke Saruta dc51f4d6d8 [SQL][DOC] Wrong package name "scala.math.sql" in sql-programming-guide.md
In sql-programming-guide.md, there is a wrong package name "scala.math.sql".

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2873 from sarutak/wrong-packagename-fix and squashes the following commits:

4d5ecf4 [Kousuke Saruta] Fixed wrong package name in sql-programming-guide.md
2014-10-26 16:27:29 -07:00
Josh Rosen 9530316887 [SPARK-2321] Stable pull-based progress / status API
This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)).  For now, I'd like to discuss the basic implementation, API names, and overall interface design.  Once we arrive at a good design, I'll go back and add additional methods to expose more information via these API.

#### Design goals:

- Pull-based API
- Usable from Java / Scala / Python (eventually, likely with a wrapper)
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Don't leak any implementation details, preserving our freedom to change the implementation.

#### Implementation:

- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values.  These interfaces consist entirely of Java-style getter methods.  The interfaces are currently implemented in Java.  I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
- Allow an existing JobProgressListener to be used when constructing a live SparkUI.  This allows us to re-use these listeners in the implementation of this status API.  There are a few reasons why this listener re-use makes sense:
   - The status API and web UI are guaranteed to show consistent information.
   - These listeners are already well-tested.
   - The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.

The progress API methods are implemented in a separate trait that's mixed into SparkContext.  This helps to avoid SparkContext.scala from becoming larger and more difficult to read.
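
A hedged polling sketch against the interface described above (method and field names are taken from this description and may differ from the released API; `sc` and `jobId` are assumed to exist):

```scala
// Names follow the PR description (getJobInfo / getStageInfo returning
// SparkJobInfo / SparkStageInfo); treat exact signatures as illustrative.
sc.getJobInfo(jobId).foreach { job =>
  println(s"job ${job.jobId} is ${job.status}")
  job.stageIds.flatMap(sc.getStageInfo(_)).foreach { stage =>
    println(s"  stage ${stage.stageId}: ${stage.numCompletedTasks}/${stage.numTasks} tasks")
  }
}
```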

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>

Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:

e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
2014-10-25 00:06:57 -07:00
Zhan Zhang 7c89a8f0c8 [SPARK-2706][SQL] Enable Spark to support Hive 0.13
Given that a lot of users are trying to use Hive 0.13 in Spark, and given the API-level incompatibility between hive-0.12 and hive-0.13, I want to propose the following approach, which has no or minimal impact on existing hive-0.12 support but is able to jumpstart the development of hive-0.13 and future version support.

Approach: Introduce a “hive-version” property, and manipulate pom.xml files to support different Hive versions at compile time through a shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically:

1. For each different hive version, there is a very light layer of shim code to handle API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1

2. Add a new profile, hive-default, active by default, which picks up all existing configuration and the hive-0.12.0 shim (v0.12.0) if no hive.version is specified.

3. If the user specifies a different version (currently only 0.13.1, via -Dhive.version=0.13.1), the hive-versions profile will be activated, which picks up the hive-version-specific shim layer and configuration, mainly the Hive jars and the hive-version shim, e.g., v0.13.1.

4. With this approach, nothing is changed with current hive-0.12 support.

No change by default: sbt/sbt -Phive
For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

Note that in hive-0.13, hive-thriftserver is not enabled; that should be fixed by another JIRA, and we don’t need -Phive with -Dhive.version when building (probably we should use -Phive -Dhive.version=xxx instead once the thrift server is also supported in hive-0.13.1).

Author: Zhan Zhang <zhazhan@gmail.com>
Author: zhzhan <zhazhan@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2241 from zhzhan/spark-2706 and squashes the following commits:

3ece905 [Zhan Zhang] minor fix
410b668 [Zhan Zhang] solve review comments
cbb4691 [Zhan Zhang] change run-test for new options
0d4d2ed [Zhan Zhang] rebase
497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
8fad1cf [Zhan Zhang] change the pom file and make hive-0.13.1 as the default
ab028d1 [Zhan Zhang] rebase
4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4cb1b93 [zhzhan] Merge pull request #1 from pwendell/pr-2241
b0478c0 [Patrick Wendell] Changes to simplify the build of SPARK-2706
2b50502 [Zhan Zhang] rebase
a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb22863 [Zhan Zhang] correct the typo
20f6cf7 [Zhan Zhang] solve compatability issue
f7912a9 [Zhan Zhang] rebase and solve review feedback
301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
10c3565 [Zhan Zhang] address review comments
6bc9204 [Zhan Zhang] rebase and remove temparory repo
d3aa3f2 [Zhan Zhang] Merge branch 'master' into spark-2706
cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3ced0d7 [Zhan Zhang] rebase
d9b981d [Zhan Zhang] rebase and fix error due to rollback
adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3dd50e8 [Zhan Zhang] solve conflicts and remove unnecessary implicts
d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
dc7bdb3 [Zhan Zhang] solve conflicts
7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d7c3e1e [Zhan Zhang] Merge branch 'master' into spark-2706
68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d48bd18 [Zhan Zhang] address review comments
3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
57ea52e [Zhan Zhang] Merge branch 'master' into spark-2706
2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
9412d24 [Zhan Zhang] address review comments
f4af934 [Zhan Zhang] rebase
1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
128b60b [Zhan Zhang] ignore 0.12.0 test cases for the time being
af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
5f5619f [Zhan Zhang] restructure the directory and different hive version support
05d3683 [Zhan Zhang] solve conflicts
e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
94b4fdc [Zhan Zhang] Spark-2706: hive-0.13.1 support on spark
87ebf3b [Zhan Zhang] Merge branch 'master' into spark-2706
921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f896b2a [Zhan Zhang] Merge branch 'master' into spark-2706
789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f6a8a40 [Zhan Zhang] revert
ba14f28 [Zhan Zhang] test
dbedff3 [Zhan Zhang] Merge remote-tracking branch 'upstream/master'
70964fe [Zhan Zhang] revert
fe0f379 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
70ffd93 [Zhan Zhang] revert
42585ec [Zhan Zhang] test
7d5fce2 [Zhan Zhang] test
2014-10-24 11:03:17 -07:00
Kousuke Saruta f799700eec [SPARK-4055][MLlib] Inconsistent spelling 'MLlib' and 'MLLib'
There are some inconsistent spellings, 'MLlib' and 'MLLib', in some documents and source code.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2903 from sarutak/SPARK-4055 and squashes the following commits:

b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
2014-10-23 09:19:32 -07:00
Sandy Ryza 6bb56faea8 SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
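
The utility in question is `SparkConf.registerKryoClasses`; a minimal sketch (the case classes are hypothetical placeholders):

```scala
import org.apache.spark.SparkConf

case class MyClass1(x: Int)    // hypothetical user classes
case class MyClass2(s: String)

// registerKryoClasses both switches spark.serializer to KryoSerializer and
// appends the class names to spark.kryo.classesToRegister.
val conf = new SparkConf()
  .setAppName("kryo-demo")
  .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
```
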
Author: Sandy Ryza <sandy@cloudera.com>

Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:

48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
2014-10-21 21:53:09 -07:00
Josh Rosen 7e63bb49c5 [SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport)
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`.  The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops.  The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks.  Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.
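
A minimal sketch of opting in:

```scala
import org.apache.spark.SparkConf

// Opt in to per-task JobConf cloning (off by default, as noted above).
val conf = new SparkConf().set("spark.hadoop.cloneConf", "true")
```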

Author: Josh Rosen <joshrosen@apache.org>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.

(cherry picked from commit 2cd40db2b3)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
2014-10-19 00:35:05 -07:00
Davies Liu 05db2da7dc [SPARK-3952] [Streaming] [PySpark] add Python examples in Streaming Programming Guide
This adds Python examples to the Streaming Programming Guide.

Also add RecoverableNetworkWordCount example.

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2808 from davies/pyguide and squashes the following commits:

8d4bec4 [Davies Liu] update readme
26a7e37 [Davies Liu] fix format
3821c4d [Davies Liu] address comments, add missing file
7e4bb8a [Davies Liu] add Python examples in Streaming Programming Guide
2014-10-18 19:14:48 -07:00
WangTaoTheTonic e7f4ea8a52 [SPARK-3890][Docs]remove redundant spark.executor.memory in doc
Introduced in f7e79bc42c; I'm not sure why we need spark.executor.memory twice here.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #2745 from WangTaoTheTonic/redundantconfig and squashes the following commits:

e7564dc [WangTao] too long line
fdbdb1f [WangTaoTheTonic] trivial workaround
d06b6e5 [WangTaoTheTonic] remove redundant spark.executor.memory in doc
2014-10-16 19:12:57 -07:00
Aaron Davidson 7f7b50ed9d [SPARK-3923] Increase Akka heartbeat pause above heartbeat interval
Something about the Akka 2.3.4 upgrade seems to have made the issue manifest, where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that the heartbeat pause should be greater than the heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!

I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
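
Expressed as configuration (key names from the 1.x config docs; values, in seconds, per this change):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.heartbeat.interval", "1000") // unchanged
  .set("spark.akka.heartbeat.pauses", "6000")   // raised from 600s by this PR
```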

Author: Aaron Davidson <aaron@databricks.com>

Closes #2784 from aarondav/fix-timeout and squashes the following commits:

bd1151a [Aaron Davidson] Increase pause, don't decrease interval
9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
2014-10-16 18:58:18 -07:00
GuoQiang Li 293a0b5dbb [SPARK-2098] All Spark processes should support spark-defaults.conf, config file
This is another implementation of #1256.
cc andrewor14 vanzin

Author: GuoQiang Li <witgo@qq.com>

Closes #2379 from witgo/SPARK-2098-new and squashes the following commits:

4ef1cbd [GuoQiang Li] review commit
49ef70e [GuoQiang Li] Refactor getDefaultPropertiesFile
c45d20c [GuoQiang Li] All Spark processes should support spark-defaults.conf, config file
2014-10-14 22:16:38 -07:00
Sean Owen 18ab6bd709 SPARK-1307 [DOCS] Don't use term 'standalone' to refer to a Spark Application
HT to Diana, just proposing an implementation of her suggestion, which I rather agreed with. Is there a second/third for the motion?

Refer to "self-contained" rather than "standalone" apps to avoid confusion with standalone deployment mode. And fix placement of reference to this in MLlib docs.

Author: Sean Owen <sowen@cloudera.com>

Closes #2787 from srowen/SPARK-1307 and squashes the following commits:

b5b82e2 [Sean Owen] Refer to "self-contained" rather than "standalone" apps to avoid confusion with standalone deployment mode. And fix placement of reference to this in MLlib docs.
2014-10-14 21:37:51 -07:00
w00228970 92e017fb89 [SPARK-3899][Doc]fix wrong links in streaming doc
There are three [Custom Receiver Guide] links in the streaming doc; the first is wrong.

Author: w00228970 <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #2749 from scwf/streaming-doc and squashes the following commits:

0cd76b7 [wangfei] update link tojump to the Akka-specific section
45b0646 [w00228970] wrong link in streaming doc
2014-10-12 23:35:50 -07:00
Josh Rosen 4e9b551a0b [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython support improvements:
This pull request addresses a few issues related to PySpark's IPython support:

- Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).

There are more details in a block comment in `bin/pyspark`.
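
With these changes, running the driver under IPython while the workers keep their own Python looks roughly like this (a sketch, not the full set of supported combinations):

```
PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark

# Driver-side options go in the renamed variable:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
```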

Author: Josh Rosen <joshrosen@apache.org>

Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:

7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
2014-10-09 16:08:07 -07:00
nartz 13cab5ba44 add spark.driver.memory to config docs
It took me a minute to track this down, so I thought it could be useful to have it in the docs.

I'm unsure whether 512mb is the default for spark.driver.memory. Also, there could be a better value for the 'description' to differentiate it from spark.executor.memory.

Author: nartz <nartzpod@gmail.com>
Author: Nathan Artz <nathanartz@Nathans-MacBook-Pro.local>

Closes #2410 from nartz/docs/add-spark-driver-memory-to-config-docs and squashes the following commits:

a2f6c62 [nartz] Update configuration.md
74521b8 [Nathan Artz] add spark.driver.memory to config docs
2014-10-09 00:02:11 -07:00
Davies Liu 798ed22c28 [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs
Retire Epydoc, use Sphinx to generate API docs.

Refine Sphinx docs, also convert some docstrings into Sphinx style.

It looks like:
![api doc](https://cloud.githubusercontent.com/assets/40902/4538272/9e2d4f10-4dec-11e4-8d96-6e45a8fe51f9.png)

Author: Davies Liu <davies.liu@gmail.com>

Closes #2689 from davies/docs and squashes the following commits:

bf4a0a5 [Davies Liu] fix links
3fb1572 [Davies Liu] fix _static in jekyll
65a287e [Davies Liu] fix scripts and logo
8524042 [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs
d5b874a [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs
4bc1c3c [Davies Liu] refactor
746d0b6 [Davies Liu] @param -> :param
240b393 [Davies Liu] replace epydoc with sphinx doc
2014-10-07 18:09:27 -07:00
scwf c9ae79fba2 [SPARK-3765][Doc] Add test information to sbt build docs
Add testing with sbt to the ```building-spark.md``` doc

Author: scwf <wangfei1@huawei.com>

Closes #2629 from scwf/sbt-doc and squashes the following commits:

fd9cf29 [scwf] add testing with sbt to docs
2014-10-05 21:36:28 -07:00
Kousuke Saruta 1eb8389cb4 [SPARK-3763] The example of building with sbt should be "sbt assembly" instead of "sbt compile"
In building-spark.md, there are some examples of making an assembled package with Maven, but the example for building with sbt only covers compiling.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2627 from sarutak/SPARK-3763 and squashes the following commits:

fadb990 [Kousuke Saruta] Modified the example to build with sbt in building-spark.md
2014-10-03 13:26:30 -07:00
Brenden Matthews a8c52d5343 [SPARK-3535][Mesos] Fix resource handling.
Author: Brenden Matthews <brenden@diddyinc.com>

Closes #2401 from brndnmtthws/master and squashes the following commits:

4abaa5d [Brenden Matthews] [SPARK-3535][Mesos] Fix resource handling.
2014-10-03 12:58:04 -07:00
EugenCepoi f0811f928e SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
Update of PR #997.

With this PR, setting SPARK_CONF_DIR overrides SPARK_HOME/conf (not only spark-defaults.conf and spark-env).

Author: EugenCepoi <cepoi.eugen@gmail.com>

Closes #2481 from EugenCepoi/SPARK-2058 and squashes the following commits:

0bb32c2 [EugenCepoi] use orElse orNull and fixing trailing percent in compute-classpath.cmd
77f35d7 [EugenCepoi] SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
2014-10-03 10:03:15 -07:00
scwf c6469a02f1 [SPARK-3766][Doc]Snappy is also the default compress codec for broadcast variables
Author: scwf <wangfei1@huawei.com>

Closes #2632 from scwf/compress-doc and squashes the following commits:

7983a1a [scwf] snappy is the default compression codec for broadcast
2014-10-02 13:47:30 -07:00
Nishkam Ravi b4fb7b80a0 Modify default YARN memory_overhead-- from an additive constant to a multiplier
Redone against the recent master branch (https://github.com/apache/spark/pull/1391)
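
Under the multiplier scheme, a hedged sketch of the resulting overhead (the 0.07 factor and 384 MB floor are the values discussed in this PR):

```scala
// Assumed per this change: overhead = max(384 MB, 0.07 * executorMemory).
val executorMemoryMB = 8192
val overheadMB = math.max(384, (0.07 * executorMemoryMB).toInt) // 573 MB
```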

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes #2485 from nishkamravi2/master_nravi and squashes the following commits:

636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2014-10-02 13:48:35 -05:00
Yin Huai 82a6a083a4 [SQL][Docs] Update the output of printSchema and fix a typo in SQL programming guide.
We have changed the output format of `printSchema`. This PR will update our SQL programming guide to show the updated format. Also, it fixes a typo (the value type of `StructType` in Java API).

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2630 from yhuai/sqlDoc and squashes the following commits:

267d63e [Yin Huai] Update the output of printSchema and fix a typo.
2014-10-02 11:37:24 -07:00
cocoatomo 5b4a5b1acd [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
### Problem

The section "Using the shell" in Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run pyspark REPL through IPython.
But the following command does not run IPython but the default Python executable.

```
$ IPYTHON=1 ./bin/pyspark
Python 2.7.8 (default, Jul  2 2014, 10:14:46)
...
```

the spark/bin/pyspark script, as of commit b235e01363, decides which executable and options to use in the following way.

1. if PYSPARK_PYTHON unset
   * → defaulting to "python"
2. if IPYTHON_OPTS set
   * → set IPYTHON "1"
3. some python scripts passed to ./bin/pyspark → run it with ./bin/spark-submit
   * out of this issue's scope
4. if IPYTHON set as "1"
   * → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
   * otherwise execute $PYSPARK_PYTHON

Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON is "1".
In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no effect on deciding which command to use.

PYSPARK_PYTHON | IPYTHON_OPTS | IPYTHON | resulting command | expected command
---- | ---- | ----- | ----- | -----
(unset → defaults to python) | (unset) | (unset) | python | (same)
(unset → defaults to python) | (unset) | 1 | python | ipython
(unset → defaults to python) | an_option | (unset → set to 1) | python an_option | ipython an_option
(unset → defaults to python) | an_option | 1 | python an_option | ipython an_option
ipython | (unset) | (unset) | ipython | (same)
ipython | (unset) | 1 | ipython | (same)
ipython | an_option | (unset → set to 1) | ipython an_option | (same)
ipython | an_option | 1 | ipython an_option | (same)

### Suggestion

The pyspark script should first determine whether the user wants to run IPython or another executable.

1. if IPYTHON_OPTS set
   * set IPYTHON "1"
2.  if IPYTHON has a value "1"
   * PYSPARK_PYTHON defaults to "ipython" if not set
3. PYSPARK_PYTHON defaults to "python" if not set

See the pull request for more detailed modification.

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #2554 from cocoatomo/issues/cannot-run-ipython-without-options and squashes the following commits:

d2a9b06 [cocoatomo] [SPARK-3706][PySpark] Use PYTHONUNBUFFERED environment variable instead of -u option
264114c [cocoatomo] [SPARK-3706][PySpark] Remove the sentence about deprecated environment variables
42e02d5 [cocoatomo] [SPARK-3706][PySpark] Replace environment variables used to customize execution of PySpark REPL
10d56fb [cocoatomo] [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
2014-10-02 11:13:19 -07:00
Davies Liu c5414b6818 [SPARK-3478] [PySpark] Profile the Python tasks
This patch adds profiling support for PySpark; it will show the profiling results
before the driver exits. Here is one example:

```
============================================================
Profile of RDD<id=3>
============================================================
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```

Also, users can show profile results manually via `sc.show_profiles()` or dump them to disk
via `sc.dump_profiles(path)`, for example:

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
Profiling is disabled by default and can be enabled by setting "spark.python.profile=true".

Also, users can dump the results to disk automatically for future analysis, via "spark.python.profile.dump=path_to_dump"

This is a bugfix of #2351. cc JoshRosen

Author: Davies Liu <davies.liu@gmail.com>

Closes #2556 from davies/profiler and squashes the following commits:

e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
858e74c [Davies Liu] compatitable with python 2.6
7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
2b0daf2 [Davies Liu] fix docs
7a56c24 [Davies Liu] bugfix
cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
09d02c3 [Davies Liu] Merge branch 'master' into profiler
c23865c [Davies Liu] Merge branch 'master' into profiler
15d6f18 [Davies Liu] add docs for two configs
dadee1a [Davies Liu] add docs string and clear profiles after show or dump
4f8309d [Davies Liu] address comment, add tests
0a5b6eb [Davies Liu] fix Python UDF
4b20494 [Davies Liu] add profile for python
2014-09-30 18:24:57 -07:00