1175 commits

Author SHA1 Message Date
sethah 2a9fe4a4e7 [SPARK-6129] [MLLIB] [DOCS] Added user guide for evaluation metrics
Author: sethah <seth.hendrickson16@gmail.com>

Closes #7655 from sethah/Working_on_6129 and squashes the following commits:

253db2d [sethah] removed number formatting from example code
b769cab [sethah] rewording threshold section
d5dad4d [sethah] adding some explanations of concepts to the eval metrics user guide
3a61ff9 [sethah] Removing unnecessary latex commands from metrics guide
c9dd058 [sethah] Cleaning up and formatting metrics user guide section
6f31c21 [sethah] All example code for metrics section done
98813fe [sethah] Most java and python example code added. Further latex formatting
53a24fc [sethah] Adding documentations of metrics for ML algorithms to user guide
2015-07-29 18:23:07 -07:00
Marcelo Vanzin 31ec6a871e [SPARK-9327] [DOCS] Fix documentation about classpath config options.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7651 from vanzin/SPARK-9327 and squashes the following commits:

2923e23 [Marcelo Vanzin] [SPARK-9327] [docs] Fix documentation about classpath config options.
2015-07-28 11:48:56 -07:00
Alexander Ulanov 90006f3c51 Pregel example type fix
Pregel example to express single source shortest path from https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api does not work due to incorrect type. The reason is that `GraphGenerators.logNormalGraph` returns the graph with `Long` vertices. Fixing `val graph: Graph[Int, Double]` to `val graph: Graph[Long, Double]`.

Author: Alexander Ulanov <nashb@yandex.ru>

Closes #7695 from avulanov/SPARK-9380-pregel-doc and squashes the following commits:

c269429 [Alexander Ulanov] Pregel example type fix
2015-07-28 01:33:31 +09:00
Carson Wang 6228381657 [SPARK-8405] [DOC] Add how to view logs on Web UI when yarn log aggregation is enabled
Some users may not be aware that the logs are available on the Web UI even if YARN log aggregation is enabled. Update the doc to make this clear and to describe what needs to be configured.

Author: Carson Wang <carson.wang@intel.com>

Closes #7463 from carsonwang/YarnLogDoc and squashes the following commits:

274c054 [Carson Wang] Minor text fix
74df3a1 [Carson Wang] address comments
5a95046 [Carson Wang] Update the text in the doc
e5775c1 [Carson Wang] Update doc about how to view the logs on Web UI when yarn log aggregation is enabled
2015-07-27 08:02:40 -05:00
Cheng Lian bebe3f7b45 [SPARK-9207] [SQL] Enables Parquet filter push-down by default
PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now.

Author: Cheng Lian <lian@databricks.com>

Closes #7612 from liancheng/spark-9207 and squashes the following commits:

77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
2015-07-23 17:49:33 -07:00
Josh Rosen b217230f2a [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
Spark has an option called spark.localExecution.enabled; according to the docs:

> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.

This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.

This pull request simply brings #7484 up to date.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7585 from rxin/remove-local-exec and squashes the following commits:

84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
2015-07-22 21:04:04 -07:00
Matei Zaharia fe26584a1f [SPARK-9244] Increase some memory defaults
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults to 10 but is often hit for map
  output statuses in large shuffles. This memory is not fully allocated
  up-front, so we can just make this larger and still not affect jobs
  that never send a status that large. We increase it to 128.

- spark.executor.memory: Defaults to 512m, which is really small. We
  increase it to 1g.

Author: Matei Zaharia <matei@databricks.com>

Closes #7586 from mateiz/configs and squashes the following commits:

ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
2015-07-22 15:28:09 -07:00
MechCoder 89db3c0b6e [SPARK-5989] [MLLIB] Model save/load for LDA
Add support for saving and loading both the local and distributed versions of LDA models.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6948 from MechCoder/lda_save_load and squashes the following commits:

49bcdce [MechCoder] minor style fixes
cc14054 [MechCoder] minor
4587d1d [MechCoder] Minor changes
c753122 [MechCoder] Load and save the model in private methods
2782326 [MechCoder] [SPARK-5989] Model save/load for LDA
2015-07-21 10:31:31 -07:00
Michael Allman f5b6dc5e3e [SPARK-8401] [BUILD] Scala version switching build enhancements
These commits address a few minor issues in the Scala cross-version support in the build:

  1. Correct two missing `${scala.binary.version}` pom file substitutions.
  2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
  3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
  4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.

This is my original work and I license this work to the Spark project under the Apache License.

Author: Michael Allman <michael@videoamp.com>

Closes #6832 from mallman/scala-versions and squashes the following commits:

cde2f17 [Michael Allman] Delete dev/change-version-to-*.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument
02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed
ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version
475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}
2015-07-21 11:14:31 +01:00
Timothy Chen d86bbb4e28 [SPARK-6284] [MESOS] Add mesos role, principal and secret
Mesos supports setting authentication and a role per framework. The role identifies the framework and affects its sharing weight in resource allocation; the optional authentication information allows the framework to connect to the master.

Author: Timothy Chen <tnachen@gmail.com>

Closes #4960 from tnachen/mesos_fw_auth and squashes the following commits:

0f9f03e [Timothy Chen] Fix review comments.
8f9488a [Timothy Chen] Fix rebase
f7fc2a9 [Timothy Chen] Add mesos role, auth and secret.
2015-07-16 19:37:15 -07:00
Shuo Xiang 303c1201c4 [SPARK-7555] [DOCS] Add doc for elastic net in ml-guide and mllib-guide
jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`.

dbtsai I left the code tab for you to add example code. Do you think it is the right place?

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #6504 from coderxiang/elasticnet and squashes the following commits:

f6061ee [Shuo Xiang] typo
90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
8747190 [Shuo Xiang] merge master
706d3f7 [Shuo Xiang] add python code
9bc2b4c [Shuo Xiang] typo
db32a60 [Shuo Xiang] java code sample
aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
a0dae07 [Shuo Xiang] simplify code
d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
78d9366 [Shuo Xiang] address comments
8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
9262a72 [Shuo Xiang] update
7e07d12 [Shuo Xiang] update
b32f21a [Shuo Xiang] add doc for elastic net in sparkml
937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
2015-07-15 12:10:53 -07:00
FlytxtRnD 3f6296fed4 [SPARK-8018] [MLLIB] KMeans should accept initial cluster centers as param
This allows KMeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.

Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits:

94b56df [FlytxtRnD] style correction
ef95ee2 [FlytxtRnD] style correction
c446c58 [FlytxtRnD] documentation and numRuns warning change
06d13ef [FlytxtRnD] numRuns corrected
d12336e [FlytxtRnD] numRuns variable modifications
07f8554 [FlytxtRnD] remove setRuns from setIntialModel
e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
242ead1 [FlytxtRnD] corrected == to === in assert
714acb5 [FlytxtRnD] added numRuns
60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed
582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
3f5fc8e [FlytxtRnD] test case modified and one runs condition added
cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018
e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria
6959861 [FlytxtRnD] Accept initial cluster centers in KMeans
2015-07-14 23:29:02 -07:00
zhaishidan c1feebd8fc [SPARK-9010] [DOCUMENTATION] Improve the Spark Configuration document about spark.kryoserializer.buffer
The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.".

The spark.kryoserializer.buffer.max.mb setting is out of date in Spark 1.4.

Author: zhaishidan <zhaishidan@haizhi.com>

Closes #7393 from stanzhai/master and squashes the following commits:

69729ef [zhaishidan] fix document error about spark.kryoserializer.buffer.max.mb
2015-07-14 08:54:30 +01:00
jose.cambronero 9c5075775d [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs
This contribution is my original work and I license it to the project under its open source license.
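
For orientation, the statistic behind a one-sample, two-sided Kolmogorov-Smirnov test can be sketched in a few lines of plain Python. This is a single-machine illustration only, not the distributed MLlib implementation added by this commit; the names `standard_normal_cdf` and `ks_statistic` are hypothetical, and no p-value is computed.

```python
import math


def standard_normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf=standard_normal_cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    data = sorted(sample)
    n = len(data)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(data):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so compare the
        # theoretical CDF against both sides of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


print(ks_statistic([0.0]))  # 0.5
```

Checking both sides of each step of the empirical CDF is what makes the test two-sided.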

Author: jose.cambronero <jose.cambronero@cloudera.com>

Closes #6994 from josepablocam/master and squashes the following commits:

bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
1bb44bd [jose.cambronero]  style and doc changes. Factored out ks test into 2 separate tests
2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
e760ebd [jose.cambronero] line length changes to fit style check
3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
1226b30 [jose.cambronero] reindent multi-line lambdas, prior intepretation of style guide was wrong on my part
9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
b9cff3a [jose.cambronero] made small changes to pass style check
ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
4da189b [jose.cambronero] added user facing ks test functions
c659ea1 [jose.cambronero] created KS test class
13dfe4d [jose.cambronero] created test result class for ks test
2015-07-10 20:55:45 -07:00
Andrew Or 5dd45bde4a [SPARK-8958] Dynamic allocation: change cached timeout to infinity
pwendell and I discussed this a little more offline and concluded that it would be good to keep it more conservative. Losing cached blocks may be very expensive and we should only allow it if the user knows what he/she is doing.

FYI harishreedharan sryza.

Author: Andrew Or <andrew@databricks.com>

Closes #7329 from andrewor14/da-cached-timeout and squashes the following commits:

cef0b4e [Andrew Or] Change timeout to infinity
2015-07-10 09:48:17 -07:00
Michael Vogiatzis d538919cc4 [DOCS] Added important updateStateByKey details
The update function runs for *all* existing keys, and returning "None" removes the key-value pair.
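
The two details above can be illustrated with a small pure-Python simulation. It does not use Spark; `update_state_by_key` and `running_count` are hypothetical names, and the update function is assumed to take the new values for a key plus its previous state, in the style of the streaming guide's examples.

```python
def update_state_by_key(state, batch, update_func):
    """Simulate one batch: call update_func for every key that is in the
    previous state or in the new batch (hypothetical helper, not Spark)."""
    new_values = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)

    next_state = {}
    for key in set(state) | set(new_values):
        # Keys with no new data still get a call, with an empty value list.
        updated = update_func(new_values.get(key, []), state.get(key))
        if updated is not None:  # returning None drops the key
            next_state[key] = updated
    return next_state


def running_count(values, previous):
    """Keep a running sum, but drop keys that received nothing this batch."""
    if not values:
        return None
    return (previous or 0) + sum(values)


state = update_state_by_key({}, [("a", 1), ("b", 1), ("a", 1)], running_count)
print(state == {"a": 2, "b": 1})  # True
state = update_state_by_key(state, [("a", 1)], running_count)
print(state == {"a": 3})  # True: "b" was removed
```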

Author: Michael Vogiatzis <michaelvogiatzis@gmail.com>

Closes #7229 from mvogiatzis/patch-1 and squashes the following commits:

e7a2946 [Michael Vogiatzis] Updated updateStateByKey text
00283ed [Michael Vogiatzis] Removed space
c2656f9 [Michael Vogiatzis] Moved description farther up
0a42551 [Michael Vogiatzis] Added important updateStateByKey details
2015-07-09 19:54:21 -07:00
Jonathan Alter 28fa01e2ba [SPARK-8927] [DOCS] Format wrong for some config descriptions
A couple descriptions were not inside `<td></td>` and were being displayed immediately under the section title instead of in their row.

Author: Jonathan Alter <jonalter@users.noreply.github.com>

Closes #7292 from jonalter/docs-config and squashes the following commits:

5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
2015-07-09 03:28:51 +01:00
Alok Singh 8f3cd93278 [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version

Author: Alok Singh <“singhal@us.ibm.com”>

Closes #7299 from aloknsingh/aloknsingh_SPARK-8909 and squashes the following commits:

d3c20ba [Alok Singh] fix the file to .parquet from .json
d476140 [Alok Singh] [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version
2015-07-08 14:51:18 -07:00
Feynman Liang c5532e2fe7 [SPARK-8457] [ML] NGram Documentation
Add documentation for NGram feature transformer.

Author: Feynman Liang <fliang@databricks.com>

Closes #7244 from feynmanliang/SPARK-8457 and squashes the following commits:

5aface9 [Feynman Liang] Pretty print Scala output and add API doc to each codetab
60d5ac0 [Feynman Liang] Inline API doc and fix indentation
736ccbc [Feynman Liang] NGram feature transformer documentation
2015-07-08 14:49:52 -07:00
Shivaram Venkataraman 374c8a8a4a [SPARK-8900] [SPARKR] Fix sparkPackages in init documentation
cc pwendell

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #7293 from shivaram/sparkr-packages-doc and squashes the following commits:

c91471d [Shivaram Venkataraman] Fix sparkPackages in init documentation
2015-07-08 12:39:32 -07:00
Sun Rui bf02e37716 [SPARK-8894] [SPARKR] [DOC] Example code errors in SparkR documentation.
Author: Sun Rui <rui.sun@intel.com>

Closes #7287 from sun-rui/SPARK-8894 and squashes the following commits:

da63898 [Sun Rui] [SPARK-8894][SPARKR][DOC] Example code errors in SparkR documentation.
2015-07-08 09:48:16 -07:00
Tijo Thomas 08192a1b8a [SPARK-8886][Documentation]python Style update
Fixed comment given by rxin

Author: Tijo Thomas <tijoparacka@gmail.com>

Closes #7281 from tijoparacka/modification_for_python_style and squashes the following commits:

6334e21 [Tijo Thomas] removed space
3de4cd8 [Tijo Thomas] python Style update
2015-07-07 22:35:39 -07:00
Mike Dusenberry 0a63d7ab8a [SPARK-8570] [MLLIB] [DOCS] Improve MLlib Local Matrix Documentation.
Updated MLlib Data Types Local Matrix section to include information on sparse matrices, added sparse matrix examples to the Scala and Java examples, and added Python examples for both dense and sparse matrices.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #6958 from dusenberrymw/Improve_MLlib_Local_Matrix_Documentation and squashes the following commits:

ceae407 [Mike Dusenberry] Updated MLlib Data Types Local Matrix section to include information on sparse matrices, added sparse matrix examples to the Scala and Java examples, and added Python examples for both dense and sparse matrices.
2015-07-07 08:24:52 -07:00
Alok Singh 6718c1eb67 [SPARK-5562] [MLLIB] LDA should handle empty document.
See the jira https://issues.apache.org/jira/browse/SPARK-5562

Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>
Author: Alok  Singh <singhal@aloks-mbp.usca.ibm.com>
Author: Alok Singh <“singhal@us.ibm.com”>

Closes #7064 from aloknsingh/aloknsingh_SPARK-5562 and squashes the following commits:

259a0a7 [Alok Singh] change as per the comments by @jkbradley
be48491 [Alok  Singh] [SPARK-5562][MLlib] re-order import in alphabhetical order
c01311b [Alok  Singh] [SPARK-5562][MLlib] fix the newline typo
b271c8a [Alok  Singh] [SPARK-5562][Mllib] As per github discussion with jkbradley. We would like to simply things.
7c06251 [Alok  Singh] [SPARK-5562][MLlib] modified the JavaLDASuite for test passing
c710cb6 [Alok  Singh] fix the scala code style to have space after :
2572a08 [Alok  Singh] [SPARK-5562][MLlib] change the import xyz._ to the import xyz.{c1, c2} ..
ab55fbf [Alok  Singh] [SPARK-5562][MLlib] Change as per Sean Owen's comments https://github.com/apache/spark/pull/7064/files#diff-9236d23975e6f5a5608ffc81dfd79146
9f4f9ea [Alok  Singh] [SPARK-5562][MLlib] LDA should handle empty document.
2015-07-06 21:53:55 -07:00
Ankur Chauhan 1165b17d24 [SPARK-6707] [CORE] [MESOS] Mesos Scheduler should allow the user to specify constraints based on slave attributes
Currently, the Mesos scheduler only looks at the 'cpu' and 'mem' resources when trying to determine the usability of a resource offer from a Mesos slave node. It may be preferable for the user to be able to ensure that the Spark jobs are only started on a certain set of nodes (based on attributes).

For example, if the user sets `spark.mesos.constraints` to `tachyon=true;us-east-1=false`, then resource offers are checked against both constraints and are accepted to start new executors only if they meet both.
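
As a rough sketch of that matching rule, the following plain-Python snippet parses a constraint string of the form shown above and checks an offer's attributes against it. It is an illustration only and not the scheduler's actual code: `parse_constraints` and `offer_matches` are hypothetical names, attributes are assumed to be plain strings, and only exact equality is checked.

```python
def parse_constraints(spec):
    """Parse 'attr=value;attr=value' into a dict of required values."""
    constraints = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate empty segments such as a trailing ';'
        name, sep, value = part.partition("=")
        if not sep or not name:
            raise ValueError(f"malformed constraint: {part!r}")
        constraints[name] = value
    return constraints


def offer_matches(offer_attributes, constraints):
    """Return True only if the offer satisfies every constraint."""
    return all(offer_attributes.get(name) == value
               for name, value in constraints.items())


constraints = parse_constraints("tachyon=true;us-east-1=false")
print(offer_matches({"tachyon": "true", "us-east-1": "false"}, constraints))  # True
print(offer_matches({"tachyon": "true", "us-east-1": "true"}, constraints))   # False
```

An offer that fails any single constraint is rejected, which mirrors the "meet both" wording in the description.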

Author: Ankur Chauhan <achauhan@brightcove.com>

Closes #5563 from ankurcha/mesos_attribs and squashes the following commits:

902535b [Ankur Chauhan] Fix line length
d83801c [Ankur Chauhan] Update code as per code review comments
8b73f2d [Ankur Chauhan] Fix imports
c3523e7 [Ankur Chauhan] Added docs
1a24d0b [Ankur Chauhan] Expand scope of attributes matching to include all data types
482fd71 [Ankur Chauhan] Update access modifier to private[this] for offer constraints
5ccc32d [Ankur Chauhan] Fix nit pick whitespace
1bce782 [Ankur Chauhan] Fix nit pick whitespace
c0cbc75 [Ankur Chauhan] Use offer id value for debug message
7fee0ea [Ankur Chauhan] Add debug statements
fc7eb5b [Ankur Chauhan] Fix import codestyle
00be252 [Ankur Chauhan] Style changes as per code review comments
662535f [Ankur Chauhan] Incorporate code review comments + use SparkFunSuite
fdc0937 [Ankur Chauhan] Decline offers that did not meet criteria
67b58a0 [Ankur Chauhan] Add documentation for spark.mesos.constraints
63f53f4 [Ankur Chauhan] Update codestyle - uniform style for config values
02031e4 [Ankur Chauhan] Fix scalastyle warnings in tests
c09ed84 [Ankur Chauhan] Fixed the access modifier on offerConstraints val to private[mesos]
0c64df6 [Ankur Chauhan] Rename overhead fractions to memory_*, fix spacing
8cc1e8f [Ankur Chauhan] Make exception message more explicit about the source of the error
addedba [Ankur Chauhan] Added test case for malformed constraint string
ec9d9a6 [Ankur Chauhan] Add tests for parse constraint string
72fe88a [Ankur Chauhan] Fix up tests + remove redundant method override, combine utility class into new mesos scheduler util trait
92b47fd [Ankur Chauhan] Add attributes based constraints support to MesosScheduler
2015-07-06 16:04:57 -07:00
Deron Eriksson fcbcba66c9 [SPARK-1564] [DOCS] Added Javascript to Javadocs to create badges for tags like :: Experimental ::
Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags

Author: Deron Eriksson <deron@us.ibm.com>

Closes #7169 from deroneriksson/SPARK-1564_JavaDocs_badges and squashes the following commits:

a8353db [Deron Eriksson] added license headers to api-docs.css and api-javadocs.css
07feb07 [Deron Eriksson] added linebreaks to make jquery more readable when adding html badge tags
65b4930 [Deron Eriksson] Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags
2015-07-02 13:55:53 -07:00
Yanbo Liang 0a468a46bf [SPARK-8758] [MLLIB] Add Python user guide for PowerIterationClustering
Add Python user guide for PowerIterationClustering

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7155 from yanboliang/spark-8758 and squashes the following commits:

18d803b [Yanbo Liang] address comments
dd29577 [Yanbo Liang] Add Python user guide for PowerIterationClustering
2015-07-02 09:59:54 -07:00
Ilya Ganelin 3697232b7d [SPARK-3071] Increase default driver memory
I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place.

Please let me know if I've missed anything.

Will the spark-shell use the value within the command line builder during instantiation?

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #7132 from ilganeli/SPARK-3071 and squashes the following commits:

4074164 [Ilya Ganelin] String fix
271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071
273b6e9 [Ilya Ganelin] Test fix
fd67721 [Ilya Ganelin] Update JavaUtils.java
26cc177 [Ilya Ganelin] test fix
e5db35d [Ilya Ganelin] Fixed test failure
39732a1 [Ilya Ganelin] merge fix
a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each
09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala
19b6f25 [Ilya Ganelin] Missed one doc update
2698a3d [Ilya Ganelin] Updated default value for driver memory
2015-07-01 23:11:02 -07:00
zsxwing 75b9fe4c5f [SPARK-8378] [STREAMING] Add the Python API for Flume
Author: zsxwing <zsxwing@gmail.com>

Closes #6830 from zsxwing/flume-python and squashes the following commits:

78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
2015-07-01 11:59:24 -07:00
Yuhao Yang 2012913355 [SPARK-8308] [MLLIB] add missing save load for python example
jira: https://issues.apache.org/jira/browse/SPARK-8308

1. add some missing save/load calls to the Python examples for LogisticRegression, LinearRegression and NaiveBayes
2. tune down iterations for MatrixFactorization, since the current number will trigger a StackOverflow with the default Java configuration (>1M)

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6760 from hhbyyh/docUpdate and squashes the following commits:

9bd3383 [Yuhao Yang] update scala example
8a44692 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
077cbb8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
3e948dc [Yuhao Yang] add missing save load for python example
2015-07-01 11:17:56 -07:00
sethah 8d23587f1d [SPARK-7739] [MLLIB] Improve ChiSqSelector example code in user guide
Author: sethah <seth.hendrickson16@gmail.com>

Closes #7029 from sethah/working_on_SPARK-7739 and squashes the following commits:

ef96916 [sethah] Fixing some style issues
efea1f8 [sethah] adding clarification to ChiSqSelector example
2015-06-30 16:28:25 -07:00
Tijo Thomas 9213f73a8e [SPARK-8615] [DOCUMENTATION] Fixed Sample deprecated code
Modified the deprecated jdbc api in the documentation.

Author: Tijo Thomas <tijoparacka@gmail.com>

Closes #7039 from tijoparacka/JIRA_8615 and squashes the following commits:

6e73b8a [Tijo Thomas] Reverted new lines
4042fcf [Tijo Thomas] updated to sql documentation
a27949c [Tijo Thomas] Fixed Sample deprecated code
2015-06-30 10:50:45 -07:00
MechCoder 45281664e0 [SPARK-4127] [MLLIB] [PYSPARK] Python bindings for StreamingLinearRegressionWithSGD
Python bindings for StreamingLinearRegressionWithSGD

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6744 from MechCoder/spark-4127 and squashes the following commits:

d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
1b4ddd6 [MechCoder] minor
4de6c68 [MechCoder] Minor refactor
5e85a3b [MechCoder] Add tests for simultaneous training and prediction
fb27889 [MechCoder] Add example and docs
505380b [MechCoder] Add tests
d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
2015-06-30 10:25:59 -07:00
Neelesh Srinivas Salian d48e78934a [SPARK-3629] [YARN] [DOCS]: Improvement of the "Running Spark on YARN" document
As per the description in the JIRA, I moved the contents of the page and added some additional content.

Author: Neelesh Srinivas Salian <nsalian@cloudera.com>

Closes #6924 from nssalian/SPARK-3629 and squashes the following commits:

944b7a0 [Neelesh Srinivas Salian] Changed the lines about deploy-mode and added backticks to all parameters
40dbc0b [Neelesh Srinivas Salian] Changed dfs to HDFS, deploy-mode in backticks and updated the master yarn line
9cbc072 [Neelesh Srinivas Salian] Updated a few lines in the Launching Spark on YARN Section
8e8db7f [Neelesh Srinivas Salian] Removed the changes in this commit to help clearly distinguish movement from update
151c298 [Neelesh Srinivas Salian] SPARK-3629: Improvement of the Spark on YARN document
2015-06-27 09:07:10 +03:00
Rosstin b5a6663da2 [SPARK-8639] [DOCS] Fixed Minor Typos in Documentation
Ticket: [SPARK-8639](https://issues.apache.org/jira/browse/SPARK-8639)

fixed minor typos in docs/README.md and docs/api.md

Author: Rosstin <asterazul@gmail.com>

Closes #7046 from Rosstin/SPARK-8639 and squashes the following commits:

6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
2015-06-27 08:47:00 +03:00
Marcelo Vanzin 37bf76a2de [SPARK-8302] Support heterogeneous cluster install paths on YARN.
Some users have Hadoop installations on different paths across
their cluster. Currently, that makes it hard to set up some
configuration in Spark since that requires hardcoding paths to
jar files or native libraries, which wouldn't work on such a cluster.

This change introduces a couple of YARN-specific configurations
that instruct the backend to replace certain paths when launching
remote processes. That way, if the configuration says the Spark
jar is in "/spark/spark.jar", and also says that "/spark" should be
replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers
in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location
of the jar.

Coupled with YARN's environment whitelist (which allows certain
env variables to be exposed to containers), this allows users to
support such heterogeneous environments, as long as a single
replacement is enough. (Otherwise, this feature would need to be
extended to support multiple path replacements.)
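
The single-prefix replacement described above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the YARN backend's code: `replace_prefix` is a hypothetical helper, and the configuration names introduced by the change are deliberately not reproduced here.

```python
def replace_prefix(path, local_prefix, replacement):
    """Swap local_prefix for replacement when path lies under local_prefix.

    Paths outside the prefix are returned unchanged, so only paths that
    need the replacement are touched.
    """
    prefix = local_prefix.rstrip("/")
    if path == prefix or path.startswith(prefix + "/"):
        return replacement + path[len(prefix):]
    return path


print(replace_prefix("/spark/spark.jar", "/spark", "{{SPARK_INSTALL_DIR}}"))
# {{SPARK_INSTALL_DIR}}/spark.jar
print(replace_prefix("/sparkling/other.jar", "/spark", "{{SPARK_INSTALL_DIR}}"))
# /sparkling/other.jar
```

Matching on a full path component (the prefix followed by `/`) avoids rewriting unrelated paths that merely share leading characters.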

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6752 from vanzin/SPARK-8302 and squashes the following commits:

4bff8d4 [Marcelo Vanzin] Add docs, rename configs.
0aa2a02 [Marcelo Vanzin] Only do replacement for paths that need it.
2e9cc9d [Marcelo Vanzin] Style.
a5e1f68 [Marcelo Vanzin] [SPARK-8302] Support heterogeneous cluster install paths on YARN.
2015-06-26 08:45:22 -05:00
Holden Karau 43e66192f4 [SPARK-8506] Add pakages to R context created through init.
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6928 from holdenk/SPARK-8506-sparkr-does-not-provide-an-easy-way-to-depend-on-spark-packages-when-performing-init-from-inside-of-r and squashes the following commits:

b60dd63 [Holden Karau] Add an example with the spark-csv package
fa8bc92 [Holden Karau] typo: sparm -> spark
865a90c [Holden Karau] strip spaces for comparison
c7a4471 [Holden Karau] Add some documentation
c1a9233 [Holden Karau] refactor for testing
c818556 [Holden Karau] Add packages to R
2015-06-24 11:55:20 -07:00
Cheng Lian 111d6b9b8a [SPARK-8139] [SQL] Updates docs and comments of data sources and Parquet output committer options
This PR only applies to the master branch (1.5.0-SNAPSHOT) since it references `org.apache.parquet` classes, which only appear in Parquet 1.7.0.

Author: Cheng Lian <lian@databricks.com>

Closes #6683 from liancheng/output-committer-docs and squashes the following commits:

b4648b8 [Cheng Lian] Removes spark.sql.sources.outputCommitterClass as it's not a public option
ee63923 [Cheng Lian] Updates docs and comments of data sources and Parquet output committer options
2015-06-23 17:24:26 -07:00
Cheng Lian d96d7b5574 [DOC] [SQL] Adds Hive metastore Parquet table conversion section
This PR adds a section about Hive metastore Parquet table conversion. It documents:

1. Schema reconciliation rules introduced in #5214 (see [this comment] [1] in #5188)
2. Metadata refreshing requirement introduced in #5339

[1]: https://github.com/apache/spark/pull/5188#issuecomment-86531248

Author: Cheng Lian <lian@databricks.com>

Closes #5348 from liancheng/sql-doc-parquet-conversion and squashes the following commits:

42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
50675db [Cheng Lian] Adds Hive metastore Parquet table conversion section
2015-06-23 14:19:21 -07:00
Joseph K. Bradley a1894422ad [SPARK-7715] [MLLIB] [ML] [DOC] Updated MLlib programming guide for release 1.4
Reorganized docs a bit.  Added migration guides.

**Q**: Do we want to say more for the 1.3 -> 1.4 migration guide for `spark.ml`? It would be a lot.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6897 from jkbradley/ml-guide-1.4 and squashes the following commits:

4bf26d6 [Joseph K. Bradley] tiny fix
8085067 [Joseph K. Bradley] fixed spacing/layout issues in ml guide from previous commit in this PR
6cd5c78 [Joseph K. Bradley] Updated MLlib programming guide for release 1.4
2015-06-21 16:25:25 -07:00
cody koeninger b305e377fb [SPARK-8390] [STREAMING] [KAFKA] fix docs related to HasOffsetRanges
Author: cody koeninger <cody@koeninger.org>

Closes #6863 from koeninger/SPARK-8390 and squashes the following commits:

26a06bd [cody koeninger] Merge branch 'master' into SPARK-8390
3744492 [cody koeninger] [Streaming][Kafka][SPARK-8390] doc changes per TD, test to make sure approach shown in docs actually compiles + runs
b108c9d [cody koeninger] [Streaming][Kafka][SPARK-8390] further doc fixes, clean up spacing
bb4336b [cody koeninger] [Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup
3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
2015-06-19 17:18:31 -07:00
MechCoder 54976e55e3 [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeans
Python bindings for StreamingKMeans

Will change status to MRG once docs, tests and examples are updated.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6499 from MechCoder/spark-4118 and squashes the following commits:

7722d16 [MechCoder] minor style fixes
51052d3 [MechCoder] Doc fixes
2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
81482fd [MechCoder] minor
5d9fe61 [MechCoder] predictOn should take into account the latest model
8ab9e89 [MechCoder] Fix Python3 error
a9817df [MechCoder] Better tests and minor fixes
c80e451 [MechCoder] Add ignore_unicode_prefix
ee8ce16 [MechCoder] Update tests, doc and examples
4b1481f [MechCoder] Some changes and tests
d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
2015-06-19 12:23:15 -07:00
Sean Owen 4be53d0395 [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
2015-06-19 11:03:04 -07:00
Jihong MA ebd363aecd [SPARK-7265] Improving documentation for Spark SQL Hive support
Please review this pull request.

Author: Jihong MA <linlin200605@gmail.com>

Closes #5933 from JihongMA/SPARK-7265 and squashes the following commits:

dfaa971 [Jihong MA] SPARK-7265 minor fix of the content
ace454d [Jihong MA] SPARK-7265 take out PySpark on YARN limitation
9ea0832 [Jihong MA] Merge remote-tracking branch 'upstream/master'
d5bf3f5 [Jihong MA] Merge remote-tracking branch 'upstream/master'
7b842e6 [Jihong MA] Merge remote-tracking branch 'upstream/master'
9c84695 [Jihong MA] SPARK-7265 address review comment
a399aa6 [Jihong MA] SPARK-7265 Improving documentation for Spark SQL Hive support
2015-06-19 14:06:49 +02:00
Lars Francke 4ce3bab89f [SPARK-8462] [DOCS] Documentation fixes for Spark SQL
This fixes various minor documentation issues on the Spark SQL page

Author: Lars Francke <lars.francke@gmail.com>

Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:

dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes
2015-06-18 19:40:32 -07:00
zsxwing 24e53793b4 [SPARK-8376] [DOCS] Add Commons Lang 3 to the Spark Flume Sink doc
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.

Author: zsxwing <zsxwing@gmail.com>

Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:

f8617f0 [zsxwing] Add Commons Lang 3 to the Spark Flume Sink doc
2015-06-18 16:00:27 -07:00
Josh Rosen 44c931f006 [SPARK-8353] [DOCS] Show anchor links when hovering over documentation headers
This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example:

![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png)

This makes it easier for users to link to specific sections of the documentation.

I also removed some dead JavaScript that isn't used in our current docs (it was introduced for the old AMPCamp training, but isn't used anymore).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6808 from JoshRosen/SPARK-8353 and squashes the following commits:

e59d8a7 [Josh Rosen] Suppress underline on hover
f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places
a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code
2015-06-18 15:10:09 -07:00
Neelesh Srinivas Salian ddc5baf17d [SPARK-8320] [STREAMING] Add example in streaming programming guide that shows union of multiple input streams
Added Python code to the Level of Parallelism in Data Receiving section of
https://spark.apache.org/docs/latest/streaming-programming-guide.html.

Please review and let me know if there are any additional changes that are needed.

Thank you.

Author: Neelesh Srinivas Salian <nsalian@cloudera.com>

Closes #6862 from nssalian/SPARK-8320 and squashes the following commits:

4bfd126 [Neelesh Srinivas Salian] Changed loop structure to be more in line with Python style
e5345de [Neelesh Srinivas Salian] Changes to kafka append, for loop and show to print()
3fc5c6d [Neelesh Srinivas Salian] SPARK-8320
2015-06-18 09:44:36 -07:00
zsxwing 78a430ea4d [SPARK-7961][SQL]Refactor SQLConf to display better error message
1. Add `SQLConfEntry` to store the information about a configuration. For those configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`.
2. Verify the value when setting a configuration if it is defined in SQLConf.
3. Use `SET -v` to display all public configurations.
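
The three points above can be sketched roughly as follows. This is a simplified Python illustration of the design, not the Scala code in this PR: `ConfEntry`, `Conf`, `to_bool`, and the key `example.flag` are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generic, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ConfEntry(Generic[T]):
    """Hypothetical stand-in for an entry that describes one configuration."""
    key: str
    default: T
    converter: Callable[[str], T]  # parses and validates the raw string
    doc: str = "<TODO>"


def to_bool(raw: str) -> bool:
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise ValueError(f"{raw!r} is not a boolean")


class Conf:
    def __init__(self, entries: List[ConfEntry]) -> None:
        self._entries: Dict[str, ConfEntry] = {e.key: e for e in entries}
        self._settings: Dict[str, str] = {}

    def set_conf_string(self, key: str, raw: str) -> None:
        # Validate only keys that have a registered entry; unknown keys pass through.
        entry = self._entries.get(key)
        if entry is not None:
            try:
                entry.converter(raw)
            except ValueError as exc:
                raise ValueError(f"{key} has an invalid value: {raw!r}") from exc
        self._settings[key] = raw

    def get_conf(self, entry: ConfEntry[T]) -> T:
        # Fall back to the entry's default when the key has not been set.
        raw = self._settings.get(entry.key)
        return entry.default if raw is None else entry.converter(raw)

    def describe_all(self) -> List[Tuple[str, object, str]]:
        # Rough analogue of listing every registered configuration with its doc.
        return [(e.key, e.default, e.doc)
                for e in sorted(self._entries.values(), key=lambda e: e.key)]


flag = ConfEntry("example.flag", False, to_bool, "A hypothetical boolean option.")
conf = Conf([flag])
conf.set_conf_string("example.flag", "true")
print(conf.get_conf(flag))
```

Keeping the parser, default, and documentation on the entry itself means a bad value can be rejected at the moment it is set, with an error that names the key, rather than failing later when the value is read.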

Author: zsxwing <zsxwing@gmail.com>

Closes #6747 from zsxwing/sqlconf and squashes the following commits:

7d09bad [zsxwing] Use SQLConfEntry in HiveContext
49f6213 [zsxwing] Add getConf, setConf to SQLContext and HiveContext
e014f53 [zsxwing] Merge branch 'master' into sqlconf
93dad8e [zsxwing] Fix the unit tests
cf950c1 [zsxwing] Fix the code style and tests
3c5f03e [zsxwing] Add unsetConf(SQLConfEntry) and fix the code style
a2f4add [zsxwing] getConf will return the default value if a config is not set
037b1db [zsxwing] Add schema to SetCommand
0520c3c [zsxwing] Merge branch 'master' into sqlconf
7afb0ec [zsxwing] Fix the configurations about HiveThriftServer
7e728e3 [zsxwing] Add doc for SQLConfEntry and fix 'toString'
5e95b10 [zsxwing] Add enumConf
c6ba76d [zsxwing] setRawString => setConfString, getRawString => getConfString
4abd807 [zsxwing] Fix the test for 'set -v'
6e47e56 [zsxwing] Fix the compilation error
8973ced [zsxwing] Remove floatConf
1fc3a8b [zsxwing] Remove the 'conf' command and use 'set -v' instead
99c9c16 [zsxwing] Fix tests that use SQLConfEntry as a string
88a03cc [zsxwing] Add new lines between confs and return types
ce7c6c8 [zsxwing] Remove seqConf
f3c1b33 [zsxwing] Refactor SQLConf to display better error message
2015-06-17 23:22:54 -07:00
MechCoder 22732e1eca [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct
Python API for org.apache.spark.mllib.feature.ElementwiseProduct
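
For readers unfamiliar with the transformer, the operation it performs is an element-wise (Hadamard) product between a fixed scaling vector and each input vector. The sketch below shows that arithmetic in plain Python; it does not use the PySpark API added here, and the function name `elementwise_product` is a hypothetical one chosen for the illustration.

```python
from typing import List, Sequence


def elementwise_product(scaling: Sequence[float], vector: Sequence[float]) -> List[float]:
    """Multiply each component of `vector` by the matching component of `scaling`.

    Plain-Python illustration only; not the PySpark implementation.
    """
    if len(scaling) != len(vector):
        raise ValueError("scaling vector and input vector must have the same length")
    return [s * v for s, v in zip(scaling, vector)]


print(elementwise_product([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
# prints [0.0, 2.0, 6.0]
```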

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6346 from MechCoder/spark-7605 and squashes the following commits:

79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
2015-06-17 22:08:38 -07:00