and some minor changes in ScalaDoc.
Author: Xiangrui Meng <meng@databricks.com>
Closes#3601 from mengxr/SPARK-4575-fix and squashes the following commits:
c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md
Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)
Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)
CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#3588 from jkbradley/ml-package-docs and squashes the following commits:
d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
When you use the SPARK_JAVA_OPTS env variable, Spark complains:
```
SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with conf/spark-defaults.conf to set defaults for an application
- ./spark-submit with --driver-java-options to set -X options for a driver
- spark.executor.extraJavaOptions to set -X options for executors
- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
```
This updates the docs to redirect the user to the relevant part of the configuration docs.
CC: mengxr but please CC someone else as needed
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3592 from jkbradley/tuning-doc and squashes the following commits:
0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)
Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
* Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming
I have run all examples and relevant unit tests.
CC: mengxr manishamde codedeft
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes#3461 from jkbradley/ensemble-docs and squashes the following commits:
70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples
I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3569 from jkbradley/lr-doc and squashes the following commits:
654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
5035ad0 [Joseph K. Bradley] updated based on review
94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
Added descriptions about these parameters.
- spark.yarn.queue
Modified description about the defalut value of this parameter.
- spark.yarn.submit.file.replication
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes#3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:
ce99655 [Masayoshi TSUZUKI] better gramatically.
21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update
If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add it to the container.
This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container.
Author: Jim Lim <jim@quixey.com>
Closes#3238 from jimjh/SPARK-2624 and squashes the following commits:
3633071 [Jim Lim] SPARK-2624 update documentation and comments
fe95125 [Jim Lim] SPARK-2624 keep java imports together
6c31fe0 [Jim Lim] SPARK-2624 update documentation
6690fbf [Jim Lim] SPARK-2624 add tests
d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option
84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
The link points to the old scala programming guide; it should point to the submitting applications page.
This should be backported to 1.1.2 (it's been broken as of 1.0).
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#3542 from kayousterhout/SPARK-4686 and squashes the following commits:
a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3535 from adrian-wang/datedoc and squashes the following commits:
18ff1ed [Daoyuan Wang] [DOC] Date type
Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3440)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:
2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3498)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3498 from liancheng/fix-sql-doc-typo and squashes the following commits:
865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide
Grammatical error in Programming Guide document
Author: lewuathe <lewuathe@me.com>
Closes#3412 from Lewuathe/typo-programming-guide and squashes the following commits:
a3e2f00 [lewuathe] Typo in Programming Guide markdown
To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before Maven execute, otherwise inter-module dependencies are broken.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:
1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
Warn against subclassing scala.App, and remove one instance of this in examples
Author: Sean Owen <sowen@cloudera.com>
Closes#3497 from srowen/SPARK-4170 and squashes the following commits:
4a6131f [Sean Owen] Restore multiline string formatting
a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
https://issues.apache.org/jira/browse/SPARK-3628
In current implementation, the accumulator will be updated for every successfully finished task, even the task is from a resubmitted stage, which makes the accumulator counter-intuitive
In this patch, I changed the way for the DAGScheduler to update the accumulator,
DAGScheduler maintains a HashTable, mapping the stage id to the received <accumulator_id , value> pairs. Only when the stage becomes independent, (no job needs it any more), we accumulate the values of the <accumulator_id , value> pairs, when a task finished, we check if the HashTable has contained such stageId, it saves the accumulator_id, value only when the task is the first finished task of a new stage or the stage is running for the first attempt...
Author: CodingCat <zhunansjtu@gmail.com>
Closes#2524 from CodingCat/SPARK-732-1 and squashes the following commits:
701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted:
SPARK-2333: 94053a7b76
SPARK-3213: 7faf755ae4
SPARK-3608: 78d4220fa0.
I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1:
https://github.com/apache/spark/pull/2225/files
JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution.
Author: Xiangrui Meng <meng@databricks.com>
Closes#3453 from mengxr/SPARK-4509 and squashes the following commits:
f0b708b [Xiangrui Meng] revert 94053a7b76
4298ea5 [Xiangrui Meng] revert 7faf755ae4
35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds"
The documentation points the user to run the following
```
sbin/start-history-server.sh
```
The first thing this does is throw an exception that complains a log directory is not specified. The exception message itself does not say anything about what to set. Instead we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`.
This is what it looks like as of this PR:
![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)
Author: Andrew Or <andrew@databricks.com>
Closes#3411 from andrewor14/minor-history-improvements and squashes the following commits:
f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist
fc4c17a [Andrew Or] Improve HistoryServer UX
The documentation for the two parameters is the same with a pointer from the standalone parameter to the yarn parameter
Author: arahuja <aahuja11@gmail.com>
Closes#3209 from arahuja/yarn-classpath-first-param and squashes the following commits:
51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
Author: wangfei <wangfei1@huawei.com>
Closes#3335 from scwf/patch-10 and squashes the following commits:
d343113 [wangfei] add '-Phive'
60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support
Author: Sandy Ryza <sandy@cloudera.com>
Closes#3322 from sryza/sandy-spark-4457 and squashes the following commits:
5e72b77 [Sandy Ryza] Feedback
0cf05c1 [Sandy Ryza] Caveat
be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4
Author: Davies Liu <davies@databricks.com>
Closes#3388 from davies/doc_readme and squashes the following commits:
daa1482 [Davies Liu] add Sphinx dependency
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes#3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
4421964 [Joseph E. Gonzalez] updating documentation for graphx
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#3277 from vanzin/version-1.3 and squashes the following commits:
7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).
**The solution implemented here is only a partial fix.** A complete fix would have the following properties:
1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.
This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.
### The correct solution:
I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object. Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.). Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor. For example:
```scala
class SparkContext private (deps: SparkContextDependencies) {
def this(conf: SparkConf) {
this(SparkContext.getDeps(conf))
}
}
object SparkContext(
private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
if (anotherSparkContextIsActive) { throw Exception(...) }
var dagScheduler: DAGScheduler = null
try {
dagScheduler = new DAGScheduler(...)
[...]
} catch {
case e: Exception =>
Option(dagScheduler).foreach(_.stop())
[...]
}
SparkContextDependencies(dagScheduler, ....)
}
}
```
This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.
This indirection is necessary to maintain binary compatibility. In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.
### Alternative solutions:
As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block. Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block. If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.
The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.
### This PR's solution:
- At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
- If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
- At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context. If so, throw an exception.
This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor). If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.
This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`. The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts. I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#3121 from JoshRosen/SPARK-4180 and squashes the following commits:
23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
d38251b [Josh Rosen] Address latest round of feedback.
c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
85a424a [Josh Rosen] Incorporate more review feedback.
372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
f5bb78c [Josh Rosen] Update mvn build, too.
d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
79a7e6f [Josh Rosen] Fix commented out test
a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
Author: Andy Konwinski <andykonwinski@gmail.com>
Closes#3323 from andyk/patch-2 and squashes the following commits:
4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc
Author: zsxwing <zsxwing@gmail.com>
Closes#3226 from zsxwing/SPARK-4363 and squashes the following commits:
8109914 [zsxwing] Update the Broadcast example
It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like.
Author: Sandy Ryza <sandy@cloudera.com>
Closes#3239 from sryza/sandy-spark-4375 and squashes the following commits:
0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
cd42d94 [Sandy Ryza] Update doc
f6644c3 [Sandy Ryza] SPARK-4375 take 2
https://issues.apache.org/jira/browse/SPARK-3722
Author: WangTao <barneystinson@aliyun.com>
Closes#2579 from WangTaoTheTonic/docsWork and squashes the following commits:
6f91cec [WangTao] use more wording express
29d22fa [WangTao] delete the specified version link
34cb4ea [WangTao] Update running-on-yarn.md
4ee1a26 [WangTao] minor improvement and fix in docs
Let's give this another go using a version of Hive that shades its JLine dependency.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes#3159 from pwendell/scala-2.11-prashant and squashes the following commits:
e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
In running-on-yarn.md, a link to YARN overview is here.
But the URL is to YARN alpha's.
It should be stable's.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3196 from sarutak/SPARK-4330 and squashes the following commits:
30baa21 [Kousuke Saruta] Fixed running-on-yarn.md to point proper URL for YARN
Author: Sandy Ryza <sandy@cloudera.com>
Closes#3107 from sryza/sandy-spark-4230 and squashes the following commits:
37a1d19 [Sandy Ryza] Clear up a couple things
34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect
This is a trivial change to add links to the wiki from `README.md` and the main docs page. It is already linked to from spark.apache.org.
Author: Sean Owen <sowen@cloudera.com>
Closes#3169 from srowen/SPARK-971 and squashes the following commits:
dcb84d0 [Sean Owen] Add link to wiki from README, docs home page
Author: wangfei <wangfei1@huawei.com>
Closes#3127 from scwf/patch-9 and squashes the following commits:
e39a560 [wangfei] now support dynamic partitioning
This is a minor docs update which helps to clarify the way local[n] is used for streaming apps.
Author: jay@apache.org <jayunit100>
Closes#2964 from jayunit100/SPARK-4040 and squashes the following commits:
35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value.
```
pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None)
:: Experimental ::
If `observed` is Vector, conduct Pearson's chi-squared goodness
of fit test of the observed data against the expected distribution,
or againt the uniform distribution (by default), with each category
having an expected frequency of `1 / len(observed)`.
(Note: `observed` cannot contain negative values)
If `observed` is matrix, conduct Pearson's independence test on the
input contingency matrix, which cannot contain negative entries or
columns or rows that sum up to 0.
If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
test for every feature against the label across the input RDD.
For each feature, the (feature, label) pairs are converted into a
contingency matrix for which the chi-squared statistic is computed.
All label and feature values must be categorical.
:param observed: it could be a vector containing the observed categorical
counts/relative frequencies, or the contingency matrix
(containing either counts or relative frequencies),
or an RDD of LabeledPoint containing the labeled dataset
with categorical features. Real-valued features will be
treated as categorical for each distinct value.
:param expected: Vector containing the expected categorical counts/relative
frequencies. `expected` is rescaled if the `expected` sum
differs from the `observed` sum.
:return: ChiSquaredTest object containing the test statistic, degrees
of freedom, p-value, the method used, and the null hypothesis.
```
Author: Davies Liu <davies@databricks.com>
Closes#3091 from davies/his and squashes the following commits:
145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
the filter tests Double objects by references whereas it should test their values
Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>
Closes#3081 from dkobylarz/master and squashes the following commits:
5d43a39 [Dariusz Kobylarz] naive bayes example update
a304b93 [Dariusz Kobylarz] fixed MLlib Naive-Bayes java example bug
- Turns on compression for in-memory cached data by default
- Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
- Ups the batch size to 10,000 rows
- Increases the broadcast threshold to 10mb.
- Uses our parquet implementation instead of the hive one by default.
- Cache parquet metadata by default.
Author: Michael Armbrust <michael@databricks.com>
Closes#3064 from marmbrus/fasterDefaults and squashes the following commits:
97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults
Author: wangfei <wangfei1@huawei.com>
Closes#3042 from scwf/patch-9 and squashes the following commits:
3784ed1 [wangfei] remove 'TODO'
1891553 [wangfei] update build doc since JDBC/CLI support hive 13
Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
Author: Aaron Davidson <aaron@databricks.com>
Closes#3049 from aarondav/enable-netty and squashes the following commits:
bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too many data.
This PR will introduce spark.driver.maxResultSize, after setting it, the driver will abort a job if its result is bigger than it.
By default, it's 1g (for backward compatibility for most the cases).
In local mode, the driver and executor share the same JVM, the default setting can not protect JVM from OOM.
cc mateiz
Author: Davies Liu <davies@databricks.com>
Closes#3003 from davies/collect and squashes the following commits:
248ed5e [Davies Liu] fix compile
272522e [Davies Liu] address comments
2c35773 [Davies Liu] add sizes in message of abort()
5d62303 [Davies Liu] address comments
bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
11f97c5 [Davies Liu] address comments
47b144f [Davies Liu] check the size of result before send and fetch
3d81af2 [Davies Liu] address comments
ca8267d [Davies Liu] limit the size of data by collect
Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
Author: Aaron Davidson <aaron@databricks.com>
Closes#3049 from aarondav/enable-netty and squashes the following commits:
bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.
The PR includes:
- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors
tdas mengxr rezazadeh
Author: freeman <the.freeman.lab@gmail.com>
Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#2942 from freeman-lab/streaming-kmeans and squashes the following commits:
b2e5b4a [freeman] Fixes to docs / examples
078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
0411bf5 [freeman] Change decay parameterization
9f7aea9 [freeman] Style fixes
374a706 [freeman] Formatting
ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
77dbd3f [freeman] Make initialization check an assertion
9cfc301 [freeman] Make random seed an argument
44050a9 [freeman] Simpler constructor
c7050d5 [freeman] Fix spacing
2899623 [freeman] Use pattern matching for clarity
a4a316b [freeman] Use collect
1472ec5 [freeman] Doc formatting
ea22ec8 [freeman] Fix imports
2086bdc [freeman] Log cluster center updates
ea9877c [freeman] More documentation
9facbe3 [freeman] Bug fix
5db7074 [freeman] Example usage for StreamingKMeans
f33684b [freeman] Add explanation and example to docs
b5b5f8d [freeman] Add better documentation
a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
b93350f [freeman] Streaming KMeans with decay
This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838
Python example for word2vec
mengxr
Author: Anant <anant.asty@gmail.com>
Closes#2952 from anantasty/SPARK-3838 and squashes the following commits:
87bd723 [Anant] remove stop line
4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs.
3d3c9ee [Anant] Added empty line after python imports
0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
ee4f5f6 [Anant] Fixes from code review comments
c637bcf [Anant] Added word2vec python example to docs
269f31f [Anant] added example in docs
c015b14 [Anant] Added python example for word2vec
The version number of Spark in docs/_config.yaml for master branch should be 1.2.0 for now.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2943 from sarutak/SPARK-4089 and squashes the following commits:
aba7fb4 [Kousuke Saruta] Fixed the version number of Spark in _config.yaml