We should use ```coefficients``` rather than ```weights``` in user guide that freshman can get the right conventional name at the outset. mengxr vectorijk
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9493 from yanboliang/docs-coefficients.
Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9450 from JoshRosen/upgrade-to-scala-2.10.5.
The trim_codeblock(lines) function in include_example.rb removes some blank lines in the code.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9400 from yinxusen/SPARK-11443.
I have made the required changes in mllib-naive-bayes.md/mllib-isotonic-regression.md and also verified them.
Kindle Review it.
Author: Rishabh Bhardwaj <rbnext29@gmail.com>
Closes#9353 from rishabhbhardwaj/SPARK-11383.
Remove Hadoop third party distro page, and move Hadoop cluster config info to configuration page
CC pwendell
Author: Sean Owen <sowen@cloudera.com>
Closes#9298 from srowen/SPARK-11305.
Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.
shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9290 from felixcheung/rdrivermem.
Recall by threshold snippet was using "precisionByThreshold"
Author: Mageswaran.D <mageswaran1989@gmail.com>
Closes#9333 from Mageswaran1989/Typo_in_mllib-evaluation-metrics.md.
mengxr https://issues.apache.org/jira/browse/SPARK-11289
I make some changes in ML feature extractors. I.e. TF-IDF, Word2Vec, and CountVectorizer. I add new example code in spark/examples, hope it is the right place to add those examples.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9266 from yinxusen/SPARK-11289.
The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9269 from JoshRosen/SPARK-11299.
A POC code for making example code in user guide testable.
mengxr We still need to talk about the labels in code.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9109 from yinxusen/SPARK-10382.
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8829 from JoshRosen/consolidate-sort-shuffle-implementations.
Currently log4j.properties file is not uploaded to executor's which is leading them to use the default values. This fix will make sure that file is always uploaded to distributed cache so that executor will use the latest settings.
If user specifies log configurations through --files then executors will be picking configs from --files instead of $SPARK_CONF_DIR/log4j.properties
Author: vundela <vsr@cloudera.com>
Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
Closes#9118 from vundela/master.
This patch fixes a small typo in the GraphX programming guide
Author: Lukasz Piepiora <lpiepiora@gmail.com>
Closes#9160 from lpiepiora/11174-fix-typo-in-graphx-programming-guide.
Add documentation for configuration:
- spark.sql.ui.retainedExecutions
- spark.streaming.ui.retainedBatches
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes#9052 from pnpritchard/SPARK-11039.
This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
- **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
- **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
- **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
Author: Andrew Or <andrew@databricks.com>
Closes#9084 from andrewor14/unified-memory-manager.
Add application attempt window for Spark on Yarn to ignore old out of window failures, this is useful for long running applications to recover from failures.
Author: jerryshao <sshao@hortonworks.com>
Closes#8857 from jerryshao/SPARK-10739 and squashes the following commits:
36eabdc [jerryshao] change the doc
7f9b77d [jerryshao] Style change
1c9afd0 [jerryshao] Address the comments
caca695 [jerryshao] Add application attempt window for Spark on Yarn
This commit improves the documentation around building Spark to
(1) recommend using SBT interactive mode to avoid the overhead of
launching SBT and (2) refer to the wiki page that documents using
SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each
compile.
cc srowen
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#9068 from kayousterhout/SPARK-11056.
1.4 docs noted that the units were MB - i have assumed this is still the case
Author: admackin <admackin@users.noreply.github.com>
Closes#9025 from admackin/master.
In the Markdown docs for the spark.mllib Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "ChiSqSelector" section in 64743870f2/docs/mllib-feature-extraction.md
This JIRA is just for spark.mllib, not spark.ml.
Please let me know if more work is needed, thanks a lot.
Author: Xin Ren <iamshrek@126.com>
Closes#8977 from keypointt/SPARK-10669.
Recommend `--master yarn --deploy-mode {cluster,client}` consistently in docs.
Follow-on to https://github.com/apache/spark/pull/8385
CC nssalian
Author: Sean Owen <sowen@cloudera.com>
Closes#8968 from srowen/SPARK-9570.
jira: https://issues.apache.org/jira/browse/SPARK-10670
In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in 64743870f2/docs/ml-features.md
This JIRA is just for spark.ml, not spark.mllib
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#8901 from hhbyyh/docAPI.
The Scala example under the "Example: Pipeline" heading in this
document initializes the "test" variable to a DataFrame. Because test
is already a DF, there is not need to call test.toDF as the example
does in a subsequent line: model.transform(test.toDF). So, I removed
the extraneous toDF invocation.
Author: Matt Hagen <anonz3000@gmail.com>
Closes#8875 from hagenhaus/SPARK-10663.
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8831 from JoshRosen/remove-ability-to-disable-spilling.
In Spark 1.5.0, Spark SQL is compatible with Hive 0.12.0 through 1.2.1 but the documentation is wrong.
/CC yhuai
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#8776 from sarutak/SPARK-10584-2.
The [published docs for 1.5.0](http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/) have a bunch of test classes in them. The only way I can reproduce this is to `test:compile` before running `unidoc`. To prevent this from happening again, I've added a clean before doc generation.
Author: Michael Armbrust <michael@databricks.com>
Closes#8787 from marmbrus/testsInDocs.
In the Configuration section, the **spark.yarn.driver.memoryOverhead** and **spark.yarn.am.memoryOverhead**‘s default value should be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum of 384" respectively. Because from Spark 1.4.0, the **MEMORY_OVERHEAD_FACTOR** is set to 0.1.0, not 0.07.
Author: yangping.wu <wyphao.2007@163.com>
Closes#8797 from 397090770/SparkOnYarnDocError.
Various ML guide cleanups.
* ml-guide.md: Make it easier to access the algorithm-specific guides.
* LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
* mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
* Clean up Binarizer user guide a little.
* Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
* spark.ml Word2Vec user guide: clean up grammar/writing
* Chi Sq Feature Selector docs: Improve text in doc.
CC: mengxr feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8752 from jkbradley/mlguide-fixes-1.5.
* a follow-up to 16b6d18613 as `--num-executors` flag is not suppported.
* links + formatting
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes#8762 from jaceklaskowski/docs-spark-on-yarn.
Links work now properly + consistent use of *Spark standalone cluster* (Spark uppercase + lowercase the rest -- seems agreed in the other places in the docs).
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes#8759 from jaceklaskowski/docs-submitting-apps.
The default value of hive metastore version is 1.2.1 but the documentation says the value of `spark.sql.hive.metastore.version` is 0.13.1.
Also, we cannot get the default value by `sqlContext.getConf("spark.sql.hive.metastore.version")`.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#8739 from sarutak/SPARK-10584.
I fixed to use LIBSVM data source in the example code in spark.ml instead of MLUtils
Author: y-shimizu <y.shimizu0429@gmail.com>
Closes#8697 from y-shimizu/SPARK-10518.
spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos Coarse grained mode.
If the parameter specified default value of 0 will be set for spark.scheduler.minRegisteredResourcesRatio in base class and this method will always return true.
There are no existing test for YARN mode too. Hence not added test for the same.
Author: Akash Mishra <akash.mishra20@gmail.com>
Closes#8672 from SleepyThread/master.
Small typo in the example for `LabelledPoint` in the MLLib docs.
Author: Sean Paradiso <seanparadiso@gmail.com>
Closes#8680 from sparadiso/docs_mllib_smalltypo.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#8656 from tdas/SPARK-10492 and squashes the following commits:
986cdd6 [Tathagata Das] Added information on backpressure
We introduced the Netty network module for shuffle in Spark 1.2, and has turned it on by default for 3 releases. The old ConnectionManager is difficult to maintain. If we merge the patch now, by the time it is released, it would be 1 yr for which ConnectionManager is off by default. It's time to remove it.
Author: Reynold Xin <rxin@databricks.com>
Closes#8161 from rxin/SPARK-9767.
- Fixed information around Python API tags in streaming programming guides
- Added missing stuff in python docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#8595 from tdas/SPARK-10440.
Support running pyspark with cluster mode on Mesos!
This doesn't upload any scripts, so if running in a remote Mesos requires the user to specify the script from a available URI.
Author: Timothy Chen <tnachen@gmail.com>
Closes#8349 from tnachen/mesos_python.
SPARK-4223.
Currently we support setting view and modify acls but you have to specify a list of users. It would be nice to support * meaning all users have access.
Manual tests to verify that: "*" works for any user in:
a. Spark ui: view and kill stage. Done.
b. Spark history server. Done.
c. Yarn application killing. Done.
Author: zhuol <zhuol@yahoo-inc.com>
Closes#8398 from zhuoliu/4223.
Migrate Apache download closer.cgi refs to new closer.lua
This is the bit of the change that affects the project docs; I'm implementing the changes to the Apache site separately.
Author: Sean Owen <sowen@cloudera.com>
Closes#8557 from srowen/SPARK-10398.
* The example code was added in 1.2, before `createDataFrame`. This PR switches to `createDataFrame`. Java code still uses JavaBean.
* assume `sqlContext` is available
* fix some minor issues from previous code review
jkbradley srowen feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#8518 from mengxr/SPARK-10331.
* replace `ML Dataset` by `DataFrame` to unify the abstraction
* ML algorithms -> pipeline components to describe the main concept
* remove Scala API doc links from the main guide
* `Section Title` -> `Section tile` to be consistent with other section titles in MLlib guide
* modified lines break at 100 chars or periods
jkbradley feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#8517 from mengxr/SPARK-10348.
This PR updates the MLlib user guide and adds migration guide for 1.4->1.5.
* merge migration guide for `spark.mllib` and `spark.ml` packages
* remove dependency section from `spark.ml` guide
* move the paragraph about `spark.mllib` and `spark.ml` to the top and recommend `spark.ml`
* move Sam's talk to footnote to make the section focus on dependencies
Minor changes to code examples and other wording will be in a separate PR.
jkbradley srowen feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#8498 from mengxr/SPARK-9671.
* Adds user guide for `LinearRegressionSummary`
* Fixes unresolved issues in #8197
CC jkbradley mengxr
Author: Feynman Liang <fliang@databricks.com>
Closes#8491 from feynmanliang/SPARK-9905.
I added a small note about the different types of evaluator and the metrics used.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#8304 from MechCoder/multiclass_evaluator.
https://issues.apache.org/jira/browse/SPARK-10287
After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet).
Author: Yin Huai <yhuai@databricks.com>
Closes#8469 from yhuai/jsonRefresh.
* Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow up Python user guide code example is tracked by SPARK-10249
Author: Feynman Liang <fliang@databricks.com>
Closes#8436 from feynmanliang/SPARK-10230.
jira: https://issues.apache.org/jira/browse/SPARK-9901
The jira covers only the document update. I can further provide example code for QR (like the ones for SVD and PCA) in a separate PR.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#8462 from hhbyyh/qrDoc.
https://issues.apache.org/jira/browse/SPARK-10315
this parameter is not used any longer and there is some mistake in the current document , should be 'akka.remote.watch-failure-detector.threshold'
Author: CodingCat <zhunansjtu@gmail.com>
Closes#8483 from CodingCat/SPARK_10315.
* Adds two new sections to LDA's user guide; one for each optimizer/model
* Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperpam optimization)
* Cleans up a TODO and sets a default parameter in LDA code
jkbradley hhbyyh
Author: Feynman Liang <fliang@databricks.com>
Closes#8254 from feynmanliang/SPARK-9888.
jira: https://issues.apache.org/jira/browse/SPARK-8531
Update ML user guide for MinMaxScaler
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com>
Closes#7211 from hhbyyh/minmaxdoc.
User guide for spark.ml GBTs and Random Forests.
The examples are copied from the decision tree guide and modified to run.
I caught some issues I had somehow missed in the tree guide as well.
I have run all examples, including Java ones. (Of course, I thought I had previously as well...)
CC: mengxr manishamde yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8369 from jkbradley/ml-ensemble-docs.
Update `lineLengths.persist();` to `lineLengths.persist(StorageLevel.MEMORY_ONLY());` because `JavaRDD#persist` needs a parameter of `StorageLevel`.
Author: Keiji Yoshida <yoshida.keiji.84@gmail.com>
Closes#8372 from yosssi/patch-1.
Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer.
Note that Python version does not support selecting by names now.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#8267 from yinxusen/SPARK-9893.
Added user guide for multilayer perceptron classifier:
- Simplified description of the multilayer perceptron classifier
- Example code for Scala and Java
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#8262 from avulanov/SPARK-9846-mlpc-docs.
This allows skipping the code that tries to talk to Hive and HBase to
fetch delegation tokens, in case that somehow conflicts with the application
being run.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8134 from vanzin/SPARK-9833.
1, Add Python example for mllib FP-growth user guide.
2, Correct mistakes of Scala and Java examples.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8279 from yanboliang/spark-10084.
New user guide section ml-decision-tree.md, including code examples.
I have run all examples, including the Java ones.
CC: manishamde yanboliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8244 from jkbradley/ml-dt-docs.
By using `StringIndexer`, we can obtain indexed label on new column. So a following estimator should use this new column through pipeline if it wants to use string indexed label.
I think it is better to make it explicit on documentation.
Author: lewuathe <lewuathe@me.com>
Closes#8205 from Lewuathe/SPARK-9977.
`Lists.newArrayList` -> `Arrays.asList`
CC jkbradley feynmanliang
Anybody into replacing usages of `Lists.newArrayList` in the examples / source code too? this method isn't useful in Java 7 and beyond.
Author: Sean Owen <sowen@cloudera.com>
Closes#8272 from srowen/SPARK-10070.
SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#7831 from avulanov/SPARK-9508-pregel-doc2.
Add a new test case in yarn/ClientSuite which checks how the various SparkConf
and ClientArguments propagate into the ApplicationSubmissionContext.
Author: Dennis Huo <dhuo@google.com>
Closes#8072 from dennishuo/dhuo-yarn-application-tags.
Adds user guide for `PrefixSpan`, including Scala and Java example code.
mengxr zhangjiajin
Author: Feynman Liang <fliang@databricks.com>
Closes#8253 from feynmanliang/SPARK-9898.
Add Python API, user guide and example for ml.feature.ElementwiseProduct.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8061 from yanboliang/SPARK-9768.
Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark 1.6.0.
Author: Reynold Xin <rxin@databricks.com>
Closes#8162 from rxin/SPARK-9934.
… allocation are set. Now, dynamic allocation is set to false when num-executors is explicitly specified as an argument. Consequently, executorAllocationManager in not initialized in the SparkContext.
Author: Niranjan Padmanabhan <niranjan.padmanabhan@cloudera.com>
Closes#7657 from neurons/SPARK-9092.
Some users like to download additional files in their sandbox that they can refer to from their spark program, or even later mount these files to another directory.
Author: Timothy Chen <tnachen@gmail.com>
Closes#7195 from tnachen/mesos_files.
This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes#8085 from ericl/docs.
This PR is based on #4229, thanks prabeesh.
Closes#4229
Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>
Closes#7833 from zsxwing/pr4229 and squashes the following commits:
9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
Author: Mahmoud Lababidi <lababidi@gmail.com>
Closes#8076 from lababidi/master and squashes the following commits:
af4553b [Mahmoud Lababidi] Fixed AtmoicReference<> Example
Straightforward fix on doc typo
Author: Jeff Zhang <zjffdu@apache.org>
Closes#8019 from zjffdu/master and squashes the following commits:
aed6e64 [Jeff Zhang] Fix doc typo
spark.sql.tungsten.enabled will be the default value for both codegen and unsafe, they are kept internally for debug/testing.
cc marmbrus rxin
Author: Davies Liu <davies@databricks.com>
Closes#7998 from davies/tungsten and squashes the following commits:
c1c16da [Davies Liu] update doc
1a47be1 [Davies Liu] use tungsten.enabled for both of codegen/unsafe
(cherry picked from commit 4e70e8256c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
spark.sql.tungsten.enabled will be the default value for both codegen and unsafe, they are kept internally for debug/testing.
cc marmbrus rxin
Author: Davies Liu <davies@databricks.com>
Closes#7998 from davies/tungsten and squashes the following commits:
c1c16da [Davies Liu] update doc
1a47be1 [Davies Liu] use tungsten.enabled for both of codegen/unsafe
Document spark.shuffle.service.{enabled,port}
CC sryza tgravescs
This is pretty minimal; is there more to say here about the service?
Author: Sean Owen <sowen@cloudera.com>
Closes#7991 from srowen/SPARK-9641 and squashes the following commits:
3bb946e [Sean Owen] Add link to docs for setup and config of external shuffle service
2302e01 [Sean Owen] Document spark.shuffle.service.{enabled,port}
mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits:
27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes.
ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix.
b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation.
c014002 [Mike Dusenberry] Using properties for better documentation.
3bda6ab [Mike Dusenberry] Adding documentation.
8fb3095 [Mike Dusenberry] Small cleanup.
e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark.
Author: Namit Katariya <katariya.namit@gmail.com>
Closes#7935 from namitk/SPARK-9601 and squashes the following commits:
03b5784 [Namit Katariya] [SPARK-9601] Fix signature of JavaPairDStream for stream-stream and windowed join in streaming guide doc
This pull request groups all the prereq requirements into a single section.
cc srowen shivaram
Author: Reynold Xin <rxin@databricks.com>
Closes#7951 from rxin/readme-docs and squashes the following commits:
ab7ded0 [Reynold Xin] Updated docs/README.md to put all prereqs together.
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:
bb039cb [Mike Dusenberry] Minor documentation update.
b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
7f0dcb6 [Mike Dusenberry] Updating module docstring.
cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
308f197 [Mike Dusenberry] Using properties for better documentation.
1633f86 [Mike Dusenberry] Minor documentation cleanup.
f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
3fd4016 [Mike Dusenberry] Updating docstrings.
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
4ad6819 [Mike Dusenberry] Documenting the and parameters.
3b854b9 [Mike Dusenberry] Minor updates to documentation.
10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic.
4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
6a3ecb7 [Mike Dusenberry] Updating pattern matching.
08f287b [Mike Dusenberry] Slight reformatting of the documentation.
a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints.
4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
7e3ca16 [Mike Dusenberry] Fixing long lines.
f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. This way, we can call for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
Author: Sean Owen <sowen@cloudera.com>
Closes#7905 from srowen/SPARK-9521.2 and squashes the following commits:
73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
This puts all the install commands that need to be run in one section instead of being spread over many paragraphs
cc rxin
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#7912 from shivaram/docs-setup-readme and squashes the following commits:
cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs
Add ml.PCA user guide document and code examples for Scala/Java/Python.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7522 from yanboliang/ml-pca-md and squashes the following commits:
60dec05 [Yanbo Liang] address comments
f992abe [Yanbo Liang] Add ml.PCA doc and examples
Now the memory defaults of master and slave in Standalone mode and History Server is 1g, not 512m. So let's update docs.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits:
a77626c [Kousuke Saruta] Fix docs to follow the update of increase of memory defaults
#7142 made codegen enabled by default so let's modify the corresponding documents.
Closes#7142
Author: KaiXinXiaoLei <huleilei1@huawei.com>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7863 from sarutak/SPARK-9535 and squashes the following commits:
0884424 [Kousuke Saruta] Removed a line which mentioned about the effect of codegen enabled
3c11af0 [Kousuke Saruta] Merge branch 'sqlconfig' of https://github.com/KaiXinXiaoLei/spark into SPARK-9535
4ee531d [KaiXinXiaoLei] delete space
4cfd11d [KaiXinXiaoLei] change spark.sql.planner.externalSort
d624cf8 [KaiXinXiaoLei] sql config is wrong
Use print(x) not print x for Python 3 in eval examples
CC sethah mengxr -- just wanted to close this out before 1.5
Author: Sean Owen <sowen@cloudera.com>
Closes#7822 from srowen/SPARK-9490 and squashes the following commits:
01abeba [Sean Owen] Change "print x" to "print(x)" in the rest of the docs too
bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples
This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
Author: zsxwing <zsxwing@gmail.com>
Closes#6955 from zsxwing/kinesis-python and squashes the following commits:
e42e471 [zsxwing] Merge branch 'master' into kinesis-python
455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
5082d28 [zsxwing] Fix the syntax error for Python 2.6
fca416b [zsxwing] Fix wrong comparison
96670ff [zsxwing] Fix the compilation error after merging master
756a128 [zsxwing] Merge branch 'master' into kinesis-python
6c37395 [zsxwing] Print stack trace for debug
7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
cc9d071 [zsxwing] Fix the python test errors
466b425 [zsxwing] Add python tests for Kinesis
e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
3da2601 [zsxwing] Fix the kinesis folder
687446b [zsxwing] Fix the error message and the maven output path
add2beb [zsxwing] Merge branch 'master' into kinesis-python
4957c0b [zsxwing] Add the Python API for Kinesis
Author: sethah <seth.hendrickson16@gmail.com>
Closes#7655 from sethah/Working_on_6129 and squashes the following commits:
253db2d [sethah] removed number formatting from example code
b769cab [sethah] rewording threshold section
d5dad4d [sethah] adding some explanations of concepts to the eval metrics user guide
3a61ff9 [sethah] Removing unnecessary latex commands from metrics guide
c9dd058 [sethah] Cleaning up and formatting metrics user guide section
6f31c21 [sethah] All example code for metrics section done
98813fe [sethah] Most java and python example code added. Further latex formatting
53a24fc [sethah] Adding documentations of metrics for ML algorithms to user guide
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7651 from vanzin/SPARK-9327 and squashes the following commits:
2923e23 [Marcelo Vanzin] [SPARK-9327] [docs] Fix documentation about classpath config options.
Pregel example to express single source shortest path from https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api does not work due to incorrect type. The reason is that `GraphGenerators.logNormalGraph` returns the graph with `Long` vertices. Fixing `val graph: Graph[Int, Double]` to `val graph: Graph[Long, Double]`.
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#7695 from avulanov/SPARK-9380-pregel-doc and squashes the following commits:
c269429 [Alexander Ulanov] Pregel example type fix
Some users may not be aware that the logs are available on Web UI even if Yarn log aggregation is enabled. Update the doc to make this clear and what need to be configured.
Author: Carson Wang <carson.wang@intel.com>
Closes#7463 from carsonwang/YarnLogDoc and squashes the following commits:
274c054 [Carson Wang] Minor text fix
74df3a1 [Carson Wang] address comments
5a95046 [Carson Wang] Update the text in the doc
e5775c1 [Carson Wang] Update doc about how to view the logs on Web UI when yarn log aggregation is enabled
PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now.
Author: Cheng Lian <lian@databricks.com>
Closes#7612 from liancheng/spark-9207 and squashes the following commits:
77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
Spark has an option called spark.localExecution.enabled; according to the docs:
> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
This pull request simply brings #7484 up to date.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#7585 from rxin/remove-local-exec and squashes the following commits:
84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.
- spark.akka.frameSize: This defaults at 10 but is often hit for map
output statuses in large shuffles. This memory is not fully allocated
up-front, so we can just make this larger and still not affect jobs
that never sent a status that large. We increase it to 128.
- spark.executor.memory: Defaults at 512m, which is really small. We
increase it to 1g.
Author: Matei Zaharia <matei@databricks.com>
Closes#7586 from mateiz/configs and squashes the following commits:
ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
Add support for saving and loading LDA both the local and distributed versions.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6948 from MechCoder/lda_save_load and squashes the following commits:
49bcdce [MechCoder] minor style fixes
cc14054 [MechCoder] minor
4587d1d [MechCoder] Minor changes
c753122 [MechCoder] Load and save the model in private methods
2782326 [MechCoder] [SPARK-5989] Model save/load for LDA
These commits address a few minor issues in the Scala cross-version support in the build:
1. Correct two missing `${scala.binary.version}` pom file substitutions.
2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.
This is my original work and I license this work to the Spark project under the Apache License.
Author: Michael Allman <michael@videoamp.com>
Closes#6832 from mallman/scala-versions and squashes the following commits:
cde2f17 [Michael Allman] Delete dev/change-version-to-*.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument
02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed
ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version
475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}
Mesos supports framework authentication and role to be set per framework, which the role is used to identify the framework's role which impacts the sharing weight of resource allocation and optional authentication information to allow the framework to be connected to the master.
Author: Timothy Chen <tnachen@gmail.com>
Closes#4960 from tnachen/mesos_fw_auth and squashes the following commits:
0f9f03e [Timothy Chen] Fix review comments.
8f9488a [Timothy Chen] Fix rebase
f7fc2a9 [Timothy Chen] Add mesos role, auth and secret.
jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`.
dbtsai I left the code tab for you to add example code. Do you think it is the right place?
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes#6504 from coderxiang/elasticnet and squashes the following commits:
f6061ee [Shuo Xiang] typo
90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
8747190 [Shuo Xiang] merge master
706d3f7 [Shuo Xiang] add python code
9bc2b4c [Shuo Xiang] typo
db32a60 [Shuo Xiang] java code sample
aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
a0dae07 [Shuo Xiang] simplify code
d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
78d9366 [Shuo Xiang] address comments
8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
9262a72 [Shuo Xiang] update
7e07d12 [Shuo Xiang] update
b32f21a [Shuo Xiang] add doc for elastic net in sparkml
937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
This allows Kmeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.
Author: FlytxtRnD <meethu.mathew@flytxt.com>
Closes#6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits:
94b56df [FlytxtRnD] style correction
ef95ee2 [FlytxtRnD] style correction
c446c58 [FlytxtRnD] documentation and numRuns warning change
06d13ef [FlytxtRnD] numRuns corrected
d12336e [FlytxtRnD] numRuns variable modifications
07f8554 [FlytxtRnD] remove setRuns from setIntialModel
e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
242ead1 [FlytxtRnD] corrected == to === in assert
714acb5 [FlytxtRnD] added numRuns
60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed
582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
3f5fc8e [FlytxtRnD] test case modified and one runs condition added
cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018
e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria
6959861 [FlytxtRnD] Accept initial cluster centers in KMeans
The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.".
The spark.kryoserializer.buffer.max.mb is out-of-date in spark 1.4.
Author: zhaishidan <zhaishidan@haizhi.com>
Closes#7393 from stanzhai/master and squashes the following commits:
69729ef [zhaishidan] fix document error about spark.kryoserializer.buffer.max.mb
This contribution is my original work and I license it to the project under it's open source license.
Author: jose.cambronero <jose.cambronero@cloudera.com>
Closes#6994 from josepablocam/master and squashes the following commits:
bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests
2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
e760ebd [jose.cambronero] line length changes to fit style check
3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
1226b30 [jose.cambronero] reindent multi-line lambdas, prior intepretation of style guide was wrong on my part
9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
b9cff3a [jose.cambronero] made small changes to pass style check
ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
4da189b [jose.cambronero] added user facing ks test functions
c659ea1 [jose.cambronero] created KS test class
13dfe4d [jose.cambronero] created test result class for ks test
pwendell and I discussed this a little more offline and concluded that it would be good to keep it more conservative. Losing cached blocks may be very expensive and we should only allow it if the user knows what he/she is doing.
FYI harishreedharan sryza.
Author: Andrew Or <andrew@databricks.com>
Closes#7329 from andrewor14/da-cached-timeout and squashes the following commits:
cef0b4e [Andrew Or] Change timeout to infinity