Replaced example code in mllib-dimensionality-reduction.md using
include_example
Author: Devaraj K <devaraj@apache.org>
Closes #11132 from devaraj-kavali/SPARK-13016.
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules.
Closes #10602
Closes #10897
Author: Bryan Cutler <cutlerb@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under the `docs` module,
and fixes similar typos in other comments, too.
## How was this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11300 from dongjoon-hyun/minor_fix_typos.
## What changes were proposed in this pull request?
This PR fixes some typos in the following documentation files.
* `NOTICE`, `configuration.md`, and `hardware-provisioning.md`.
## How was this patch tested?
manual tests
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11289 from dongjoon-hyun/minor_fix_typos_notice_and_confdoc.
## What changes were proposed in this pull request?
Clarify that 0.21 is only a **minimum** requirement.
## How was this patch tested?
It's a doc change, so no tests.
Author: Iulian Dragos <jaguarul@gmail.com>
Closes #11271 from dragos/patch-1.
Clarify that reduce functions need to be commutative, and fold functions do not
See https://github.com/apache/spark/pull/11091
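A minimal plain-Python sketch (no Spark involved) of why commutativity matters for reduce. `simulated_reduce` is a hypothetical helper, not a Spark API: it reduces each partition locally and then merges the partial results in whatever order they happen to arrive. It only illustrates the reduce half of the clarification.

```python
from functools import reduce
from itertools import permutations


def simulated_reduce(partitions, op, arrival_order):
    """Reduce each partition locally, then merge partials in arrival order."""
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, [partials[i] for i in arrival_order])


# Every possible order in which three partial results could arrive.
orders = list(permutations(range(3)))

# String concatenation is associative but NOT commutative,
# so the answer depends on the arrival order.
letters = [["a", "b"], ["c"], ["d", "e"]]
concat_results = {simulated_reduce(letters, lambda x, y: x + y, o) for o in orders}

# Integer addition is associative AND commutative: one answer for every order.
numbers = [[1, 2], [3], [4, 5]]
sum_results = {simulated_reduce(numbers, lambda x, y: x + y, o) for o in orders}

print(sorted(concat_results))
print(sum_results)
```

With a non-commutative function the set of possible results has more than one element, which is the situation the doc change warns about.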
Author: Sean Owen <sowen@cloudera.com>
Closes #11217 from srowen/SPARK-13339.
Phase 1: update plugin versions, test dependencies, some example and third-party versions
Author: Sean Owen <sowen@cloudera.com>
Closes #11206 from srowen/SPARK-13324.
https://issues.apache.org/jira/browse/SPARK-11627
Spark Streaming's backpressure mechanism has no initial input rate limit, which might cause an OOM exception.
In the first batch, receivers receive data at the maximum speed they can reach, which might exhaust executor memory. Adding an initial input rate limit ensures that the streaming job executes successfully in the first batch; the backpressure mechanism can then adjust the receiving rate adaptively.
Author: junhao <junhao@mogujie.com>
Closes #9593 from junhaoMg/junhao-dev.
This documents the implementation of ALS in `spark.ml` with example code in Scala, Java and Python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10411 from BenFradet/SPARK-12247.
Replace example code in mllib-pmml-model-export.md using include_example
https://issues.apache.org/jira/browse/SPARK-13018
The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`, pick the code blocks marked "example", and use them to replace the code block in
`{% highlight %}`
in the markdown.
See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
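A rough Python sketch of the extraction step described above, not the actual Jekyll (Ruby) plugin. The marker strings `$example on$` and `$example off$`, the helper name `extract_example`, and the sample source are assumptions made for illustration.

```python
def extract_example(source_text, on_marker="$example on$", off_marker="$example off$"):
    """Return only the lines that sit between the on/off marker lines."""
    selected = []
    inside = False
    for line in source_text.splitlines():
        if on_marker in line:
            inside = True       # start copying after this marker line
        elif off_marker in line:
            inside = False      # stop copying at this marker line
        elif inside:
            selected.append(line)
    return "\n".join(selected)


# Hypothetical example file content; only the marked region should survive.
sample = """\
package org.example
// $example on$
val model = train(data)
model.save(path)
// $example off$
object Footer
"""

snippet = extract_example(sample)
print(snippet)
```

The marker lines themselves are dropped, so only the code a reader should see ends up in the user guide.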
Author: Xin Ren <iamshrek@126.com>
Closes #11126 from keypointt/SPARK-13018.
Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.
This contribution is my original work and I license the work to this project.
Author: JeremyNixon <jnixon2@gmail.com>
Closes #11199 from JeremyNixon/update_train_val_split_example.
It looks like the pygments.rb gem is also required for the Jekyll build to work. At least on Ubuntu/RHEL, I could not build without this dependency, so I added it to the steps.
Author: Amit Dev <amitdev@gmail.com>
Closes #11180 from amitdev/master.
This JIRA is related to
https://github.com/apache/spark/pull/5852
Had to do some minor rework and testing to make sure it
works with the current version of Spark.
Author: Sanket <schintap@untilservice-lm>
Closes #10838 from redsanket/limit-outbound-connections.
When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger *filesize*. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.
https://issues.apache.org/jira/browse/SPARK-7889
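A minimal sketch of the refresh check described above, in plain Python rather than the HistoryServer's Scala. `LoadedApp` and `needs_refresh` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadedApp:
    app_id: str
    file_size: int   # size of the event log when this version was loaded
    completed: bool


def needs_refresh(shown: LoadedApp, latest: LoadedApp) -> bool:
    """True when an incomplete app on display has a newer version, judged by a larger file size."""
    return (not shown.completed) and latest.file_size > shown.file_size


shown = LoadedApp("app-1", file_size=1000, completed=False)
grown = LoadedApp("app-1", file_size=1500, completed=False)
finished = LoadedApp("app-1", file_size=1000, completed=True)

print(needs_refresh(shown, grown))     # the log grew, so swap the UI
print(needs_refresh(shown, shown))     # same size, keep the current UI
print(needs_refresh(finished, grown))  # completed apps are never re-checked
```

When the check returns true, the caller would detach the old UI, attach the new one, and redirect to the same URL.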
Author: Steve Loughran <stevel@hortonworks.com>
Author: Imran Rashid <irashid@cloudera.com>
Closes #11118 from squito/SPARK-7889-alternate.
spark-env.sh.template contains multi-byte characters; this PR removes them.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.
Remove spark.closure.serializer option and use JavaSerializer always
CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.
Author: Sean Owen <sowen@cloudera.com>
Closes #11150 from srowen/SPARK-12414.
This is the next iteration of tnachen's previous PR: https://github.com/apache/spark/pull/4027
In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone. This PR implements that resolution.
This PR implements two high-level features. These two features are co-dependent, so both are implemented here:
- Mesos support for spark.executor.cores
- Multiple executors per slave
We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.
The contribution is my original work and I license the work to the project under the project's open source license.
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #10993 from mgummelt/executor_sizing.
Fix for [SPARK-13002](https://issues.apache.org/jira/browse/SPARK-13002) about the initial number of executors when running with dynamic allocation on Mesos.
Instead of fixing it just for the Mesos case, made the change in `ExecutorAllocationManager`. It is already driving the number of executors running on Mesos, only not the initial value.
The `None` and `Some(0)` are internal details of the computation of resources to reserve in the Mesos backend scheduler. `executorLimitOption` has to be initialized correctly; otherwise the Mesos backend scheduler will either create too many executors at launch, or not create any executors and not be able to recover from this state.
Removed the 'special case' description in the doc. It was not totally accurate, and is not needed anymore.
This doesn't fix the same problem visible with Spark standalone. There is no straightforward way to send the initial value in standalone mode.
Someone who knows this part of the YARN support should review this change.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes #11047 from skyluc/issue/initial-dyn-alloc-2.
Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings.
Author: Timothy Chen <tnachen@gmail.com>
Closes #10057 from tnachen/fix_mesos_dir.
ISTM `lib` is better because `datanucleus` jars are located in `lib` for release builds.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #10901 from maropu/DocFix.
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10608 from JoshRosen/SPARK-6363.
This is stated for --packages and --repositories. Without stating it for --jars, people expect a standard Java classpath to work, with expansion and a delimiter other than a comma. Currently this is only stated in the --help for spark-submit: "Comma-separated list of local jars to include on the driver and executor classpaths."
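A small sketch of the conversion a user would have to do by hand, assuming (as the text implies) that `--jars` takes a comma-separated list and does no wildcard expansion. `classpath_to_jars_arg` is a hypothetical helper, not part of spark-submit.

```python
import os


def classpath_to_jars_arg(classpath, sep=os.pathsep):
    """Turn a classpath such as 'a.jar:b.jar' into 'a.jar,b.jar'."""
    entries = [entry for entry in classpath.split(sep) if entry]
    for entry in entries:
        if "*" in entry:
            # Classpath wildcards are a java launcher feature; list jars explicitly.
            raise ValueError("wildcard entries are not supported: " + entry)
    return ",".join(entries)


jars_arg = classpath_to_jars_arg("/opt/libs/a.jar:/opt/libs/b.jar", sep=":")
print("--jars " + jars_arg)
```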
Author: James Lohse <jimlohse@users.noreply.github.com>
Closes #10890 from jimlohse/patch-1.
JIRA 1680 added a property called spark.yarn.appMasterEnv. This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables
Author: Andrew <weiner.andrew.j@gmail.com>
Closes #10869 from weineran/branch-yarn-docs.
Since `actorStream` is an external project, we should add the linking and deploying instructions for it.
A follow-up PR to #10744
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10856 from zsxwing/akka-link-instruction.
Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
CC rxin pwendell for API change; tdas since it also touches streaming.
Author: Sean Owen <sowen@cloudera.com>
Closes #10413 from srowen/SPARK-3369.
Update user guide for RFormula feature interactions. We also document other new features in Spark 1.6, such as support for string labels.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10222 from yanboliang/spark-11965.
Clarify that modifying a driver local variable won't have the desired effect in cluster modes, and may or may not work as intended in local mode
Author: Sean Owen <sowen@cloudera.com>
Closes #10866 from srowen/SPARK-12760.
…local vs cluster
srowen, thanks for the PR at https://github.com/apache/spark/pull/10866! Sorry it took me a while.
This is related to https://github.com/apache/spark/pull/10866. Basically, the assignment in the lambda expression in the Python example is invalid:
```
In [1]: data = [1, 2, 3, 4, 5]
In [2]: counter = 0
In [3]: rdd = sc.parallelize(data)
In [4]: rdd.foreach(lambda x: counter += x)
File "<ipython-input-4-fcb86c182bad>", line 1
rdd.foreach(lambda x: counter += x)
^
SyntaxError: invalid syntax
```
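The same point can be checked without a Spark shell. This is a small sketch that uses Python's built-in `compile` to show that an augmented assignment is rejected inside a lambda but accepted inside a named function; `compiles` and `add_to_counter` are illustrative names.

```python
def compiles(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# A lambda body must be an expression, and `counter += x` is a statement.
lambda_ok = compiles("f = lambda x: counter += x")

# The same update written as a regular function is valid syntax.
function_ok = compiles(
    "def add_to_counter(x):\n"
    "    global counter\n"
    "    counter += x\n"
)

print(lambda_ok, function_ok)
```

Valid syntax alone does not make the counter example correct: as the related PR clarifies, modifying a driver local variable this way won't have the desired effect in cluster modes.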
Author: Mortada Mehyar <mortada.mehyar@gmail.com>
Closes #10867 from mortada/doc_python_fix.
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10854 from zsxwing/remove-akka.
Several Spark properties equivalent to Spark submit command line options are missing.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10491 from felixcheung/sparksubmitdoc.
Include the following changes:
1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
3. Update the ActorWordCount example and add the JavaActorWordCount example
4. Make "streaming-zeromq" depend on "streaming-akka" and update the code accordingly
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10744 from zsxwing/streaming-akka-2.
shivaram, sorry it took longer to fix some conflicts; this is the change to add an alias for `table`
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10406 from felixcheung/readtable.
This PR adds instructions for Python users to get the Kinesis assembly jar to the Kinesis integration page, as in the Kafka doc.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10822 from zsxwing/kinesis-doc.
This PR adds instructions for Python users to get the Flume assembly jar to the Flume integration page, as in the Kafka doc.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10746 from zsxwing/flume-doc.
http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
```
val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
```
should be
```
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
```
cc: jkbradley
Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu>
Closes #10769 from Agent007/SPARK-12722.
This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
/cc rxin srowen
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10775 from JoshRosen/add-hadoop-2.7-profile.
Fixed WSSSE computeCost in Python mllib KMeans user guide example by using new computeCost method API in Python.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #10707 from jkbradley/kmeans-doc-fix.
The default run mode has changed, but the documentation didn't fully reflect the change.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes #10740 from skyluc/issue/mesos-modes-doc.
Use a much smaller step size in LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE.
Our training folks hit this exact same issue when concocting an example and had the same solution.
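A self-contained sketch of the effect, using plain full-batch gradient descent in Python instead of `LinearRegressionWithSGD` so that the result is deterministic. The data, step sizes, and helper names are made up for illustration and are not the values used in the MLlib examples.

```python
import math


def fit_slope(xs, ys, step_size, iterations=50):
    """Full-batch gradient descent for y ~ w * x under mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradient of (1/2n) * sum((w*x - y)^2) with respect to w.
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step_size * grad
    return w


def rmse(xs, ys, w):
    """Root mean squared error of the fitted line y = w * x."""
    return math.sqrt(sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Unscaled features 1..10 with a true slope of 2.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]

rmse_large_step = rmse(xs, ys, fit_slope(xs, ys, step_size=1.0))
rmse_small_step = rmse(xs, ys, fit_slope(xs, ys, step_size=0.01))

print(rmse_large_step, rmse_small_step)
```

With unscaled features, each update with the large step overshoots by more than the previous error, so the error grows every iteration; the small step shrinks it instead.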
Author: Sean Owen <sowen@cloudera.com>
Closes #10675 from srowen/SPARK-5273.
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)
See also https://github.com/apache/spark/pull/10512
Author: Sean Owen <sowen@cloudera.com>
Closes #10513 from srowen/SPARK-4819.
`spark.shuffle.service.enabled` is a Spark application configuration; it does not need to be set in yarn-site.xml.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10657 from zjffdu/doc-fix.
Change the default value of 'spark.memory.offHeap.enabled' to false
Author: zzcclp <xm_zzc@sina.com>
Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard-to-diagnose bugs.
For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
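A toy Python sketch of the failure mode described above, not Spark's actual metadata cleaner. `TtlMetadataStore` is a hypothetical class that forgets entries purely by age, so a TTL that is too low drops data a job still needs.

```python
class TtlMetadataStore:
    """Toy store that forgets entries older than `ttl` seconds, used or not."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._entries = {}  # key -> (value, insertion timestamp)

    def put(self, key, value, now):
        self._entries[key] = (value, now)

    def clean(self, now):
        """Drop every entry whose age exceeds the TTL."""
        expired = [k for k, (_, ts) in self._entries.items() if now - ts > self.ttl]
        for key in expired:
            del self._entries[key]

    def get(self, key):
        entry = self._entries.get(key)
        return entry[0] if entry else None


store = TtlMetadataStore(ttl=10)
store.put("rdd_1", "block metadata", now=0)

before = store.get("rdd_1")
store.clean(now=11)          # TTL elapsed while a long job still needs rdd_1
after = store.get("rdd_1")

print(before, after)
```

The per-entry timestamp kept in the dictionary is also the kind of bookkeeping that goes away once TTL-based cleaning is removed.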
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
Also, in the documentation, it is said that:
"The default metric used to choose the best ParamMap can be overriden by the setMetric method in each of these evaluators."
However, the method is called setMetricName.
This PR aims to fix both issues.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10328 from BenFradet/SPARK-12368.
Update user guide doc for ```DecisionTreeRegressor``` providing variance of prediction.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10594 from yanboliang/spark-12570.
Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.
In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.
This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).
If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
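A simplified Python simulation of the driver-precedence problem, not the Java/Scala code in this patch. `FakeDriver` and the class names are hypothetical, and `first_accepting` models a registration-order lookup, which is an assumption about how a plain `DriverManager.getConnection()` call picks a driver.

```python
class FakeDriver:
    """Stand-in for a JDBC driver; only models URL matching."""

    def __init__(self, class_name, subprotocol):
        self.class_name = class_name
        self.subprotocol = subprotocol

    def accepts_url(self, url):
        return url.startswith("jdbc:" + self.subprotocol + ":")


def first_accepting(registered, url):
    """Registration order decides: the first driver that accepts the URL wins."""
    return next(d for d in registered if d.accepts_url(url))


def by_class_name(registered, class_name):
    """Iterate over the loaded drivers and pick the class the user asked for."""
    for driver in registered:
        if driver.class_name == class_name:
            return driver
    raise LookupError("driver not registered: " + class_name)


# Two drivers that both claim the same subprotocol (hypothetical class names).
registered = [
    FakeDriver("com.example.VendorADriver", "postgresql"),
    FakeDriver("com.example.VendorBDriver", "postgresql"),
]
url = "jdbc:postgresql://host/db"

implicit = first_accepting(registered, url)
explicit = by_class_name(registered, "com.example.VendorBDriver")

print(implicit.class_name, explicit.class_name)
```

Selecting by class name returns the requested driver even though another registered driver accepts the same URL first.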
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10519 from JoshRosen/jdbc-driver-precedence.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.
Author: Reynold Xin <rxin@databricks.com>
Closes #10531 from rxin/SPARK-12588.
This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10385 from zsxwing/accumulator-broadcast-example.
According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
After changing the compressor to LZ4, I saw a 20% improvement in end-to-end time for a TPCDS query (Q4).
[1] https://github.com/ning/jvm-compressor-benchmark/wiki
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #10342 from davies/lz4.
The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.
davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10092 from gatorsmile/persistStorageLevel.
- Provide example on `message handler`
- Provide bit on KPL record de-aggregation
- Fix typos
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #9970 from brkyvz/kinesis-docs.
No known breaking changes, but some deprecations and changes of behavior.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #10235 from jkbradley/mllib-guide-update-1.6.
This PR includes only example code, in order to finish it quickly.
I'll send another PR for the docs soon.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #9952 from yu-iskw/SPARK-6518.
Adding more documentation about submitting jobs with Mesos cluster mode.
Author: Timothy Chen <tnachen@gmail.com>
Closes #10086 from tnachen/mesos_supervise_docs.
Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation.
I wonder if I should also add a snippet to the code example; input welcome.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10257 from BenFradet/SPARK-12217.
Adding in Pipeline Import and Export Documentation.
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes #10179 from anabranch/master.
With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as Scala/Java, so this changes the description to make it more precise.
zsxwing tdas, please review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes #10246 from jerryshao/direct-kafka-doc-update.
This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.
- Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
- Deprecate `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
- Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion.
- Document these configurations on the configuration page.
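A minimal Python sketch of the validation described in the list above, not the Scala check added by this patch. `validate_off_heap` is a hypothetical helper, and for simplicity it assumes the size is given as a plain number of bytes.

```python
def validate_off_heap(conf):
    """Reject off-heap memory that is enabled without a positive size."""
    enabled = conf.get("spark.memory.offHeap.enabled", "false").lower() == "true"
    size = int(conf.get("spark.memory.offHeap.size", "0"))  # bytes, simplified
    if enabled and size <= 0:
        raise ValueError(
            "spark.memory.offHeap.size must be > 0 "
            "when spark.memory.offHeap.enabled is true"
        )
    return enabled, size


good = validate_off_heap({
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "1073741824",
})

try:
    validate_off_heap({"spark.memory.offHeap.enabled": "true"})
    rejected = False
except ValueError:
    rejected = True

print(good, rejected)
```

Failing fast at configuration time replaces the immediate OOMs that would otherwise be the first visible symptom.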
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10237 from JoshRosen/SPARK-12251.
This avoids bringing up yet another HTTP server on the driver, and
instead reuses the file server already managed by the driver's
RpcEnv. As a bonus, the repl now inherits the security features of
the network library.
There's also a small change to create the directory for storing classes
under the root temp dir for the application (instead of directly
under java.io.tmpdir).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9923 from vanzin/SPARK-11563.
Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).
It also removes some files that I forgot to delete with #10207.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #10234 from thunterdb/12212.
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10166 from BenFradet/SPARK-12159.
This reverts PR #10002, commit 78209b0cca.
The original PR wasn't tested on Jenkins before being merged.
Author: Cheng Lian <lian@databricks.com>
Closes #10200 from liancheng/revert-pr-10002.
Add ```SQLTransformer``` user guide and example code, and make the Scala API doc clearer.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10006 from yanboliang/spark-11958.
Made a new patch containing only the markdown examples moved to the example folder.
Only three Java classes were not moved, since they contained compilation errors. These classes are:
1) StandardScale 2) NormalizerExample 3) VectorIndexer
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #10002 from somideshmukh/SomilBranch1.33.
The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. the default 1GB heap leaves only about 250MB of system memory. This is especially a problem in local mode, where the driver and executor are crammed in the same JVM. Members of the community have reported driver OOMs in such cases.
**New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
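The arithmetic above, written out as a small Python sketch. `usable_memory_mb` is an illustrative helper, not a Spark function; the 300MB reservation and the 0.75 fraction come from the description.

```python
RESERVED_MB = 300        # reserved before the fraction is applied
MEMORY_FRACTION = 0.75   # default spark.memory.fraction in this proposal


def usable_memory_mb(heap_mb, reserved_mb=RESERVED_MB, fraction=MEMORY_FRACTION):
    """Memory left for execution and storage after the reservation."""
    return (heap_mb - reserved_mb) * fraction


heap_mb = 1024
old_usable = heap_mb * MEMORY_FRACTION   # old scheme: fraction of the whole heap
new_usable = usable_memory_mb(heap_mb)   # new scheme: reserve first, then take the fraction

print(old_usable, heap_mb - old_usable)  # execution+storage vs. what is left for the system
print(new_usable, heap_mb - new_usable)
```

For a 1GB heap, the memory left outside execution and storage grows from 256MB to 481MB, which is the headroom the proposal is after.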
Author: Andrew Or <andrew@databricks.com>
Closes #10081 from andrewor14/unified-memory-small-heaps.