Update `lineLengths.persist();` to `lineLengths.persist(StorageLevel.MEMORY_ONLY());` because `JavaRDD#persist` needs a parameter of `StorageLevel`.
Author: Keiji Yoshida <yoshida.keiji.84@gmail.com>
Closes#8372 from yosssi/patch-1.
Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer.
Note that Python version does not support selecting by names now.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#8267 from yinxusen/SPARK-9893.
Added user guide for multilayer perceptron classifier:
- Simplified description of the multilayer perceptron classifier
- Example code for Scala and Java
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#8262 from avulanov/SPARK-9846-mlpc-docs.
This allows skipping the code that tries to talk to Hive and HBase to
fetch delegation tokens, in case that somehow conflicts with the application
being run.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8134 from vanzin/SPARK-9833.
1, Add Python example for mllib FP-growth user guide.
2, Correct mistakes of Scala and Java examples.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8279 from yanboliang/spark-10084.
New user guide section ml-decision-tree.md, including code examples.
I have run all examples, including the Java ones.
CC: manishamde yanboliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8244 from jkbradley/ml-dt-docs.
By using `StringIndexer`, we can obtain indexed label on new column. So a following estimator should use this new column through pipeline if it wants to use string indexed label.
I think it is better to make it explicit on documentation.
Author: lewuathe <lewuathe@me.com>
Closes#8205 from Lewuathe/SPARK-9977.
`Lists.newArrayList` -> `Arrays.asList`
CC jkbradley feynmanliang
Anybody into replacing usages of `Lists.newArrayList` in the examples / source code too? this method isn't useful in Java 7 and beyond.
Author: Sean Owen <sowen@cloudera.com>
Closes#8272 from srowen/SPARK-10070.
SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#7831 from avulanov/SPARK-9508-pregel-doc2.
Add a new test case in yarn/ClientSuite which checks how the various SparkConf
and ClientArguments propagate into the ApplicationSubmissionContext.
Author: Dennis Huo <dhuo@google.com>
Closes#8072 from dennishuo/dhuo-yarn-application-tags.
Adds user guide for `PrefixSpan`, including Scala and Java example code.
mengxr zhangjiajin
Author: Feynman Liang <fliang@databricks.com>
Closes#8253 from feynmanliang/SPARK-9898.
Add Python API, user guide and example for ml.feature.ElementwiseProduct.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8061 from yanboliang/SPARK-9768.
Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark 1.6.0.
Author: Reynold Xin <rxin@databricks.com>
Closes#8162 from rxin/SPARK-9934.
… allocation are set. Now, dynamic allocation is set to false when num-executors is explicitly specified as an argument. Consequently, executorAllocationManager in not initialized in the SparkContext.
Author: Niranjan Padmanabhan <niranjan.padmanabhan@cloudera.com>
Closes#7657 from neurons/SPARK-9092.
Some users like to download additional files in their sandbox that they can refer to from their spark program, or even later mount these files to another directory.
Author: Timothy Chen <tnachen@gmail.com>
Closes#7195 from tnachen/mesos_files.
This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes#8085 from ericl/docs.
This PR is based on #4229, thanks prabeesh.
Closes#4229
Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>
Closes#7833 from zsxwing/pr4229 and squashes the following commits:
9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
Author: Mahmoud Lababidi <lababidi@gmail.com>
Closes#8076 from lababidi/master and squashes the following commits:
af4553b [Mahmoud Lababidi] Fixed AtmoicReference<> Example
Straightforward fix on doc typo
Author: Jeff Zhang <zjffdu@apache.org>
Closes#8019 from zjffdu/master and squashes the following commits:
aed6e64 [Jeff Zhang] Fix doc typo
spark.sql.tungsten.enabled will be the default value for both codegen and unsafe, they are kept internally for debug/testing.
cc marmbrus rxin
Author: Davies Liu <davies@databricks.com>
Closes#7998 from davies/tungsten and squashes the following commits:
c1c16da [Davies Liu] update doc
1a47be1 [Davies Liu] use tungsten.enabled for both of codegen/unsafe
(cherry picked from commit 4e70e8256c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
spark.sql.tungsten.enabled will be the default value for both codegen and unsafe, they are kept internally for debug/testing.
cc marmbrus rxin
Author: Davies Liu <davies@databricks.com>
Closes#7998 from davies/tungsten and squashes the following commits:
c1c16da [Davies Liu] update doc
1a47be1 [Davies Liu] use tungsten.enabled for both of codegen/unsafe
Document spark.shuffle.service.{enabled,port}
CC sryza tgravescs
This is pretty minimal; is there more to say here about the service?
Author: Sean Owen <sowen@cloudera.com>
Closes#7991 from srowen/SPARK-9641 and squashes the following commits:
3bb946e [Sean Owen] Add link to docs for setup and config of external shuffle service
2302e01 [Sean Owen] Document spark.shuffle.service.{enabled,port}
mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits:
27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes.
ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix.
b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation.
c014002 [Mike Dusenberry] Using properties for better documentation.
3bda6ab [Mike Dusenberry] Adding documentation.
8fb3095 [Mike Dusenberry] Small cleanup.
e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark.
Author: Namit Katariya <katariya.namit@gmail.com>
Closes#7935 from namitk/SPARK-9601 and squashes the following commits:
03b5784 [Namit Katariya] [SPARK-9601] Fix signature of JavaPairDStream for stream-stream and windowed join in streaming guide doc
This pull request groups all the prereq requirements into a single section.
cc srowen shivaram
Author: Reynold Xin <rxin@databricks.com>
Closes#7951 from rxin/readme-docs and squashes the following commits:
ab7ded0 [Reynold Xin] Updated docs/README.md to put all prereqs together.
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:
bb039cb [Mike Dusenberry] Minor documentation update.
b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
7f0dcb6 [Mike Dusenberry] Updating module docstring.
cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
308f197 [Mike Dusenberry] Using properties for better documentation.
1633f86 [Mike Dusenberry] Minor documentation cleanup.
f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
3fd4016 [Mike Dusenberry] Updating docstrings.
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
4ad6819 [Mike Dusenberry] Documenting the and parameters.
3b854b9 [Mike Dusenberry] Minor updates to documentation.
10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic.
4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
6a3ecb7 [Mike Dusenberry] Updating pattern matching.
08f287b [Mike Dusenberry] Slight reformatting of the documentation.
a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints.
4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
7e3ca16 [Mike Dusenberry] Fixing long lines.
f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. This way, we can call for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
Author: Sean Owen <sowen@cloudera.com>
Closes#7905 from srowen/SPARK-9521.2 and squashes the following commits:
73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
This puts all the install commands that need to be run in one section instead of being spread over many paragraphs
cc rxin
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#7912 from shivaram/docs-setup-readme and squashes the following commits:
cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs
Add ml.PCA user guide document and code examples for Scala/Java/Python.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7522 from yanboliang/ml-pca-md and squashes the following commits:
60dec05 [Yanbo Liang] address comments
f992abe [Yanbo Liang] Add ml.PCA doc and examples
Now the memory defaults of master and slave in Standalone mode and History Server is 1g, not 512m. So let's update docs.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits:
a77626c [Kousuke Saruta] Fix docs to follow the update of increase of memory defaults
#7142 made codegen enabled by default so let's modify the corresponding documents.
Closes#7142
Author: KaiXinXiaoLei <huleilei1@huawei.com>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7863 from sarutak/SPARK-9535 and squashes the following commits:
0884424 [Kousuke Saruta] Removed a line which mentioned about the effect of codegen enabled
3c11af0 [Kousuke Saruta] Merge branch 'sqlconfig' of https://github.com/KaiXinXiaoLei/spark into SPARK-9535
4ee531d [KaiXinXiaoLei] delete space
4cfd11d [KaiXinXiaoLei] change spark.sql.planner.externalSort
d624cf8 [KaiXinXiaoLei] sql config is wrong
Use print(x) not print x for Python 3 in eval examples
CC sethah mengxr -- just wanted to close this out before 1.5
Author: Sean Owen <sowen@cloudera.com>
Closes#7822 from srowen/SPARK-9490 and squashes the following commits:
01abeba [Sean Owen] Change "print x" to "print(x)" in the rest of the docs too
bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples
This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
Author: zsxwing <zsxwing@gmail.com>
Closes#6955 from zsxwing/kinesis-python and squashes the following commits:
e42e471 [zsxwing] Merge branch 'master' into kinesis-python
455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
5082d28 [zsxwing] Fix the syntax error for Python 2.6
fca416b [zsxwing] Fix wrong comparison
96670ff [zsxwing] Fix the compilation error after merging master
756a128 [zsxwing] Merge branch 'master' into kinesis-python
6c37395 [zsxwing] Print stack trace for debug
7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
cc9d071 [zsxwing] Fix the python test errors
466b425 [zsxwing] Add python tests for Kinesis
e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
3da2601 [zsxwing] Fix the kinesis folder
687446b [zsxwing] Fix the error message and the maven output path
add2beb [zsxwing] Merge branch 'master' into kinesis-python
4957c0b [zsxwing] Add the Python API for Kinesis
Author: sethah <seth.hendrickson16@gmail.com>
Closes#7655 from sethah/Working_on_6129 and squashes the following commits:
253db2d [sethah] removed number formatting from example code
b769cab [sethah] rewording threshold section
d5dad4d [sethah] adding some explanations of concepts to the eval metrics user guide
3a61ff9 [sethah] Removing unnecessary latex commands from metrics guide
c9dd058 [sethah] Cleaning up and formatting metrics user guide section
6f31c21 [sethah] All example code for metrics section done
98813fe [sethah] Most java and python example code added. Further latex formatting
53a24fc [sethah] Adding documentations of metrics for ML algorithms to user guide
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7651 from vanzin/SPARK-9327 and squashes the following commits:
2923e23 [Marcelo Vanzin] [SPARK-9327] [docs] Fix documentation about classpath config options.
Pregel example to express single source shortest path from https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api does not work due to incorrect type. The reason is that `GraphGenerators.logNormalGraph` returns the graph with `Long` vertices. Fixing `val graph: Graph[Int, Double]` to `val graph: Graph[Long, Double]`.
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#7695 from avulanov/SPARK-9380-pregel-doc and squashes the following commits:
c269429 [Alexander Ulanov] Pregel example type fix
Some users may not be aware that the logs are available on Web UI even if Yarn log aggregation is enabled. Update the doc to make this clear and what need to be configured.
Author: Carson Wang <carson.wang@intel.com>
Closes#7463 from carsonwang/YarnLogDoc and squashes the following commits:
274c054 [Carson Wang] Minor text fix
74df3a1 [Carson Wang] address comments
5a95046 [Carson Wang] Update the text in the doc
e5775c1 [Carson Wang] Update doc about how to view the logs on Web UI when yarn log aggregation is enabled
PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now.
Author: Cheng Lian <lian@databricks.com>
Closes#7612 from liancheng/spark-9207 and squashes the following commits:
77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
Spark has an option called spark.localExecution.enabled; according to the docs:
> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
This pull request simply brings #7484 up to date.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#7585 from rxin/remove-local-exec and squashes the following commits:
84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.
- spark.akka.frameSize: This defaults at 10 but is often hit for map
output statuses in large shuffles. This memory is not fully allocated
up-front, so we can just make this larger and still not affect jobs
that never sent a status that large. We increase it to 128.
- spark.executor.memory: Defaults at 512m, which is really small. We
increase it to 1g.
Author: Matei Zaharia <matei@databricks.com>
Closes#7586 from mateiz/configs and squashes the following commits:
ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
Add support for saving and loading LDA both the local and distributed versions.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6948 from MechCoder/lda_save_load and squashes the following commits:
49bcdce [MechCoder] minor style fixes
cc14054 [MechCoder] minor
4587d1d [MechCoder] Minor changes
c753122 [MechCoder] Load and save the model in private methods
2782326 [MechCoder] [SPARK-5989] Model save/load for LDA
These commits address a few minor issues in the Scala cross-version support in the build:
1. Correct two missing `${scala.binary.version}` pom file substitutions.
2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.
This is my original work and I license this work to the Spark project under the Apache License.
Author: Michael Allman <michael@videoamp.com>
Closes#6832 from mallman/scala-versions and squashes the following commits:
cde2f17 [Michael Allman] Delete dev/change-version-to-*.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument
02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed
ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version
475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}
Mesos supports framework authentication and role to be set per framework, which the role is used to identify the framework's role which impacts the sharing weight of resource allocation and optional authentication information to allow the framework to be connected to the master.
Author: Timothy Chen <tnachen@gmail.com>
Closes#4960 from tnachen/mesos_fw_auth and squashes the following commits:
0f9f03e [Timothy Chen] Fix review comments.
8f9488a [Timothy Chen] Fix rebase
f7fc2a9 [Timothy Chen] Add mesos role, auth and secret.
jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`.
dbtsai I left the code tab for you to add example code. Do you think it is the right place?
Author: Shuo Xiang <shuoxiangpub@gmail.com>
Closes#6504 from coderxiang/elasticnet and squashes the following commits:
f6061ee [Shuo Xiang] typo
90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
8747190 [Shuo Xiang] merge master
706d3f7 [Shuo Xiang] add python code
9bc2b4c [Shuo Xiang] typo
db32a60 [Shuo Xiang] java code sample
aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
a0dae07 [Shuo Xiang] simplify code
d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
78d9366 [Shuo Xiang] address comments
8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
9262a72 [Shuo Xiang] update
7e07d12 [Shuo Xiang] update
b32f21a [Shuo Xiang] add doc for elastic net in sparkml
937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
This allows Kmeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.
Author: FlytxtRnD <meethu.mathew@flytxt.com>
Closes#6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits:
94b56df [FlytxtRnD] style correction
ef95ee2 [FlytxtRnD] style correction
c446c58 [FlytxtRnD] documentation and numRuns warning change
06d13ef [FlytxtRnD] numRuns corrected
d12336e [FlytxtRnD] numRuns variable modifications
07f8554 [FlytxtRnD] remove setRuns from setIntialModel
e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
242ead1 [FlytxtRnD] corrected == to === in assert
714acb5 [FlytxtRnD] added numRuns
60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed
582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
3f5fc8e [FlytxtRnD] test case modified and one runs condition added
cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018
e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria
6959861 [FlytxtRnD] Accept initial cluster centers in KMeans
The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.".
The spark.kryoserializer.buffer.max.mb is out-of-date in spark 1.4.
Author: zhaishidan <zhaishidan@haizhi.com>
Closes#7393 from stanzhai/master and squashes the following commits:
69729ef [zhaishidan] fix document error about spark.kryoserializer.buffer.max.mb
This contribution is my original work and I license it to the project under it's open source license.
Author: jose.cambronero <jose.cambronero@cloudera.com>
Closes#6994 from josepablocam/master and squashes the following commits:
bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests
2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
e760ebd [jose.cambronero] line length changes to fit style check
3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
1226b30 [jose.cambronero] reindent multi-line lambdas, prior intepretation of style guide was wrong on my part
9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
b9cff3a [jose.cambronero] made small changes to pass style check
ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
4da189b [jose.cambronero] added user facing ks test functions
c659ea1 [jose.cambronero] created KS test class
13dfe4d [jose.cambronero] created test result class for ks test
pwendell and I discussed this a little more offline and concluded that it would be good to keep it more conservative. Losing cached blocks may be very expensive and we should only allow it if the user knows what he/she is doing.
FYI harishreedharan sryza.
Author: Andrew Or <andrew@databricks.com>
Closes#7329 from andrewor14/da-cached-timeout and squashes the following commits:
cef0b4e [Andrew Or] Change timeout to infinity
Runs for *all* existing keys and returning "None" will remove the key-value pair.
Author: Michael Vogiatzis <michaelvogiatzis@gmail.com>
Closes#7229 from mvogiatzis/patch-1 and squashes the following commits:
e7a2946 [Michael Vogiatzis] Updated updateStateByKey text
00283ed [Michael Vogiatzis] Removed space
c2656f9 [Michael Vogiatzis] Moved description farther up
0a42551 [Michael Vogiatzis] Added important updateStateByKey details
A couple descriptions were not inside `<td></td>` and were being displayed immediately under the section title instead of in their row.
Author: Jonathan Alter <jonalter@users.noreply.github.com>
Closes#7292 from jonalter/docs-config and squashes the following commits:
5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
…ng-guide#Manually Specifying Options to be in sync with java,python, R version
Author: Alok Singh <“singhal@us.ibm.com”>
Closes#7299 from aloknsingh/aloknsingh_SPARK-8909 and squashes the following commits:
d3c20ba [Alok Singh] fix the file to .parquet from .json
d476140 [Alok Singh] [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version
Add documentation for NGram feature transformer.
Author: Feynman Liang <fliang@databricks.com>
Closes#7244 from feynmanliang/SPARK-8457 and squashes the following commits:
5aface9 [Feynman Liang] Pretty print Scala output and add API doc to each codetab
60d5ac0 [Feynman Liang] Inline API doc and fix indentation
736ccbc [Feynman Liang] NGram feature transformer documentation
cc pwendell
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#7293 from shivaram/sparkr-packages-doc and squashes the following commits:
c91471d [Shivaram Venkataraman] Fix sparkPackages in init documentation
Author: Sun Rui <rui.sun@intel.com>
Closes#7287 from sun-rui/SPARK-8894 and squashes the following commits:
da63898 [Sun Rui] [SPARK-8894][SPARKR][DOC] Example code errors in SparkR documentation.
Fixed comment given by rxin
Author: Tijo Thomas <tijoparacka@gmail.com>
Closes#7281 from tijoparacka/modification_for_python_style and squashes the following commits:
6334e21 [Tijo Thomas] removed space
3de4cd8 [Tijo Thomas] python Style update
Updated MLlib Data Types Local Matrix section to include information on sparse matrices, added sparse matrix examples to the Scala and Java examples, and added Python examples for both dense and sparse matrices.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#6958 from dusenberrymw/Improve_MLlib_Local_Matrix_Documentation and squashes the following commits:
ceae407 [Mike Dusenberry] Updated MLlib Data Types Local Matrix section to include information on sparse matrices, added sparse matrix examples to the Scala and Java examples, and added Python examples for both dense and sparse matrices.
See the jira https://issues.apache.org/jira/browse/SPARK-5562
Author: Alok Singh <singhal@Aloks-MacBook-Pro.local>
Author: Alok Singh <singhal@aloks-mbp.usca.ibm.com>
Author: Alok Singh <“singhal@us.ibm.com”>
Closes#7064 from aloknsingh/aloknsingh_SPARK-5562 and squashes the following commits:
259a0a7 [Alok Singh] change as per the comments by @jkbradley
be48491 [Alok Singh] [SPARK-5562][MLlib] re-order import in alphabhetical order
c01311b [Alok Singh] [SPARK-5562][MLlib] fix the newline typo
b271c8a [Alok Singh] [SPARK-5562][Mllib] As per github discussion with jkbradley. We would like to simply things.
7c06251 [Alok Singh] [SPARK-5562][MLlib] modified the JavaLDASuite for test passing
c710cb6 [Alok Singh] fix the scala code style to have space after :
2572a08 [Alok Singh] [SPARK-5562][MLlib] change the import xyz._ to the import xyz.{c1, c2} ..
ab55fbf [Alok Singh] [SPARK-5562][MLlib] Change as per Sean Owen's comments https://github.com/apache/spark/pull/7064/files#diff-9236d23975e6f5a5608ffc81dfd79146
9f4f9ea [Alok Singh] [SPARK-5562][MLlib] LDA should handle empty document.
Currently, the mesos scheduler only looks at the 'cpu' and 'mem' resources when trying to determine the usablility of a resource offer from a mesos slave node. It may be preferable for the user to be able to ensure that the spark jobs are only started on a certain set of nodes (based on attributes).
For example, If the user sets a property, let's say `spark.mesos.constraints` is set to `tachyon=true;us-east-1=false`, then the resource offers will be checked to see if they meet both these constraints and only then will be accepted to start new executors.
Author: Ankur Chauhan <achauhan@brightcove.com>
Closes#5563 from ankurcha/mesos_attribs and squashes the following commits:
902535b [Ankur Chauhan] Fix line length
d83801c [Ankur Chauhan] Update code as per code review comments
8b73f2d [Ankur Chauhan] Fix imports
c3523e7 [Ankur Chauhan] Added docs
1a24d0b [Ankur Chauhan] Expand scope of attributes matching to include all data types
482fd71 [Ankur Chauhan] Update access modifier to private[this] for offer constraints
5ccc32d [Ankur Chauhan] Fix nit pick whitespace
1bce782 [Ankur Chauhan] Fix nit pick whitespace
c0cbc75 [Ankur Chauhan] Use offer id value for debug message
7fee0ea [Ankur Chauhan] Add debug statements
fc7eb5b [Ankur Chauhan] Fix import codestyle
00be252 [Ankur Chauhan] Style changes as per code review comments
662535f [Ankur Chauhan] Incorporate code review comments + use SparkFunSuite
fdc0937 [Ankur Chauhan] Decline offers that did not meet criteria
67b58a0 [Ankur Chauhan] Add documentation for spark.mesos.constraints
63f53f4 [Ankur Chauhan] Update codestyle - uniform style for config values
02031e4 [Ankur Chauhan] Fix scalastyle warnings in tests
c09ed84 [Ankur Chauhan] Fixed the access modifier on offerConstraints val to private[mesos]
0c64df6 [Ankur Chauhan] Rename overhead fractions to memory_*, fix spacing
8cc1e8f [Ankur Chauhan] Make exception message more explicit about the source of the error
addedba [Ankur Chauhan] Added test case for malformed constraint string
ec9d9a6 [Ankur Chauhan] Add tests for parse constraint string
72fe88a [Ankur Chauhan] Fix up tests + remove redundant method override, combine utility class into new mesos scheduler util trait
92b47fd [Ankur Chauhan] Add attributes based constraints support to MesosScheduler
Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags
Author: Deron Eriksson <deron@us.ibm.com>
Closes#7169 from deroneriksson/SPARK-1564_JavaDocs_badges and squashes the following commits:
a8353db [Deron Eriksson] added license headers to api-docs.css and api-javadocs.css
07feb07 [Deron Eriksson] added linebreaks to make jquery more readable when adding html badge tags
65b4930 [Deron Eriksson] Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags
Add Python user guide for PowerIterationClustering
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7155 from yanboliang/spark-8758 and squashes the following commits:
18d803b [Yanbo Liang] address comments
dd29577 [Yanbo Liang] Add Python user guide for PowerIterationClustering
I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place.
Please let me know if I've missed anything.
Will the spark-shell use the value within the command line builder during instantiation?
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes#7132 from ilganeli/SPARK-3071 and squashes the following commits:
4074164 [Ilya Ganelin] String fix
271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071
273b6e9 [Ilya Ganelin] Test fix
fd67721 [Ilya Ganelin] Update JavaUtils.java
26cc177 [Ilya Ganelin] test fix
e5db35d [Ilya Ganelin] Fixed test failure
39732a1 [Ilya Ganelin] merge fix
a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each
09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala
19b6f25 [Ilya Ganelin] Missed one doc update
2698a3d [Ilya Ganelin] Updated default value for driver memory
Author: zsxwing <zsxwing@gmail.com>
Closes#6830 from zsxwing/flume-python and squashes the following commits:
78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
jira: https://issues.apache.org/jira/browse/SPARK-8308
1. add some missing save/load in python examples. , LogisticRegression, LinearRegression and NaiveBayes
2. tune down iterations for MatrixFactorization, since current number will trigger StackOverflow for default java configuration (>1M)
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#6760 from hhbyyh/docUpdate and squashes the following commits:
9bd3383 [Yuhao Yang] update scala example
8a44692 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
077cbb8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
3e948dc [Yuhao Yang] add missing save load for python example
Author: sethah <seth.hendrickson16@gmail.com>
Closes#7029 from sethah/working_on_SPARK-7739 and squashes the following commits:
ef96916 [sethah] Fixing some style issues
efea1f8 [sethah] adding clarification to ChiSqSelector example
Modified the deprecated jdbc api in the documentation.
Author: Tijo Thomas <tijoparacka@gmail.com>
Closes#7039 from tijoparacka/JIRA_8615 and squashes the following commits:
6e73b8a [Tijo Thomas] Reverted new lines
4042fcf [Tijo Thomas] updated to sql documentation
a27949c [Tijo Thomas] Fixed Sample deprecated code
Python bindings for StreamingLinearRegressionWithSGD
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6744 from MechCoder/spark-4127 and squashes the following commits:
d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
1b4ddd6 [MechCoder] minor
4de6c68 [MechCoder] Minor refactor
5e85a3b [MechCoder] Add tests for simultaneous training and prediction
fb27889 [MechCoder] Add example and docs
505380b [MechCoder] Add tests
d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
As per the description in the JIRA, I moved the contents of the page and added a few additional content.
Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
Closes#6924 from nssalian/SPARK-3629 and squashes the following commits:
944b7a0 [Neelesh Srinivas Salian] Changed the lines about deploy-mode and added backticks to all parameters
40dbc0b [Neelesh Srinivas Salian] Changed dfs to HDFS, deploy-mode in backticks and updated the master yarn line
9cbc072 [Neelesh Srinivas Salian] Updated a few lines in the Launching Spark on YARN Section
8e8db7f [Neelesh Srinivas Salian] Removed the changes in this commit to help clearly distinguish movement from update
151c298 [Neelesh Srinivas Salian] SPARK-3629: Improvement of the Spark on YARN document
Ticket: [SPARK-8639](https://issues.apache.org/jira/browse/SPARK-8639)
fixed minor typos in docs/README.md and docs/api.md
Author: Rosstin <asterazul@gmail.com>
Closes#7046 from Rosstin/SPARK-8639 and squashes the following commits:
6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
Some users have Hadoop installations on different paths across
their cluster. Currently, that makes it hard to set up some
configuration in Spark since that requires hardcoding paths to
jar files or native libraries, which wouldn't work on such a cluster.
This change introduces a couple of YARN-specific configurations
that instruct the backend to replace certain paths when launching
remote processes. That way, if the configuration says the Spark
jar is in "/spark/spark.jar", and also says that "/spark" should be
replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers
in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location
of the jar.
Coupled with YARN's environment whitelist (which allows certain
env variables to be exposed to containers), this allows users to
support such heterogeneous environments, as long as a single
replacement is enough. (Otherwise, this feature would need to be
extended to support multiple path replacements.)
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#6752 from vanzin/SPARK-8302 and squashes the following commits:
4bff8d4 [Marcelo Vanzin] Add docs, rename configs.
0aa2a02 [Marcelo Vanzin] Only do replacement for paths that need it.
2e9cc9d [Marcelo Vanzin] Style.
a5e1f68 [Marcelo Vanzin] [SPARK-8302] Support heterogeneous cluster install paths on YARN.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#6928 from holdenk/SPARK-8506-sparkr-does-not-provide-an-easy-way-to-depend-on-spark-packages-when-performing-init-from-inside-of-r and squashes the following commits:
b60dd63 [Holden Karau] Add an example with the spark-csv package
fa8bc92 [Holden Karau] typo: sparm -> spark
865a90c [Holden Karau] strip spaces for comparision
c7a4471 [Holden Karau] Add some documentation
c1a9233 [Holden Karau] refactor for testing
c818556 [Holden Karau] Add pakages to R
This PR only applies to master branch (1.5.0-SNAPSHOT) since it references `org.apache.parquet` classes which only appear in Parquet 1.7.0.
Author: Cheng Lian <lian@databricks.com>
Closes#6683 from liancheng/output-committer-docs and squashes the following commits:
b4648b8 [Cheng Lian] Removes spark.sql.sources.outputCommitterClass as it's not a public option
ee63923 [Cheng Lian] Updates docs and comments of data sources and Parquet output committer options
Reorganized docs a bit. Added migration guides.
**Q**: Do we want to say more for the 1.3 -> 1.4 migration guide for ```spark.ml```? It would be a lot.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#6897 from jkbradley/ml-guide-1.4 and squashes the following commits:
4bf26d6 [Joseph K. Bradley] tiny fix
8085067 [Joseph K. Bradley] fixed spacing/layout issues in ml guide from previous commit in this PR
6cd5c78 [Joseph K. Bradley] Updated MLlib programming guide for release 1.4
Author: cody koeninger <cody@koeninger.org>
Closes#6863 from koeninger/SPARK-8390 and squashes the following commits:
26a06bd [cody koeninger] Merge branch 'master' into SPARK-8390
3744492 [cody koeninger] [Streaming][Kafka][SPARK-8390] doc changes per TD, test to make sure approach shown in docs actually compiles + runs
b108c9d [cody koeninger] [Streaming][Kafka][SPARK-8390] further doc fixes, clean up spacing
bb4336b [cody koeninger] [Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup
3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
Python bindings for StreamingKMeans
Will change status to MRG once docs, tests and examples are updated.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6499 from MechCoder/spark-4118 and squashes the following commits:
7722d16 [MechCoder] minor style fixes
51052d3 [MechCoder] Doc fixes
2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
81482fd [MechCoder] minor
5d9fe61 [MechCoder] predictOn should take into account the latest model
8ab9e89 [MechCoder] Fix Python3 error
a9817df [MechCoder] Better tests and minor fixes
c80e451 [MechCoder] Add ignore_unicode_prefix
ee8ce16 [MechCoder] Update tests, doc and examples
4b1481f [MechCoder] Some changes and tests
d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
Clarify what may cause long-running Spark apps to preserve shuffle files
Author: Sean Owen <sowen@cloudera.com>
Closes#6901 from srowen/SPARK-5836 and squashes the following commits:
a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
This fixes various minor documentation issues on the Spark SQL page
Author: Lars Francke <lars.francke@gmail.com>
Closes#6890 from lfrancke/SPARK-8462 and squashes the following commits:
dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.
Author: zsxwing <zsxwing@gmail.com>
Closes#6829 from zsxwing/flume-sink-dep and squashes the following commits:
f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc
This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example:
![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png)
This makes it easier for users to link to specific sections of the documentation.
I also removed some dead Javascript which isn't used in our current docs (it was introduced for the old AMPCamp training, but isn't used anymore).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6808 from JoshRosen/SPARK-8353 and squashes the following commits:
e59d8a7 [Josh Rosen] Suppress underline on hover
f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places
a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code
Added python code to https://spark.apache.org/docs/latest/streaming-programming-guide.html
to the Level of Parallelism in Data Receiving section.
Please review and let me know if there are any additional changes that are needed.
Thank you.
Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
Closes#6862 from nssalian/SPARK-8320 and squashes the following commits:
4bfd126 [Neelesh Srinivas Salian] Changed loop structure to be more in line with Python style
e5345de [Neelesh Srinivas Salian] Changes to kafak append, for loop and show to print()
3fc5c6d [Neelesh Srinivas Salian] SPARK-8320
1. Add `SQLConfEntry` to store the information about a configuration. For those configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`.
2. Verify the value when setting a configuration if this is in SQLConf.
3. Use `SET -v` to display all public configurations.
Author: zsxwing <zsxwing@gmail.com>
Closes#6747 from zsxwing/sqlconf and squashes the following commits:
7d09bad [zsxwing] Use SQLConfEntry in HiveContext
49f6213 [zsxwing] Add getConf, setConf to SQLContext and HiveContext
e014f53 [zsxwing] Merge branch 'master' into sqlconf
93dad8e [zsxwing] Fix the unit tests
cf950c1 [zsxwing] Fix the code style and tests
3c5f03e [zsxwing] Add unsetConf(SQLConfEntry) and fix the code style
a2f4add [zsxwing] getConf will return the default value if a config is not set
037b1db [zsxwing] Add schema to SetCommand
0520c3c [zsxwing] Merge branch 'master' into sqlconf
7afb0ec [zsxwing] Fix the configurations about HiveThriftServer
7e728e3 [zsxwing] Add doc for SQLConfEntry and fix 'toString'
5e95b10 [zsxwing] Add enumConf
c6ba76d [zsxwing] setRawString => setConfString, getRawString => getConfString
4abd807 [zsxwing] Fix the test for 'set -v'
6e47e56 [zsxwing] Fix the compilation error
8973ced [zsxwing] Remove floatConf
1fc3a8b [zsxwing] Remove the 'conf' command and use 'set -v' instead
99c9c16 [zsxwing] Fix tests that use SQLConfEntry as a string
88a03cc [zsxwing] Add new lines between confs and return types
ce7c6c8 [zsxwing] Remove seqConf
f3c1b33 [zsxwing] Refactor SQLConf to display better error message
Python API for org.apache.spark.mllib.feature.ElementwiseProduct
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6346 from MechCoder/spark-7605 and squashes the following commits:
79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
start-slave.sh no longer takes a worker # param in 1.4+
Author: Sean Owen <sowen@cloudera.com>
Closes#6855 from srowen/SPARK-8395 and squashes the following commits:
300278e [Sean Owen] start-slave.sh no longer takes a worker # param in 1.4+