## What changes were proposed in this pull request?
Put Kafka 0.8 support behind a kafka-0-8 profile.
## How was this patch tested?
Existing tests, but, until PR builder and Jenkins configs are updated the effect here is to not build or test Kafka 0.8 support at all.
Author: Sean Owen <sowen@cloudera.com>
Closes#19134 from srowen/SPARK-21893.
## What changes were proposed in this pull request?
This PR exposes Netty memory usage for Spark's `TransportClientFactory` and `TransportServer`, including the details of each direct arena and heap arena metrics, as well as aggregated metrics. The purpose of adding the Netty metrics is to better know the memory usage of Netty in Spark shuffle, rpc and others network communications, and guide us to better configure the memory size of executors.
This PR doesn't expose these metrics to any sink, to leverage this feature, still requires to connect to either MetricsSystem or collect them back to Driver to display.
## How was this patch tested?
Add Unit test to verify it, also manually verified in real cluster.
Author: jerryshao <sshao@hortonworks.com>
Closes#18935 from jerryshao/SPARK-9104.
## What changes were proposed in this pull request?
There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below:
```scala
val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
df.show()
```
**Before**
```
java.lang.NullPointerException
at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
...
```
**After**
```
+---+----+--------+
| a| b|unparsed|
+---+----+--------+
| a|null| a|
+---+----+--------+
```
It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this.
## How was this patch tested?
Unit test added in `CSVSuite.scala`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#19113 from HyukjinKwon/bump-up-univocity.
…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
## What changes were proposed in this pull request?
This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
- Scalatest 2.x -> 3.0.3
- Chill 0.8.0 -> 0.8.4
- Clapper 1.0.x -> 1.1.2
- json4s 3.2.x -> 3.4.2
- Jackson 2.6.x -> 2.7.9 (required by json4s)
This change does _not_ fully enable a Scala 2.12 build:
- It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
- It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
## How was this patch tested?
Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
Author: Sean Owen <sowen@cloudera.com>
Closes#18645 from srowen/SPARK-14280.
Mesos has secrets primitives for environment and file-based secrets, this PR adds that functionality to the Spark dispatcher and the appropriate configuration flags.
Unit tested and manually tested against a DC/OS cluster with Mesos 1.4.
Author: ArtRand <arand@soe.ucsc.edu>
Closes#18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
## What changes were proposed in this pull request?
This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
```sql
SELECT *
FROM RANGE(1000)
WHERE
TRUE
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
```
This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#19042 from hvanhovell/SPARK-21830.
## What changes were proposed in this pull request?
Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits for Apache ORC 1.4.
- Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more.
- Maintainability: Reduce the Hive dependency and can remove old legacy code later.
Later, we can get the following two key benefits by adding new ORCFileFormat in SPARK-20728 (#17980), too.
- Usability: User can use ORC data sources without hive module, i.e, -Phive.
- Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This will be faster than the current implementation in Spark.
## How was this patch tested?
Pass the jenkins.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#18640 from dongjoon-hyun/SPARK-21422.
## What changes were proposed in this pull request?
Update sbt version to 0.13.16. I think this is a useful stepping stone to getting to sbt 1.0.0.
## How was this patch tested?
Existing Build.
Author: pj.fanning <pj.fanning@workday.com>
Closes#18921 from pjfanning/SPARK-21709.
## What changes were proposed in this pull request?
This is trivial, but bugged me. We should download software over HTTPS.
And we can use RAT 0.12 while at it to pick up bug fixes.
## How was this patch tested?
N/A
Author: Sean Owen <sowen@cloudera.com>
Closes#18927 from srowen/Rat012.
## What changes were proposed in this pull request?
This pr updated `lz4-java` to the latest (v1.4.0) and removed custom `LZ4BlockInputStream`. We currently use custom `LZ4BlockInputStream` to read concatenated byte stream in shuffle. But, this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105). So, we might update the latest to remove the custom `LZ4BlockInputStream`.
Major diffs between the latest release and v1.3.0 in the master are as follows (62f7547abb...6d4693f562);
- fixed NPE in XXHashFactory similarly
- Don't place resources in default package to support shading
- Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
- Try to load lz4-java from java.library.path, then fallback to bundled
- Add ppc64le binary
- Add s390x JNI binding
- Add basic LZ4 Frame v1.5.0 support
- enable aarch64 support for lz4-java
- Allow unsafeInstance() for ppc64le archiecture
- Add unsafeInstance support for AArch64
- Support 64-bit JNI build on Solaris
- Avoid over-allocating a buffer
- Allow EndMark to be incompressible for LZ4FrameInputStream.
- Concat byte stream
## How was this patch tested?
Existing tests.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18883 from maropu/SPARK-21276.
## What changes were proposed in this pull request?
Update breeze to 0.13.1 for an emergency bugfix in strong wolfe line search
https://github.com/scalanlp/breeze/pull/651
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#18797 from WeichenXu123/update-breeze.
## What changes were proposed in this pull request?
Taking over https://github.com/apache/spark/pull/18789 ; Closes#18789
Update Jackson to 2.6.7 uniformly, and some components to 2.6.7.1, to get some fixes and prep for Scala 2.12
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#18881 from srowen/SPARK-20433.
## What changes were proposed in this pull request?
This PR proposes to use https://rversions.r-pkg.org/r-release-win instead of https://rversions.r-pkg.org/r-release to check R's version for Windows correctly.
We met a syncing problem with Windows release (see #15709) before. To cut this short, it was ...
- 3.3.2 release was released but not for Windows for few hours.
- `https://rversions.r-pkg.org/r-release` returns the latest as 3.3.2 and the download link for 3.3.1 becomes `windows/base/old` by our script
- 3.3.2 release for WIndows yet
- 3.3.1 is still not in `windows/base/old` but `windows/base` as the latest
- Failed to download with `windows/base/old` link and builds were broken
I believe this problem is not only what we met. Please see 01ce943929 and also this `r-release-win` API came out between 3.3.1 and 3.3.2 (assuming to deal with this issue), please see `https://github.com/metacran/rversions.app/issues/2`.
Using this API will prevent the problem although it looks quite rare assuming from the commit logs in https://github.com/metacran/rversions.app/commits/master. After 3.3.2, both `r-release-win` and `r-release` are being updated together.
## How was this patch tested?
AppVeyor tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18859 from HyukjinKwon/use-reliable-link.
## What changes were proposed in this pull request?
R version update
## How was this patch tested?
AppVeyor
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#18856 from felixcheung/rappveyorver.
## What changes were proposed in this pull request?
This PR proposes to fix few rather typos in `merge_spark_pr.py`.
- `# usage: ./apache-pr-merge.py (see config env vars below)`
-> `# usage: ./merge_spark_pr.py (see config env vars below)`
- `... have local a Spark ...` -> `... have a local Spark ...`
- `... to Apache.` -> `... to Apache Spark.`
I skimmed this file and these look all I could find.
## How was this patch tested?
pep8 check (`./dev/lint-python`).
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18776 from HyukjinKwon/minor-merge-script.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. Data types except complex, date, timestamp, and decimal are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly.
Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes#18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
## What changes were proposed in this pull request?
This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.
**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
| 17.1335742042|
+--------------------+
```
**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
| 17.133574204226083|
+-------------------------+
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#18546 from dongjoon-hyun/SPARK-21278.
## What changes were proposed in this pull request?
Recently, Jenkins tests were unstable due to unknown reasons as below:
```
/home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was terminated by signal 9
test_result_code, test_result_note = run_tests(tests_timeout)
File "./dev/run-tests-jenkins.py", line 140, in run_tests
test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
KeyError: -9
```
```
Traceback (most recent call last):
File "./dev/run-tests-jenkins.py", line 226, in <module>
main()
File "./dev/run-tests-jenkins.py", line 213, in main
test_result_code, test_result_note = run_tests(tests_timeout)
File "./dev/run-tests-jenkins.py", line 140, in run_tests
test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
KeyError: -10
```
This exception looks causing failing to update the comments in the PR. For example:
![2017-06-23 4 19 41](https://user-images.githubusercontent.com/6477701/27470626-d035ecd8-582f-11e7-883e-0ae6941659b7.png)
![2017-06-23 4 19 50](https://user-images.githubusercontent.com/6477701/27470629-d11ba782-582f-11e7-97e0-64d28cbc19aa.png)
these comment just remain.
This always requires, for both reviewers and the author, a overhead to click and check the logs, which I believe are not really useful.
This PR proposes to leave the code in the PR comment messages and let update the comments.
## How was this patch tested?
Jenkins tests below, I manually gave the error code to test this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18399 from HyukjinKwon/jenkins-print-errors.
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly.
Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes#15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
## What changes were proposed in this pull request?
Fix some typo of the document.
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Xianyang Liu <xianyang.liu@intel.com>
Closes#18350 from ConeyLiu/fixtypo.
## What changes were proposed in this pull request?
Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos), may use it. In order to avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private. In order to provider backward compatiblity, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.
Summary:
- Move registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`. Moved them into a new, private hierarchy under `HadoopDelegationTokenProvider`. Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`), instead of service loading, which means that users are not able to implement their own delegation token providers, as they are in the `spark-yarn` module.
- The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token provider implementations. Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager`, as well as service loads them from `yarn.security.ServiceCredentialProvider`.
Old Hierarchy:
```
yarn.security.ServiceCredentialProvider (service loaded)
HadoopFSCredentialProvider
HiveCredentialProvider
HBaseCredentialProvider
yarn.security.ConfigurableCredentialManager
```
New Hierarchy:
```
HadoopDelegationTokenManager
HadoopDelegationTokenProvider (not service loaded)
HadoopFSDelegationTokenProvider
HiveDelegationTokenProvider
HBaseDelegationTokenProvider
yarn.security.ServiceCredentialProvider (service loaded)
yarn.security.YARNHadoopDelegationTokenManager
```
## How was this patch tested?
unit tests
Author: Michael Gummelt <mgummelt@mesosphere.io>
Author: Dr. Stefan Schimanski <sttts@mesosphere.io>
Closes#17723 from mgummelt/SPARK-20434-refactor-kerberos.
## What changes were proposed in this pull request?
Update hadoop-2.7 profile's curator version to 2.7.1, more see [SPARK-13933](https://issues.apache.org/jira/browse/SPARK-13933).
## How was this patch tested?
manual tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18247 from wangyum/SPARK-13933.
## What changes were proposed in this pull request?
REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18191 from cloud-fan/test.
## What changes were proposed in this pull request?
Currently, if we run `./python/run-tests.py` and they are aborted without cleaning up this directory, it fails pep8 check due to some Python scripts generated. For example, 7387126f83/python/pyspark/tests.py (L1955-L1968)
```
PEP8 checks failed.
./work/app-20170531190857-0000/0/test.py:5:55: W292 no newline at end of file
./work/app-20170531190909-0000/0/test.py:5:55: W292 no newline at end of file
./work/app-20170531190924-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1
./work/app-20170531190924-0000/0/test.py:7:52: W292 no newline at end of file
./work/app-20170531191016-0000/0/test.py:5:55: W292 no newline at end of file
./work/app-20170531191030-0000/0/test.py:5:55: W292 no newline at end of file
./work/app-20170531191045-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1
./work/app-20170531191045-0000/0/test.py:7:52: W292 no newline at end of file
```
For me, it is sometimes a bit annoying. This PR proposes to exclude these (assuming we want to skip per https://github.com/apache/spark/blob/master/.gitignore#L73).
Also, it moves other pep8 configurations in the script into ini configuration file in pep8.
## How was this patch tested?
Manually tested via `./dev/lint-python`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18161 from HyukjinKwon/work-exclude-pep8.
## What changes were proposed in this pull request?
This PR proposes to fix the lint-breaks as below:
```
[ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream.
```
after:
```
dev/lint-java
Checkstyle checks passed.
```
[Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169)
## How was this patch tested?
Travis CI
Author: Xianyang Liu <xianyang.liu@intel.com>
Closes#17890 from ConeyLiu/codestyle.
## What changes were proposed in this pull request?
Drop the hadoop distirbution name from the Python version (PEP440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP0440 states that local versions should not be used when publishing up-stream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions we can look at make different packages or similar.
## How was this patch tested?
Ran `make-distribution` locally
Author: Holden Karau <holden@us.ibm.com>
Closes#17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
## What changes were proposed in this pull request?
Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17803 from srowen/SPARK-20523.
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17746 from yanboliang/spark-20449.
## What changes were proposed in this pull request?
This PR proposes two things as below:
- Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build
Due to a different dependency resolution in SBT & Unidoc by an unknown reason, the documentation build fails on a specific machine & environment in Jenkins but it was unable to reproduce.
So, this PR just checks an environment variable `AMPLAB_JENKINS_BUILD_PROFILE` that is set in Hadoop 2.6 SBT build against branches on Jenkins, and then disables Unidoc build. **Note that PR builder will still build it with Hadoop 2.6 & SBT.**
```
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
...
```
I checked the environment variables from the logs (first bit) as below:
- **spark-master-test-sbt-hadoop-2.6** (this one is being failed) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
SPARK_BRANCH=master
AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6 <- I use this variable
AMPLAB_JENKINS="true"
```
- spark-master-test-sbt-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
SPARK_BRANCH=master
AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.7
AMPLAB_JENKINS="true"
```
- spark-master-test-maven-hadoop-2.6 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
HADOOP_PROFILE=hadoop-2.6
HADOOP_VERSION=
SPARK_BRANCH=master
AMPLAB_JENKINS="true"
```
- spark-master-test-maven-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
HADOOP_PROFILE=hadoop-2.7
HADOOP_VERSION=
SPARK_BRANCH=master
AMPLAB_JENKINS="true"
```
- PR builder - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75843/consoleFull
```
JENKINS_MASTER_HOSTNAME=amp-jenkins-master
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
```
Assuming from other logs in branch-2.1
- SBT & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
SPARK_BRANCH=branch-2.1
AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6
AMPLAB_JENKINS="true"
```
- Maven & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-maven-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
HADOOP_PROFILE=hadoop-2.6
HADOOP_VERSION=
SPARK_BRANCH=branch-2.1
AMPLAB_JENKINS="true"
```
We have been using the same convention for those variables. These are actually being used in `run-tests.py` script - here https://github.com/apache/spark/blob/master/dev/run-tests.py#L519-L520
- Revert the previous try
After https://github.com/apache/spark/pull/17651, it seems the build still fails on SBT Hadoop 2.6 master.
I am unable to reproduce this - https://github.com/apache/spark/pull/17477#issuecomment-294094092 and the reviewer was too. So, this got merged as it looks the only way to verify this is to merge it currently (as no one seems able to reproduce this).
## How was this patch tested?
I only checked `is_hadoop_version_2_6 = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6"` is working fine as expected as below:
```python
>>> import collections
>>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.6"})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
False
>>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.7"})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
True
>>> os = collections.namedtuple('os', 'environ')(environ={})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
True
```
I tried many ways but I was unable to reproduce this in my local. Sean also tried the way I did but he was also unable to reproduce this.
Please refer the comments in https://github.com/apache/spark/pull/17477#issuecomment-294094092
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17669 from HyukjinKwon/revert-SPARK-20343.
## What changes were proposed in this pull request?
This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.
There are several problems with it:
- It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
(see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
To complete this automated build, It also suggests to fix existing Javadoc breaks / ones introduced by test codes as described above.
There fixes are similar instances that previously fixed. Please refer https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
## How was this patch tested?
Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
This was tested via manually adding `time.time()` as below:
```diff
profiles_and_goals = build_profiles + sbt_goals
print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
" ".join(profiles_and_goals))
+ import time
+ st = time.time()
exec_sbt(profiles_and_goals)
+ print("Elapsed :[%s]" % str(time.time() - st))
```
produces
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17477 from HyukjinKwon/SPARK-18692.
## What changes were proposed in this pull request?
Added `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python3). Grepped for all occurrences of `.message` in `pyspark/` and these were the only occurrences.
## How was this patch tested?
- Doctests for helper function
## Legal
This is my original work and I license the work to the project under the project’s open source license.
Author: David Gingrich <david@textio.com>
Closes#16845 from dgingrich/topic-spark-19505-py3-exceptions.
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123
## What changes were proposed in this pull request?
If $SPARK_HOME or $FWDIR variable contains spaces, then use "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" build spark will failed.
## How was this patch tested?
manual tests
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes#17452 from zuotingbing/spark-bulid.
## What changes were proposed in this pull request?
Allow Jenkins Python tests to use the installed conda to test Python 2.7 support & test pip installability.
## How was this patch tested?
Updated shell scripts, ran tests locally with installed conda, ran tests in Jenkins.
Author: Holden Karau <holden@us.ibm.com>
Closes#17355 from holdenk/SPARK-19955-support-python-tests-with-conda.
## What changes were proposed in this pull request?
A pyspark wrapper for spark.ml.stat.ChiSquareTest
## How was this patch tested?
unit tests
doctests
Author: Bago Amirbekian <bago@databricks.com>
Closes#17421 from MrBago/chiSquareTestWrapper.
## What changes were proposed in this pull request?
The master snapshot publisher builds are currently broken due to two minor build issues:
1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
## How was this patch tested?
The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#17437 from JoshRosen/spark-20102.
## What changes were proposed in this pull request?
- Add `HasSupport` and `HasConfidence` `Params`.
- Add new module `pyspark.ml.fpm`.
- Add `FPGrowth` / `FPGrowthModel` wrappers.
- Provide tests for new features.
## How was this patch tested?
Unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17218 from zero323/SPARK-19281.
## What changes were proposed in this pull request?
Fixed a typo in `dev/make-distribution.sh` script that sets the MAVEN_OPTS variable, introduced [here](https://github.com/apache/spark/commit/0e24054#diff-ba2c046d92a1d2b5b417788bfb5cb5f8R149).
## How was this patch tested?
Run `dev/make-distribution.sh` manually.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes#16984 from lins05/fix-spark-make-distribution-after-removing-java7.
## What changes were proposed in this pull request?
This patch fixes a bug in `KafkaSource` with the (de)serialization of the length of the JSON string that contains the initial partition offsets.
## How was this patch tested?
I ran the test suite for spark-sql-kafka-0-10.
Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>
Closes#16857 from vitillo/kafka_source_fix.
## What changes were proposed in this pull request?
Use JAVA_HOME/bin/java if JAVA_HOME is set in dev/mima script to run MiMa
This follows on https://github.com/apache/spark/pull/16871 -- it's a slightly separate issue, but, is currently causing a build failure.
## How was this patch tested?
Manually tested.
Author: Sean Owen <sowen@cloudera.com>
Closes#16957 from srowen/SPARK-19550.2.
- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings
For the future:
- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#16871 from srowen/SPARK-19493.
## What changes were proposed in this pull request?
After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3.
**BEFORE**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7']
Attempting to post to Github...
> Post successful.
```
**AFTER**
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
```
========================================================================
Building Spark
========================================================================
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
```
## How was this patch tested?
Pass the existing test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16858 from dongjoon-hyun/hotfix_run-tests.
## What changes were proposed in this pull request?
- Remove support for Hadoop 2.5 and earlier
- Remove reflection and code constructs only needed to support multiple versions at once
- Update docs to reflect newer versions
- Remove older versions' builds and profiles.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#16810 from srowen/SPARK-19464.
## What changes were proposed in this pull request?
According to the discussion on #16281 which tried to upgrade toward Apache Parquet 1.9.0, Apache Spark community prefer to upgrade to 1.8.2 instead of 1.9.0. Now, Apache Parquet 1.8.2 is released officially last week on 26 Jan. We can use 1.8.2 now.
https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142%3Cdev.parquet.apache.org%3E
This PR only aims to bump Parquet version to 1.8.2. It didn't touch any other codes.
## How was this patch tested?
Pass the existing tests and also manually by doing `./dev/test-dependencies.sh`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16751 from dongjoon-hyun/SPARK-19409.
## What changes were proposed in this pull request?
Fix instalation of mllib and ml sub components, and more eagerly cleanup cache files during test script & make-distribution.
## How was this patch tested?
Updated sanity test script to import mllib and ml sub-components.
Author: Holden Karau <holden@us.ibm.com>
Closes#16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.
Builds on top of work in SPARK-8425 to update Application Level Blacklisting in the scheduler.
## What changes were proposed in this pull request?
Adds a UI to these patches by:
- defining new listener events for blacklisting and unblacklisting, nodes and executors;
- sending said events at the relevant points in BlacklistTracker;
- adding JSON (de)serialization code for these events;
- augmenting the Executors UI page to show which, and how many, executors are blacklisted;
- adding a unit test to make sure events are being fired;
- adding HistoryServerSuite coverage to verify that the SHS reads these events correctly.
- updates the Executor UI to show Blacklisted/Active/Dead as a tri-state in Executors Status
Updates .rat-excludes to pass tests.
username squito
## How was this patch tested?
./dev/run-tests
testOnly org.apache.spark.util.JsonProtocolSuite
testOnly org.apache.spark.scheduler.BlacklistTrackerSuite
testOnly org.apache.spark.deploy.history.HistoryServerSuite
https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh
![blacklist-20161219](https://cloud.githubusercontent.com/assets/1208477/21335321/9eda320a-c623-11e6-8b8c-9c912a73c276.jpg)
Author: José Hiram Soltren <jose@cloudera.com>
Closes#16346 from jsoltren/SPARK-16654-submit.
**What changes were proposed in this pull request?**
Use Hadoop 2.6.5 for the Hadoop 2.6 profile, I see a bunch of fixes including security ones in the release notes that we should pick up
**How was this patch tested?**
Running the unit tests now with IBM's SDK for Java and let's see what happens with OpenJDK in the community builder - expecting no trouble as it is only a minor release.
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#16616 from a-roberts/Hadoop265Bumper.
## What changes were proposed in this pull request?
Refactored script to remove duplications and clearer purpose for each script
## How was this patch tested?
manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16249 from felixcheung/rscripts.
## What changes were proposed in this pull request?
Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16568 from zsxwing/SPARK-18971.
## What changes were proposed in this pull request?
It seems Hadoop libraries need winutils binaries for native libraries in the path.
It is not a problem in tests for now because we are only testing SparkR on Windows via AppVeyor but it can be a problem if we run Scala tests via AppVeyor as below:
```
- SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds)
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
...
```
This PR proposes to add it to the `Path` for AppVeyor tests.
## How was this patch tested?
Manually via AppVeyor.
**Before**
https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m
**After**
https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#16584 from HyukjinKwon/set-path-appveyor.
## What changes were proposed in this pull request?
Updates to libthrift 0.9.3 to address a CVE.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#16530 from srowen/SPARK-18997.
## What changes were proposed in this pull request?
This PR updates dev/create-release/known_translations to add more contributor name mapping. It also fixes a small issue in translate-contributors.py
## How was this patch tested?
manually tested
Author: Yin Huai <yhuai@databricks.com>
Closes#16423 from yhuai/contributors.
## What changes were proposed in this pull request?
make-distribution.sh should find JAVA_HOME for Ubuntu, Mac and other non-RHEL systems
## How was this patch tested?
Manually
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16363 from felixcheung/buildjava.
## What changes were proposed in this pull request?
When KafkaSource fails on Kafka errors, we should create a new consumer to retry rather than using the existing broken one because it's possible that the broken one will fail again.
This PR also assigns a new group id to the new created consumer for a possible race condition: the broken consumer cannot talk with the Kafka cluster in `close` but the new consumer can talk to Kafka cluster. I'm not sure if this will happen or not. Just for safety to avoid that the Kafka cluster thinks there are two consumers with the same group id in a short time window. (Note: CachedKafkaConsumer doesn't need this fix since `assign` never uses the group id.)
## How was this patch tested?
In https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70370/console , it ran this flaky test 120 times and all passed.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16282 from zsxwing/kafka-fix.
## What changes were proposed in this pull request?
I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes jackson fail to handle byte array defined in a case class. Then I find https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use com.thoughtworks.paranamer/paranamer 2.6, I suggests that we upgrade paranamer to 2.6.
Author: Yin Huai <yhuai@databricks.com>
Closes#16359 from yhuai/SPARK-18951.
Follow up to ae853e8f3b as `mv` throws an error on the Jenkins machines if source and destinations are the same.
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#16302 from shivaram/sparkr-no-mv-fix.
## What changes were proposed in this pull request?
For release builds the R_PACKAGE_VERSION and VERSION are the same (e.g., 2.1.0). Thus `cp` throws an error which causes the build to fail.
## How was this patch tested?
Manually by executing the following script
```
set -o pipefail
set -e
set -x
touch a
R_PACKAGE_VERSION=2.1.0
VERSION=2.1.0
if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then
cp a a
fi
```
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#16299 from shivaram/sparkr-cp-fix.
## What changes were proposed in this pull request?
Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while.
This PR makes the build script to post the test report page link to GitHub instead. The test report page is way more concise and is usually the first page I'd like to check when investigating a Jenkins build failure.
Note that for builds that a test report is not available (ongoing builds and builds that fail before test execution), the test report link automatically redirects to the build page.
## How was this patch tested?
N/A.
Author: Cheng Lian <lian@databricks.com>
Closes#16163 from liancheng/jenkins-test-report.
Fix SparkR package copy regex. The existing code leads to
```
Copying release tarballs to /home/****/public_html/spark-nightly/spark-branch-2.1-bin/spark-2.1.1-SNAPSHOT-2016_12_08_22_38-e8f351f-bin
mput: SparkR-*: no files found
```
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#16231 from shivaram/typo-sparkr-build.
## What changes were proposed in this pull request?
Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822)
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16227 from felixcheung/pyrftp.
This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#16226 from shivaram/fix-sparkr-copy-build.
## What changes were proposed in this pull request?
Fixes name of R source package so that the `cp` in release-build.sh works correctly.
Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#16221 from shivaram/fix-sparkr-release-build-name.
This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#16218 from shivaram/fix-sparkr-release-build.
## What changes were proposed in this pull request?
This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not)
But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
This PR also includes a few minor fixes.
### more details
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what's going to a CRAN release, which is now run during make-distribution.sh.
1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path
2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation)
3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN
(will skip tests here but tests will need to pass for CRAN release process to success - ideally, during release signoff we should install from the R source package and run tests)
4. `R CMD Install` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1)
(the output of this step is what we package into Spark dist and sparkr.zip)
Alternatively,
R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead.
But in any case, despite installing the package multiple times this is relatively fast.
Building vignettes takes a while though.
## How was this patch tested?
Manually, CI.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#16014 from felixcheung/rdist.
## What changes were proposed in this pull request?
To be able to restart StreamingQueries across Spark version, we have already made the logs (offset log, file source log, file sink log) use json. We should added tests with actual json files in the Spark such that any incompatible changes in reading the logs is immediately caught. This PR add tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.
## How was this patch tested?
new unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16128 from tdas/SPARK-18671.
## What changes were proposed in this pull request?
Force update to latest Netty 3.9.x, for dependencies like Flume, to resolve two CVEs. 3.9.2 is the first version that resolves both, and, this is the latest in the 3.9.x line.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#16102 from srowen/SPARK-18586.
## What changes were proposed in this pull request?
We current build 5 separate pip binary tar balls, doubling the release script runtime. It'd be better to build one, especially for use cases that are just using Spark locally. In the long run, it would make more sense to have Hadoop support be pluggable.
## How was this patch tested?
N/A - this is a release build script that doesn't have any automated test coverage. We will know if it goes wrong when we prepare releases.
Author: Reynold Xin <rxin@databricks.com>
Closes#16072 from rxin/SPARK-18639.
## What changes were proposed in this pull request?
org.codehaus.janino:janino depends on org.codehaus.janino:commons-compiler and we have been upgraded to org.codehaus.janino:janino 3.0.0.
However, seems we are still pulling in org.codehaus.janino:commons-compiler 2.7.6 because of calcite. It looks like an accident because we exclude janino from calcite (see here https://github.com/apache/spark/blob/branch-2.1/pom.xml#L1759). So, this PR upgrades org.codehaus.janino:commons-compiler to 3.0.0.
## How was this patch tested?
jenkins
Author: Yin Huai <yhuai@databricks.com>
Closes#16025 from yhuai/janino-commons-compile.
## What changes were proposed in this pull request?
Updates links to the wiki to links to the new location of content on spark.apache.org.
## How was this patch tested?
Doc builds
Author: Sean Owen <sowen@cloudera.com>
Closes#15967 from srowen/SPARK-18073.1.
## What changes were proposed in this pull request?
This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI but that is the natural follow up (SPARK-18129).
Done:
- pip installable on conda [manual tested]
- setup.py installed on a non-pip managed system (RHEL) with YARN [manual tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*
Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone with in the project or should we ask PyPI or should we choose a different name to publish with like ApachePySpark?)
- Windows support and or testing ( SPARK-18136 )
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
- consider how we want to number our dev/snapshot versions
Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs
*I've done some work to test release-build locally but as a non-committer I've just done local testing.
## How was this patch tested?
Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration.
release-build changes tested locally as a non-committer (no testing of upload artifacts to Apache staging websites)
Author: Holden Karau <holden@us.ibm.com>
Author: Juliet Hougland <juliet@cloudera.com>
Author: Juliet Hougland <not@myemail.com>
Closes#15659 from holdenk/SPARK-1267-pip-install-pyspark.
## What changes were proposed in this pull request?
Small fix, fix the errors caused by lint check in Java
- Clear unused objects and `UnusedImports`.
- Add comments around the method `finalize` of `NioBufferedFileInputStream`to turn off checkstyle.
- Cut the line which is longer than 100 characters into two lines.
## How was this patch tested?
Travis CI.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
```
Before:
```
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory.
[ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier.
[ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method.
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR]src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
```
After:
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
Checkstyle checks passed.
```
Author: Xianyang Liu <xyliu0530@icloud.com>
Closes#15865 from ConeyLiu/master.
## What changes were proposed in this pull request?
Fix the flags used to specify the hadoop version
## How was this patch tested?
Manually tested as part of https://github.com/apache/spark/pull/15659 by having the build succeed.
cc joshrosen
Author: Holden Karau <holden@us.ibm.com>
Closes#15860 from holdenk/minor-fix-release-build-script.
## What changes were proposed in this pull request?
One of the important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport netty/netty#5825".
In 4.0.42.Final, `MessageWithHeader` can work properly when `spark.[shuffle|rpc].io.mode` is set to epoll
## How was this patch tested?
Existing tests
Author: Guoqiang Li <witgo@qq.com>
Closes#15830 from witgo/SPARK-18375_netty-4.0.42.
## What changes were proposed in this pull request?
Try excluding org.json:json from hive-exec dep as it's Cat X now. It may be the case that it's not used by the part of Hive Spark uses anyway.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#15798 from srowen/SPARK-18262.
## What changes were proposed in this pull request?
1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark
## How was this patch tested?
Existing doctests & unit tests pass
Author: Jagadeesan <as2@us.ibm.com>
Closes#15514 from jagadeesanas2/SPARK-17960.
## What changes were proposed in this pull request?
`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#15548 from ueshin/issues/SPARK-17985.
This reverts commit bfe7885aee.
The commit caused build failures on Hadoop 2.2 profile:
```
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils
[error] var numBytes = IOUtils.read(gzInputStream, buf)
[error] ^
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils
[error] numBytes = IOUtils.read(gzInputStream, buf)
[error] ^
```
## What changes were proposed in this pull request?
`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#15525 from ueshin/issues/SPARK-17985.
## What changes were proposed in this pull request?
Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL
## How was this patch tested?
Added a unit test which fails with a raised ValueError when using the previous version of Pyrolite 4.9 and Python3
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
## What changes were proposed in this pull request?
This PR adds the Kafka 0.10 subproject to the build infrastructure. This makes sure Kafka 0.10 tests are only triggers when it or of its dependencies change.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#15355 from hvanhovell/SPARK-17782.
## What changes were proposed in this pull request?
This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.
It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
tdas did most of work and part of them was inspired by koeninger's work.
### Introduction
The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:
Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int
The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.
### Configuration
The user can use `DataStreamReader.option` to set the following configurations.
Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets
Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`
### Usage
* Subscribe to 1 topic
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1")
.load()
```
* Subscribe to multiple topics
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1,topic2")
.load()
```
* Subscribe to a pattern
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribePattern", "topic.*")
.load()
```
## How was this patch tested?
The new unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>
Closes#15102 from zsxwing/kafka-source.
## What changes were proposed in this pull request?
This PR sets the R package version while tagging releases. Note that since R doesn't accept `-SNAPSHOT` in version number field, we remove that while setting the next version
## How was this patch tested?
Tested manually by running locally
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#15223 from shivaram/sparkr-version-change.
## What changes were proposed in this pull request?
This PR includes the changes below:
1. Upgrade Univocity library from 2.1.1 to 2.2.1
This includes some performance improvement and also enabling auto-extending buffer in `maxCharsPerColumn` option in CSV. Please refer the [release notes](https://github.com/uniVocity/univocity-parsers/releases).
2. Remove useless `rowSeparator` variable existing in `CSVOptions`
We have this unused variable in [CSVOptions.scala#L127](29952ed096/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (L127)) but it seems possibly causing confusion that it actually does not care of `\r\n`. For example, we have an issue open about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable.
This variable is virtually not being used because we rely on `LineRecordReader` in Hadoop which deals with only both `\n` and `\r\n`.
3. Set the default value of `maxCharsPerColumn` to auto-expending.
We are setting 1000000 for the length of each column. It'd be more sensible we allow auto-expending rather than fixed length by default.
To make sure, using `-1` is being described in the release note, [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0).
## How was this patch tested?
N/A
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15138 from HyukjinKwon/SPARK-17583.
## What changes were proposed in this pull request?
This patch bumps the Hadoop version in hadoop-2.7 profile from 2.7.2 to 2.7.3, which was recently released and contained a number of bug fixes.
## How was this patch tested?
The change should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#15115 from rxin/SPARK-17558.
## What changes were proposed in this pull request?
Upgrade netty-all to latest in the 4.0.x line which is 4.0.41, mentions several bug fixes and performance improvements we may find useful, see netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html. Initially tried to use 4.1.5 but noticed it's not backwards compatible.
## How was this patch tested?
Existing unit tests against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z architectures
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#14961 from a-roberts/netty.
## What changes were proposed in this pull request?
Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)"
## How was this patch tested?
Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian)
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#14958 from a-roberts/master.
## What changes were proposed in this pull request?
Only build PRs with -Pyarn if YARN code was modified.
## How was this patch tested?
Jenkins tests (will look to verify whether -Pyarn was included in the PR builder for this one.)
Author: Sean Owen <sowen@cloudera.com>
Closes#14892 from srowen/SPARK-17329.
## What changes were proposed in this pull request?
add build_profile_flags entry to mesos build module
## How was this patch tested?
unit tests
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14885 from mgummelt/mesos-profile.
This patch is using Apache Commons Crypto library to enable shuffle encryption support.
Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>
Closes#8880 from winningsix/SPARK-10771.
## What changes were proposed in this pull request?
Excludes the `spark-warehouse` directory from the Apache RAT checks that src/run-tests performs. `spark-warehouse` is created by some of the Spark SQL tests, as well as by `bin/spark-sql`.
## How was this patch tested?
Ran src/run-tests twice. The second time, the script failed because the first iteration
Made the change in this PR.
Ran src/run-tests a third time; RAT checks succeeded.
Author: frreiss <frreiss@us.ibm.com>
Closes#14870 from frreiss/fred-17303.
## What changes were proposed in this pull request?
Move Mesos code into a mvn module
## How was this patch tested?
unit tests
manually submitting a client mode and cluster mode job
spark/mesos integration test suite
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14637 from mgummelt/mesos-module.
## What changes were proposed in this pull request?
Update to py4j 0.10.3 to enable JAVA_HOME support
## How was this patch tested?
Pyspark tests
Author: Sean Owen <sowen@cloudera.com>
Closes#14748 from srowen/SPARK-16781.
## What changes were proposed in this pull request?
Add a configurable token manager for Spark on running on yarn.
### Current Problems ###
1. Supported token provider is hard-coded, currently only hdfs, hbase and hive are supported and it is impossible for user to add new token provider without code changes.
2. Also this problem exits in timely token renewer and updater.
### Changes In This Proposal ###
In this proposal, to address the problems mentioned above and make the current code more cleaner and easier to understand, mainly has 3 changes:
1. Abstract a `ServiceTokenProvider` as well as `ServiceTokenRenewable` interface for token provider. Each service wants to communicate with Spark through token way needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the register token providers, also token renewer and updater. Also this class offers the API for other modules to obtain tokens, get renewal interval and so on.
3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load in these built-in token providers is controlled by configuration "spark.yarn.security.tokens.${service}.enabled", by default for all the built-in token providers are loaded.
### Behavior Changes ###
For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).
For user implemented token provider (assume the name of token provider is "test") needs to add into this class should have two configurations:
1. `spark.yarn.security.tokens.test.enabled` to true
2. `spark.yarn.security.tokens.test.class` to the full qualified class name.
So we still keep the same semantics as current code while add one new configuration.
### Current Status ###
- [x] token provider interface and management framework.
- [x] implement built-in token providers (hdfs, hbase, hive).
- [x] Coverage of unit test.
- [x] Integrated test with security cluster.
## How was this patch tested?
Unit test and integrated test.
Please suggest and review, any comment is greatly appreciated.
Author: jerryshao <sshao@hortonworks.com>
Closes#14065 from jerryshao/SPARK-16342.
## What changes were proposed in this pull request?
As of Scala 2.11.x there is no longer a org.scala-lang:jline version aligned to the scala version itself. Scala console now uses the plain jline:jline module. Spark's dependency management did not reflect this change properly, causing Maven to pull in Jline via transitive dependency. Unfortunately Jline 2.12 contained a minor but very annoying bug rendering the shell almost useless for developers with german keyboard layout. This request contains the following chages:
- Exclude transitive dependency 'jline:jline' from hive-exec module
- Remove global properties 'jline.version' and 'jline.groupId'
- Add both properties and dependency to 'scala-2.11' profile
- Add explicit dependency on 'jline:jline' to module 'spark-repl'
## How was this patch tested?
- Running mvn dependency:tree and checking for correct Jline version 2.12.1
- Running full builds with assembly and checking for jline-2.12.1.jar in 'lib' folder of generated tarball
Author: Stefan Schulze <stefan.schulze@pentasys.de>
Closes#14429 from stsc-pentasys/SPARK-16770.
## What changes were proposed in this pull request?
New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)}
This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/
The benefit is losing the dependency on `dockerd`, and all the costs which it incurs.
I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs.
This is blocked on: https://github.com/apache/spark/pull/14167
## How was this patch tested?
- manually testing jobs submitted with both "mesos" and "docker" settings for the new config var.
- spark/mesos integration test suite
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14275 from mgummelt/unified-containerizer.
## What changes were proposed in this pull request?
Version of derby upgraded based on important security info at VersionEye. Test scope added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1
The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs
## How was this patch tested?
Existing tests with the change making sure that we see no new failures. I checked derby 10.12.x and not derby 10.11.x is downloaded to our ~/.m2 folder.
I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present.
I don't know if this would also remove it from the assembly jar in our 1.x branches.
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#14379 from a-roberts/patch-4.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
(`spark.mesos.executor.docker.forcePullImage`). Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#14348 from philipphoffmann/force-pull-image.
## What changes were proposed in this pull request?
Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
`spark.mesos.executor.docker.forcePullImage`. Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).
## How was this patch tested?
I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set.
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#13051 from philipphoffmann/force-pull-image.
## What changes were proposed in this pull request?
This patch removes dev/audit-release. It was initially created to do basic release auditing. They have been unused by for the last one year+.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14342 from rxin/SPARK-16685.
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
## How was this patch tested?
No new tests, should pass the existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14150 from yanboliang/spark-16494.
## What changes were proposed in this pull request?
Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes
## How was this patch tested?
SparkR unit tests, running the above mentioned script
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#14173 from shivaram/sparkr-cran-changes.
## What changes were proposed in this pull request?
This PR adds hive-thriftserver profile to scala 2.10 build created by release-build.sh.
Author: Yin Huai <yhuai@databricks.com>
Closes#14108 from yhuai/SPARK-16453.
In the `dev/run-tests.py` script we check a `Popen.retcode` for success using `retcode > 0`, but this is subtlety wrong because Popen's return code will be negative if the child process was terminated by a signal: https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode
In order to properly handle signals, we should change this to check `retcode != 0` instead.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#13692 from JoshRosen/dev-run-tests-return-code-handling.
## What changes were proposed in this pull request?
A follow up PR for #13655 to fix a wrong format tag.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13665 from zsxwing/fix.
## What changes were proposed in this pull request?
We should mention that users can build Spark using multiple threads to decrease build times; either here or in "Building Spark"
## How was this patch tested?
Built on machines with between one core to 192 cores using mvn -T 1C and observed faster build times with no loss in stability
In response to the question here https://issues.apache.org/jira/browse/SPARK-15821 I think we should suggest this option as we know it works for Spark and can result in faster builds
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#13562 from a-roberts/patch-3.
## What changes were proposed in this pull request?
This PR just enables tests for sql/streaming.py and also fixes the failures.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13655 from zsxwing/python-streaming-test.
## What changes were proposed in this pull request?
Updating the Hadoop version from 2.7.0 to 2.7.2 if we use the Hadoop-2.7 build profile
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Existing tests
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating Hadoop 2.7.0 is not ready for production use
https://hadoop.apache.org/docs/r2.7.0/ states
"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should use 2.7.1 release and beyond."
Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x."
And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1."
I've tested this is OK with Intel hardware and IBM Java 8 so let's test it with OpenJDK, ideally this will be pushed to branch-2.0 and master.
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#13556 from a-roberts/patch-2.
This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern.
Tested manually with the following reproduction of the bug:
```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```
Author: Josh Rosen <joshrosen@databricks.com>
Closes#13568 from JoshRosen/SPARK-12712.
## What changes were proposed in this pull request?
revived #13464
Fix Java Lint errors introduced by #13286 and #13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```
## How was this patch tested?
ran `dev/lint-java` locally
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13559 from techaddict/minor-3.
## What changes were proposed in this pull request?
This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.
## How was this patch tested?
Jenkins unit tests, integration tests, manual tests)
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13417 from zsxwing/revert-SPARK-11753.
## What changes were proposed in this pull request?
This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.
## How was this patch tested?
This uses the existing Parquet tests.
Author: Ryan Blue <blue@apache.org>
Closes#13280 from rdblue/SPARK-9876-update-parquet.
## What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-15523
This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
## How was this patch tested?
1. Executed `mvn clean package` in `mllib` directory
2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
Author: Villu Ruusmann <villu.ruusmann@gmail.com>
Closes#13297 from vruusmann/update-jpmml.
## What changes were proposed in this pull request?
The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bin-tray repo).
This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.
## How was this patch tested?
Manually running SBT/Maven builds.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#13299 from hvanhovell/SPARK-15525.
## What changes were proposed in this pull request?
Jackson suppprts `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". Currently used Jackson version (2.5.3) doesn't support it all. This patch upgrades the library and make the two ignored tests in `JsonParsingOptionsSuite` passed.
## How was this patch tested?
`JsonParsingOptionsSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9759 from viirya/fix-json-nonnumric.
## What changes were proposed in this pull request?
I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class.
## How was this patch tested?
Tests were moved.
Author: Reynold Xin <rxin@databricks.com>
Closes#13207 from rxin/SPARK-15424.
## What changes were proposed in this pull request?
Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL.
## How was this patch tested?
Benchmark only
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13188 from sameeragarwal/tpcds-all.
## What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#12627 from dbtsai/SPARK-14615-NewML.
## What changes were proposed in this pull request?
(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)
Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#13074 from srowen/SPARK-15290.
## What changes were proposed in this pull request?
This is sort of a hot-fix for https://github.com/apache/spark/pull/13117, but, the problem is limited to Hadoop 2.2. The change is to manage `commons-io` to 2.4 for all Hadoop builds, which is only a net change for Hadoop 2.2, which was using 2.1.
## How was this patch tested?
Jenkins tests -- normal PR builder, then the `[test-hadoop2.2] [test-maven]` if successful.
Author: Sean Owen <sowen@cloudera.com>
Closes#13132 from srowen/SPARK-12972.3.
## What changes were proposed in this pull request?
(Retry of https://github.com/apache/spark/pull/13049)
- update to httpclient 4.5 / httpcore 4.4
- remove some defunct exclusions
- manage httpmime version to match
- update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)
## How was this patch tested?
Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`
Author: Sean Owen <sowen@cloudera.com>
Closes#13117 from srowen/SPARK-12972.2.
## What changes were proposed in this pull request?
- update httpcore/httpclient to latest
- centralize version management
- remove excludes that are no longer relevant according to SBT/Maven dep graphs
- also manage httpmime to match httpclient
## How was this patch tested?
Jenkins tests, plus review of dependency graphs from SBT/Maven, and review of test-dependencies.sh output
Author: Sean Owen <sowen@cloudera.com>
Closes#13049 from srowen/SPARK-12972.
## What changes were proposed in this pull request?
This upgrades to Py4J 0.10.1 which reduces syscal overhead in Java gateway ( see https://github.com/bartdag/py4j/issues/201 ). Related https://issues.apache.org/jira/browse/SPARK-6728 .
## How was this patch tested?
Existing doctests & unit tests pass
Author: Holden Karau <holden@us.ibm.com>
Closes#13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.
## What changes were proposed in this pull request?
Since Jetty 8 is EOL (end of life) and has critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], I think upgrading to 9 is necessary. I am using latest 9.2 since 9.3 requires Java 8+.
`javax.servlet` and `derby` were also upgraded since Jetty 9.2 needs corresponding version.
## How was this patch tested?
Manual test and current test cases should cover it.
Author: bomeng <bmeng@us.ibm.com>
Closes#12916 from bomeng/SPARK-14897.
## What changes were proposed in this pull request?
Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#12945 from clockfly/spark-15171.
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/12851
Remove `SparkSession.withHiveSupport` in PySpark and instead use `SparkSession.builder. enableHiveSupport`
## How was this patch tested?
Existing tests.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13063 from techaddict/SPARK-15072-followup.
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions
## How was this patch tested?
Unit tests
Author: cody koeninger <cody@koeninger.org>
Closes#12946 from koeninger/SPARK-15085.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-15148
Mainly it improves the performance roughtly about 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). For the details of the purpose is described in the JIRA.
This PR upgrades Univocity library from 2.0.2 to 2.1.0.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12923 from HyukjinKwon/SPARK-15148.
## What changes were proposed in this pull request?
Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.
## How was this patch tested?
I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.
Author: mcheah <mcheah@palantir.com>
Closes#12715 from mccheah/feature/upgrade-jersey.
## What changes were proposed in this pull request?
We had the issue when using snowplow in our Spark applications. Snowplow requires json4s version 3.2.11 while Spark still use a few years old version 3.2.10. The change is to upgrade json4s jar to 3.2.11.
## How was this patch tested?
We built Spark jar and successfully ran our applications in local and cluster modes.
Author: Lining Sun <lining@gmail.com>
Closes#12901 from liningalex/master.
## What changes were proposed in this pull request?
The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12765 from andrewor14/python-spark-session-more.
## What changes were proposed in this pull request?
This PR copy the thrift-server from hive-service-1.2 (including TCLIService.thrift and generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12764 from davies/thrift_server.
## What changes were proposed in this pull request?
This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.
## How was this patch tested?
Scala-style checks passed.
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes#12416 from pravingadakh/SPARK-14613.
## What changes were proposed in this pull request?
Currently, `build/mvn` provides a convenient option, `--force`, in order to use the recommended version of maven without changing PATH environment variable. However, there were two problems.
- `dev/lint-java` does not use the newly installed maven.
```bash
$ ./build/mvn --force clean
$ ./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
```
- It's not easy to type `--force` option always.
If '--force' option is used once, we had better prefer the installed maven recommended by Spark.
This PR makes `build/mvn` check the existence of maven installed by `--force` option first.
According to the comments, this PR aims to the followings:
- Detect the maven version from `pom.xml`.
- Install maven if there is no or old maven.
- Remove `--force` option.
## How was this patch tested?
Manual.
```bash
$ ./build/mvn --force clean
$ ./dev/lint-java
Using `mvn` from path: /Users/dongjoon/spark/build/apache-maven-3.3.9/bin/mvn
...
$ rm -rf ./build/apache-maven-3.3.9/
$ ./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12631 from dongjoon-hyun/SPARK-14867.
## What changes were proposed in this pull request?
Since `LZ4BlockInputStream.java` is not licensed to Apache Software Foundation (ASF), the Apache License header of that file is not monitored until now.
This PR aims to enable RAT checking on `LZ4BlockInputStream.java` by excluding from `dev/.rat-excludes`.
This will prevent accidental removal of Apache License header from that file.
## How was this patch tested?
Pass the Jenkins tests (Specifically, RAT check stage).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12677 from dongjoon-hyun/minor_rat_exclusion_file.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12585 from andrewor14/delete-hive-context.
## What changes were proposed in this pull request?
Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.
- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)
## How was this patch tested?
After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12632 from dongjoon-hyun/SPARK-14868.
## What changes were proposed in this pull request?
This PR creates a compatibility module in sql (called `hive-1-x-compatibility`), which will host HiveContext in Spark 2.0 (moving HiveContext to here will be done separately). This module is not included in assembly because only users who still want to access HiveContext need it.
## How was this patch tested?
I manually tested `sbt/sbt -Phive package` and `mvn -Phive package -DskipTests`.
Author: Yin Huai <yhuai@databricks.com>
Closes#12580 from yhuai/compatibility.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14787
The possible problems are described in the JIRA above. Please refer this if you are wondering the purpose of this PR.
This PR upgrades Joda-Time library from 2.9 to 2.9.3.
## How was this patch tested?
`sbt scalastyle` and Jenkins tests in this PR.
closes#11847
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12552 from HyukjinKwon/SPARK-14787.
## What changes were proposed in this pull request?
This commit adds support for pluggable cluster manager. And also allows a cluster manager to clean up tasks without taking the parent process down.
To plug a new external cluster manager, ExternalClusterManager trait should be implemented. It returns task scheduler and backend scheduler that will be used by SparkContext to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (This mechanism is also being used to register data sources like parquet, json, jdbc etc.). This allows auto-loading implementations of ExternalClusterManager interface.
Currently, when a driver fails, executors exit using system.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence,
1. Moving system.exit to a function that can be overriden in subclasses of CoarseGrainedExecutorBackend.
2. Added functionality of killing all the running tasks in an executor.
## How was this patch tested?
ExternalClusterManagerSuite.scala was added to test this patch.
Author: Hemant Bhanawat <hemant@snappydata.io>
Closes#11723 from hbhanawat/pluggableScheduler.
## What changes were proposed in this pull request?
In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies.
The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.
Thanks.
## How was this patch tested?
Unit tests
mengxr tedyu holdenk
Author: DB Tsai <dbt@netflix.com>
Closes#12298 from dbtsai/dbtsai-mllib-local-build-fix.
## What changes were proposed in this pull request?
In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. Couple platform independent classes will be moved to this package to demonstrate how this work.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Closes#12241 from dbtsai/dbtsai-mllib-local-build.
This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12076 from JoshRosen/kryo3.
## What changes were proposed in this pull request?
This PR resolves the problem during parsing unescaped quotes in input data. For example, currently the data below:
```
"a"b,ccc,ddd
e,f,g
```
produces a data below:
- **Before**
```bash
["a"b,ccc,ddd[\n]e,f,g] <- as a value.
```
- **After**
```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```
This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60.
## How was this patch tested?
Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12226 from HyukjinKwon/SPARK-14103-quote.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).
I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).
Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11796 from vanzin/SPARK-13579.
## What changes were proposed in this pull request?
Upgrade to 2.11.8 (from the current 2.11.7)
## How was this patch tested?
A manual build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#11681 from jaceklaskowski/SPARK-13825-scala-2_11_8.
## What changes were proposed in this pull request?
Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.
## How was this patch tested?
Tested by running a job on the cluster and saw 7.5% cpu savings after this change.
Author: Sital Kedia <skedia@fb.com>
Closes#12096 from sitalkedia/snappyRelease.
### What changes were proposed in this pull request?
This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.
### How was this patch tested?
Existing unit tests.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12071 from hvanhovell/SPARK-14211.
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs.
This PR is a work in progress, and work needs to be done in the following area's:
- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.
### How was this patch tested?
Catalyst and SQL unit tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11557 from hvanhovell/ngParser.
## What changes were proposed in this pull request?
This PR moves flume back to Spark as per the discussion in the dev mail-list.
## How was this patch tested?
Existing Jenkins tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11895 from zsxwing/move-flume-back.
## What changes were proposed in this pull request?
Change lint python script to stop on first error rather than building them up so its clearer why we failed (requested by rxin). Also while in the file, remove the commented out code.
## How was this patch tested?
Manually ran lint-python script with & without pep8 errors locally and verified expected results.
Author: Holden Karau <holden@us.ibm.com>
Closes#11898 from holdenk/SPARK-13887-pylint-fast-fail.
## What changes were proposed in this pull request?
In dev/lint-r.R, `install_github` makes our builds depend on a unstable source. This may cause un-expected test failures and then build break. This PR adds a specified commit sha1 ID to `install_github` to get a stable source.
## How was this patch tested?
dev/lint-r
Author: Sun Rui <rui.sun@intel.com>
Closes#11913 from sun-rui/SPARK-14074.
## What changes were proposed in this pull request?
[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.
```xml
- <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
- <!--
<module name="LineLength">
<property name="max" value="100"/>
<property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
</module>
- -->
<module name="NoLineWrap"/>
<module name="EmptyBlock">
<property name="option" value="TEXT"/>
-167,5 +164,7
</module>
<module name="CommentsIndentation"/>
<module name="UnusedImports"/>
+ <module name="RedundantImport"/>
+ <module name="RedundantModifier"/>
```
## How was this patch tested?
Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11831 from dongjoon-hyun/SPARK-14011.
MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version.
This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11774 from JoshRosen/SPARK-13948.
As part of the goal to stop creating assemblies in Spark, this change
modifies the mvn and sbt builds to not create an assembly for examples.
Instead, dependencies are copied to the build directory (under
target/scala-xx/jars), and in the final archive, into the "examples/jars"
directory.
To avoid having to deal too much with Windows batch files, I made examples
run through the launcher library; the spark-submit launcher now has a
special mode to run examples, which adds all the necessary jars to the
spark-submit command line, and replaces the bash and batch scripts that
were used to run examples. The scripts are now just a thin wrapper around
spark-submit; another advantage is that now all spark-submit options are
supported.
There are a few glitches; in the mvn build, a lot of duplicated dependencies
get copied, because they are promoted to "compile" scope due to extra
dependencies in the examples module (such as HBase). In the sbt build,
all dependencies are copied, because there doesn't seem to be an easy
way to filter things.
I plan to clean some of this up when the rest of the tasks are finished.
When the main assembly is replaced with jars, we can remove duplicate jars
from the examples directory during packaging.
Tested by running SparkPi in: maven build, sbt build, dist created by
make-distribution.sh.
Finally: note that running the "assembly" target in sbt doesn't build
the examples anymore. You need to run "package" for that.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11452 from vanzin/SPARK-13576.
## What changes were proposed in this pull request?
Currently there are a few sub-projects, each for integrating with different external sources for Streaming. Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages
- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter
They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster.
I have already copied these projects to https://github.com/spark-packages
## How was this patch tested?
Jenkins tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11672 from zsxwing/remove-external-pkg.
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.
In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.
Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2
/cc zsxwing tdas davies brkyvz
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11687 from JoshRosen/py4j-0.9.2.
## What changes were proposed in this pull request?
For 2.0.0, we had better make **sbt** and **sbt plugins** up-to-date. This PR checks the status of each plugins and bumps the followings.
* sbt: 0.13.9 --> 0.13.11
* sbteclipse-plugin: 2.2.0 --> 4.0.0
* sbt-dependency-graph: 0.7.4 --> 0.8.2
* sbt-mima-plugin: 0.1.6 --> 0.1.9
* sbt-revolver: 0.7.2 --> 0.8.0
All other plugins are up-to-date. (Note that `sbt-avro` seems to be change from 0.3.2 to 1.0.1, but it's not published in the repository.)
During upgrade, this PR also updated the following MiMa error. Note that the related excluding filter is already registered correctly. It seems due to the change of MiMa exception result.
```
// SPARK-12896 Send only accumulator updates to driver, not TaskMetrics
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"),
-ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"),
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"),
```
## How was this patch tested?
Pass the Jenkins build.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11669 from dongjoon-hyun/update_mima.
## What changes were proposed in this pull request?
PR #11443 temporarily disabled MiMA check, this PR re-enables it.
One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes.
## How was this patch tested?
Tested by MiMA check triggered by Jenkins.
Author: Cheng Lian <lian@databricks.com>
Closes#11656 from liancheng/re-enable-mima.
This patch removes the need to build a full Spark assembly before running the `dev/mima` script.
- I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies.
- This required me to delete two classes full of dead code that we don't use anymore
- `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build.
- `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11178 from JoshRosen/remove-assembly-in-run-tests.
## What changes were proposed in this pull request?
This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`.
Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).
There are several noticeable API changes related to those returning arrays:
1. `collect`/`take`
- Old APIs in class `DataFrame`:
```scala
def collect(): Array[Row]
def take(n: Int): Array[Row]
```
- New APIs in class `Dataset[T]`:
```scala
def collect(): Array[T]
def take(n: Int): Array[T]
def collectRows(): Array[Row]
def takeRows(n: Int): Array[Row]
```
Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.
Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
1. `randomSplit`
- Old APIs in class `DataFrame`:
```scala
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
def randomSplit(weights: Array[Double]): Array[DataFrame]
```
- New APIs in class `Dataset[T]`:
```scala
def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
def randomSplit(weights: Array[Double]): Array[Dataset[T]]
```
Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one.
1. `groupBy`
Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.
Other noticeable changes:
1. Dataset always do eager analysis now
We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`.
## How was this patch tested?
Existing tests do the work.
## TODO
- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>
Closes#11443 from liancheng/ds-to-df.
## What changes were proposed in this pull request?
Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
Supersedes https://github.com/apache/spark/pull/11524
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11631 from srowen/SPARK-13663.
## What changes were proposed in this pull request?
Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
## How was this patch tested?
This is tested with Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11523 from srowen/SPARK-13595.
## What changes were proposed in this pull request?
This PR fixes `dev/lint-java` and `mvn checkstyle:check` failures due the recent file location change.
The following is the error message of current master.
```
Checkstyle checks failed at following occurrences:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (default-cli) on project spark-parent_2.11: Failed during checkstyle configuration: cannot initialize module SuppressionFilter - Cannot set property 'file' to 'checkstyle-suppressions.xml' in module SuppressionFilter: InvocationTargetException: Unable to find: checkstyle-suppressions.xml -> [Help 1]
```
## How was this patch tested?
Manual. The following command should run correctly.
```
./dev/lint-java
mvn checkstyle:check
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11567 from dongjoon-hyun/hotfix_checkstyle_suppression.
## What changes were proposed in this pull request?
Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`.
I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope.
## How was this patch tested?
`./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest.
Author: Sean Owen <sowen@cloudera.com>
Closes#11522 from srowen/SPARK-13596.
## What changes were proposed in this pull request?
This PR fixes typos in comments and testcase name of code.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11481 from dongjoon-hyun/minor_fix_typos_in_code.
## What changes were proposed in this pull request?
Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR.
This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.
## How was this patch tested?
1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver -Dverbose > target/dependencies.txt`
1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
1. Patch applied
1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
1. Examined created spark-assembly, verified no org.codehaus packages
1. Verified that the maven dependency tree no longer references groovy
Note also that the size of the assembly JAR was 181628646 bytes before this patch, 166318515 after —15MB smaller. That's a good metric of things being excluded
Author: Steve Loughran <stevel@hortonworks.com>
Closes#11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
## What changes were proposed in this pull request?
The PR fixes typos in an error message in dev/run-tests.py.
Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>
Closes#11467 from wjur/wjur/typos_run_tests.
## What changes were proposed in this pull request?
```
error] Expected ID character
[error] Not a valid command: common (similar: completions)
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: common (similar: commands)
[error] common/network-yarn/test
```
`common/network-yarn` is not a valid sbt project, we should change to `network-yarn`.
## How was this patch tested?
Locally run the the unit-test.
CC rxin , we should either change here, or change the sbt project name.
Author: jerryshao <sshao@hortonworks.com>
Closes#11456 from jerryshao/build-fix.
## What changes were proposed in this pull request?
As the title says, this moves the three modules currently in network/ into common/network-*. This removes one top level, non-user-facing folder.
## How was this patch tested?
Compilation and existing tests. We should run both SBT and Maven.
Author: Reynold Xin <rxin@databricks.com>
Closes#11409 from rxin/SPARK-13529.
It registers more Scala classes, including ListBuffer to support Kryo with FPGrowth.
See https://github.com/twitter/chill/releases for Chill's change log.
Author: mark800 <yky800@126.com>
Closes#11041 from mark800/master.
Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11350 from JoshRosen/update-release-scripts-for-apache-home.
Phase 1: update plugin versions, test dependencies, some example and third-party versions
Author: Sean Owen <sowen@cloudera.com>
Closes#11206 from srowen/SPARK-13324.
We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.
Right now ./dev/lint-python will skip building the docs if sphinx isn't present - but it might make sense to fail hard - just a matter of if we want to insist all PySpark developers have sphinx present.
Author: Holden Karau <holden@us.ibm.com>
Closes#11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10608 from JoshRosen/SPARK-6363.
There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10933 from JoshRosen/build-module-fix.
This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies. This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure
Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after.
In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10885 from JoshRosen/SPARK-8725.
Minor since so few people use them, but it would probably be good to have a requirements file for our python release tools for easier setup (also version pinning).
cc JoshRosen who looked at the original JIRA.
Author: Holden Karau <holden@us.ibm.com>
Closes#10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.
This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1].
As required by the [design doc][2], spark-sketch should have no external dependency.
Two classes, `Murmur3_x86_32` and `Platform` are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation.
The following features will be added in future follow-up PRs:
- Serialization support
- DataFrame API integration
[1]: aac6b4d23a/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
[2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf
Author: Cheng Lian <lian@databricks.com>
Closes#10851 from liancheng/count-min-sketch.
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10854 from zsxwing/remove-akka.
Include the following changes:
1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
3. Update the ActorWordCount example and add the JavaActorWordCount example
4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10744 from zsxwing/streaming-akka-2.
This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
/cc rxin srowen
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10775 from JoshRosen/add-hadoop-2.7-profile.
This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems.
There are some other things to remove also, as pointed out by JoshRosen. We will do those as follow-up pull requests.
Author: Reynold Xin <rxin@databricks.com>
Closes#10752 from rxin/remove-offheap.
CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL.
This is a first PR to bring the functionality to spark 2.0 master. We will complete items outlines in the design document (see JIRA attachment) in follow up pull requests.
Author: Hossein <hossein@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#10766 from rxin/csv.
It was previously turned off because there was a problem with a pull request. We should turn it on now.
Author: Reynold Xin <rxin@databricks.com>
Closes#10763 from rxin/SPARK-12829.
When running the `run-tests` script, style checkers run only when any source files are modified but they should run when configuration files related to style are modified.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10754 from sarutak/SPARK-12821.
This patch modifies our PR merge script to reset back to a named branch when restoring the original checkout upon exit. When the committer is originally checked out to a detached head, then they will be restored back to that same ref (the same as today's behavior).
This is a slightly updated version of #7569, with an extra fix to handle the detached head corner-case.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10709 from JoshRosen/SPARK-9383.
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
- Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10692 from zsxwing/py4j-0.9.1.
This patch fixes a build/test issue caused by the combination of #10672 and a latent issue in the original `dev/test-dependencies` script.
First, changes which _only_ touched build files were not triggering full Jenkins runs, making it possible for a build change to be merged even though it could cause failures in other tests. The `root` build module now depends on `build`, so all tests will now be run whenever a build-related file is changed.
I also added a `clean` step to the Maven install step in `dev/test-dependencies` in order to address an issue where the dummy JARs stuck around and caused "multiple assembly JARs found" errors in tests.
/cc zsxwing
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10704 from JoshRosen/fix-build-test-problems.
The current Spark Streaming kinesis connector references a quite old version 1.9.40 of the AWS Java SDK (1.10.40 is current). Numerous AWS features including Kinesis Firehose are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively) such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL.
Author: BrianLondon <brian@seatgeek.com>
Closes#10256 from BrianLondon/master.
This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's not a direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact.
While the diff here looks somewhat large, note that this is only a revert of a some of the changes from #10672. You can see the net changes in pom.xml at 3119206b71...5211ab8 (diff-600376dffeb79835ede4a0b285078036)
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10693 from JoshRosen/netty-hotfix.
Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse).
This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds.
/cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath.
Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10672 from JoshRosen/enforce-netty-exclusions.
This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made:
The ANTLR Parser & Supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project, I have added aknowledgements whenever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean-up the ```ASTNode``` class, and to improve the error handling.
The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
- ```CatalystQl```: This implements Query and Expression parsing functionality.
- ```SparkQl```: This is a subclass of CatalystQL and provides SQL/Core only functionality such as Explain and Describe.
- ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.
cc rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#10583 from hvanhovell/SPARK-12575.
rxin davies shivaram
Took save mode from my PR #10480, and move everything to writer methods. This is related to PR #10559
- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into some more tonight. (fixed)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#10584 from felixcheung/rremovedeprecated.
This patch aims to fix another potential source of flakiness in the `dev/test-dependencies.sh` script.
pwendell's original patch and my version used `$(date +%s | tail -c6)` to generate a suffix to use when installing temporary Spark versions into the local Maven cache, but this value only changes once per second and thus is highly collision-prone when concurrent builds launch on AMPLab Jenkins. In order to reduce the potential for conflicts, this patch updates the script to call Python's random number generator instead.
I also fixed a bug in how we captured the original project version; the bug was causing the exit handler code to fail.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10558 from JoshRosen/build-dep-tests-round-3.
There are a couple of places in the `dev/run-tests-*.py` scripts which deal with Hadoop profiles, but the set of profiles that they handle does not include all Hadoop profiles defined in our POM. Similarly, the `hadoop-2.2` and `hadoop-2.6` profiles were missing from `dev/deps`.
This patch updates these scripts to include all four Hadoop profiles defined in our POM.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10565 from JoshRosen/add-missing-hadoop-profiles-in-test-scripts.
This patch includes multiple fixes for the `dev/test-dependencies.sh` script (which was introduced in #10461):
- Use `build/mvn --force` instead of `mvn` in one additional place.
- Explicitly set a zero exit code on success.
- Set `LC_ALL=C` to make `sort` results agree across machines (see https://stackoverflow.com/questions/28881/).
- Set `should_run_build_tests=True` for `build` module (this somehow got lost).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10543 from JoshRosen/dep-script-fixes.
This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether pull request which modify the build have introduced new dependencies or changed transitive dependencies in a way that affects the final classpath.
This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs.
This patch is based on pwendell's work in #8531.
Closes#8531.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>
Closes#10461 from JoshRosen/SPARK-10359.
This patch fixes a handful of minor bugs in the `dev/tests/pr_public_classes.sh` script, which is used by the `run_tests_jenkins` script to detect the addition of new public classes:
- Account for differences between BSD and GNU `sed` in order to allow the script to run on OS X.
- Diff `$ghprbActualCommit^...$ghprbActualCommit ` instead of `master...$ghprbActualCommit`: since `ghprbActualCommit` is a merge commit which results from merging the PR into the target branch, this will give us the desired diff and will avoid certain race-conditions which could lead to false-positives.
- Use `echo -e` instead of `echo` so that newline characters are handled correctly in output. This should fix a formatting glitch which caused the output to appear on a single line in the GitHub comment (see [the SC2028 page](https://github.com/koalaman/shellcheck/wiki/SC2028) on the Shellcheck wiki for more details).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10455 from JoshRosen/fix-pr-public-classes-test.
fix an exception with IBM JDK by removing update field from a JavaVersion tuple. This is because IBM JDK does not have information on update '_xx'
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#10463 from kiszk/SPARK-12502.
Currently, `dev/scalastyle` invokes SBT four times, but these invocations can be replaced with a single invocation, saving about one minute of build time.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10151 from JoshRosen/speed-up-scalastyle.
This replaces https://github.com/apache/spark/pull/9696
Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.
Suggest fixing those TODOs in a separate PR(s).
More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles):
> Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
Also fix some of the minor violations that didn't require sweeping changes.
Apologies for the previous botched PRs - I finally figured out the issue.
cr: JoshRosen, pwendell
> I state that the contribution is my original work, and I license the work to the project under the project's open source license.
Author: Dmitry Erastov <derastov@gmail.com>
Closes#9867 from dskrvk/master.
This patch modifies Spark's SBT build so that it no longer uses `retrieveManaged` / `lib_managed` to store its dependencies. The motivations for this change are nicely described on the JIRA ticket ([SPARK-7841](https://issues.apache.org/jira/browse/SPARK-7841)); my personal interest in doing this stems from the fact that `lib_managed` has caused me some pain while debugging dependency issues in another PR of mine.
Removing our use of `lib_managed` would be trivial except for one snag: the Datanucleus JARs, required by Spark SQL's Hive integration, cannot be included in assembly JARs due to problems with merging OSGI `plugin.xml` files. As a result, several places in the packaging and deployment pipeline assume that these Datanucleus JARs are copied to `lib_managed/jars`. In the interest of maintaining compatibility, I have chosen to retain the `lib_managed/jars` directory _only_ for these Datanucleus JARs and have added custom code to `SparkBuild.scala` to automatically copy those JARs to that folder as part of the `assembly` task.
`dev/mima` also depended on `lib_managed` in a hacky way in order to set classpaths when generating MiMa excludes; I've updated this to obtain the classpaths directly from SBT instead.
/cc dragos marmbrus pwendell srowen
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9575 from JoshRosen/SPARK-7841.
Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9450 from JoshRosen/upgrade-to-scala-2.10.5.
This commit refactors the `run-tests-jenkins` script into Python. This refactoring was done by brennonyork in #7401; this PR contains a few minor edits from joshrosen in order to bring it up to date with other recent changes.
From the original PR description (by brennonyork):
Currently a few things are left out that, could and I think should, be smaller JIRA's after this.
1. There are still a few areas where we use environment variables where we don't need to (like `CURRENT_BLOCK`). I might get around to fixing this one in lieu of everything else, but wanted to point that out.
2. The PR tests are still written in bash. I opted to not change those and just rewrite the runner into Python. This is a great follow-on JIRA IMO.
3. All of the linting scripts are still in bash as well and would likely do to just add those in as follow-on JIRA's as well.
Closes#7401.
Author: Brennon York <brennon.york@capitalone.com>
Closes#9161 from JoshRosen/run-tests-jenkins-refactoring.
Our merge script now turns
```
[SPARK-1234][SPARK-1235][SPARK-1236][SQL] description
```
into
```
[SPARK-1234] [SPARK-1235] [SPARK-1236] [SQL] description
```
The extra spaces are more annoying in git since the first line of a git commit is supposed to be very short.
Doctest passes with the following command:
```
python -m doctest merge_spark_pr.py
```
Author: Reynold Xin <rxin@databricks.com>
Closes#9156 from rxin/SPARK-11169.
Removes any extra strings from the Java version, fixing subsequent integer parsing.
This is required since some OpenJDK versions (specifically in Debian testing), append an extra "-internal" string to the version field.
Author: Jakob Odersky <jodersky@gmail.com>
Closes#9111 from jodersky/fixtestrunner.
Spark's release packaging scripts used to live in a separate repository. Although these scripts are now part of the Spark repo, there are some minor patches made against the old repos that are missing in Spark's copy of the script. This PR ports those changes.
/cc shivaram, who originally submitted these changes against https://github.com/rxin/spark-utils
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8986 from JoshRosen/port-release-build-fixes-from-rxin-repo.
As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public links to view them.
Per shaneknapp, we should remove this log syncing mechanism if it is no longer necessary; removing the need to SCP from the Jenkins workers to the masters is a desired step as part of some larger Jenkins infra refactoring.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8793 from JoshRosen/remove-jenkins-ssh-to-master.
The calculation of Spark version is downloading
Scala and Zinc in the build directory which is
inflating the size of the source distribution.
Reseting the repo before packaging the source
distribution fix this issue.
Author: Luciano Resende <lresende@apache.org>
Closes#8774 from lresende/spark-10511.
This change does two things:
- tag a few tests and adds the mechanism in the build to be able to disable those tags,
both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
changed; that's used to disable expensive tests when a module hasn't explicitly
been changed, to speed up testing for changes that don't directly affect those
modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8437 from vanzin/test-tags.
Location of JIRAError has moved between old and new versions of python-jira package.
Longer term it probably makes sense to pin to specific versions (as mentioned in https://issues.apache.org/jira/browse/SPARK-10498 ) but for now, making release tools works with both new and old versions of python-jira.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8661 from holdenk/SPARK-10497-release-utils-does-not-work-with-new-jira-python.
This is just some small glue code to actually make use of the
AMPLAB_JENKINS_BUILD_TOOL switch. As far as I can tell, we actually
don't currently use the Maven support in the tool even though it exists.
This patch switches to Maven when the PR title contains "test-maven".
There are a few small other pieces of cleanup in the patch as well.
Author: Patrick Wendell <patrick@databricks.com>
Closes#7878 from pwendell/maven-tests.
JoshRosen we'd like to check the SparkR source code with the `dev/lint-r` script on the Jenkins. I tried to incorporate the script into `dev/run-test.py`. Could you review it when you have time?
shivaram I modified `dev/lint-r` and `dev/lint-r.R` to install lintr package into a local directory(`R/lib/`) and to exit with a lint status. Could you review it?
- [[SPARK-8505] Add settings to kick `lint-r` from `./dev/run-test.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8505)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#7883 from yu-iskw/SPARK-8505.
The current `release-build.sh` has a typo which breaks snapshot publication for Scala 2.11. We should change the Scala version to 2.11 and clean before building a 2.11 snapshot.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8325 from JoshRosen/fix-2.11-snapshots.
This update contains some code changes to the release scripts that allow easier nightly publishing. I've been using these new scripts on Jenkins for cutting and publishing nightly snapshots for the last month or so, and it has been going well. I'd like to get them merged back upstream so this can be maintained by the community.
The main changes are:
1. Separates the release tagging from various build possibilities for an already tagged release (`release-tag.sh` and `release-build.sh`).
2. Allow for injecting credentials through the environment, including GPG keys. This is then paired with secure key injection in Jenkins.
3. Support for copying build results to a remote directory, and also "rotating" results, e.g. the ability to keep the last N copies of binary or doc builds.
I'm happy if anyone wants to take a look at this - it's not user facing but an internal utility used for generating releases.
Author: Patrick Wendell <patrick@databricks.com>
Closes#7411 from pwendell/release-script-updates and squashes the following commits:
74f9beb [Patrick Wendell] Moving maven build command to a variable
233ce85 [Patrick Wendell] [SPARK-1517] Refactor release scripts to facilitate nightly publishing
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#8092 from tdas/SPARK-9727 and squashes the following commits:
b1b01fd [Tathagata Das] Updated streaming kinesis project name
This PR is based on #4229, thanks prabeesh.
Closes#4229
Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>
Closes#7833 from zsxwing/pr4229 and squashes the following commits:
9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:
bb039cb [Mike Dusenberry] Minor documentation update.
b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
7f0dcb6 [Mike Dusenberry] Updating module docstring.
cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
308f197 [Mike Dusenberry] Using properties for better documentation.
1633f86 [Mike Dusenberry] Minor documentation cleanup.
f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
3fd4016 [Mike Dusenberry] Updating docstrings.
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
4ad6819 [Mike Dusenberry] Documenting the and parameters.
3b854b9 [Mike Dusenberry] Minor updates to documentation.
10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic.
4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
6a3ecb7 [Mike Dusenberry] Updating pattern matching.
08f287b [Mike Dusenberry] Slight reformatting of the documentation.
a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints.
4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
7e3ca16 [Mike Dusenberry] Fixing long lines.
f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. This way, we can call for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.