## What changes were proposed in this pull request?
Shade JPMML classes (`org.jpmml.**`) and related PMML model classes (`org.dmg.pmml.**`). This insulates downstream users from the version of JPMML in Spark, allows us to upgrade more freely, and allows downstream users to use a different version. JPMML minor releases are not generally forwards/backwards compatible.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#18584 from srowen/SPARK-15526.
## What changes were proposed in this pull request?
- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17150 from srowen/SPARK-19810.
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. Data types except complex, date, timestamp, and decimal are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly.
Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes#18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).
## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly.
Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes#15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
## What changes were proposed in this pull request?
Update hadoop-2.7 profile's curator version to 2.7.1, more see [SPARK-13933](https://issues.apache.org/jira/browse/SPARK-13933).
## How was this patch tested?
manual tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18247 from wangyum/SPARK-13933.
This change adds an abstraction and LevelDB implementation for a key-value
store that will be used to store UI and SHS data.
The interface is described in KVStore.java (see javadoc). Specifics
of the LevelDB implementation are discussed in the javadocs of both
LevelDB.java and LevelDBTypeInfo.java.
Included also are a few small benchmarks just to get some idea of
latency. Because they're too slow for regular unit test runs, they're
disabled by default.
Tested with the included unit tests, and also as part of the overall feature
implementation (including running SHS with hundreds of apps).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#17902 from vanzin/shs-ng/M1.
## What changes were proposed in this pull request?
Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.
It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.
There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
(this is the successor to #12004; I can't re-open it)
## How was this patch tested?
Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.
Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.
SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
maven build `mvn install -Phadoop-cloud -Phadoop-2.7`
This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.
Author: Steve Loughran <stevel@apache.org>
Author: Steve Loughran <stevel@hortonworks.com>
Closes#17834 from steveloughran/cloud/SPARK-7481-current.
## What changes were proposed in this pull request?
Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17803 from srowen/SPARK-20523.
Upgrade Jetty so it can work with Hadoop 3 (alpha 2 release, in particular).
Without this change, because of incompatibily between Jetty versions,
Spark fails to compile when built against Hadoop 3
## How was this patch tested?
Unit tests being run.
Author: Mark Grover <mark@apache.org>
Closes#17790 from markgrover/spark-20514.
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17746 from yanboliang/spark-20449.
## What changes were proposed in this pull request?
This PR proposes two things as below:
- Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build
Due to a different dependency resolution in SBT & Unidoc by an unknown reason, the documentation build fails on a specific machine & environment in Jenkins but it was unable to reproduce.
So, this PR just checks an environment variable `AMPLAB_JENKINS_BUILD_PROFILE` that is set in Hadoop 2.6 SBT build against branches on Jenkins, and then disables Unidoc build. **Note that PR builder will still build it with Hadoop 2.6 & SBT.**
```
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
...
```
I checked the environment variables from the logs (first bit) as below:
- **spark-master-test-sbt-hadoop-2.6** (this one is being failed) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
SPARK_BRANCH=master
AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6 <- I use this variable
AMPLAB_JENKINS="true"
```
- spark-master-test-sbt-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
SPARK_BRANCH=master
AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.7
AMPLAB_JENKINS="true"
```
- spark-master-test-maven-hadoop-2.6 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
HADOOP_PROFILE=hadoop-2.6
HADOOP_VERSION=
SPARK_BRANCH=master
AMPLAB_JENKINS="true"
```
- spark-master-test-maven-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
HADOOP_PROFILE=hadoop-2.7
HADOOP_VERSION=
SPARK_BRANCH=master
AMPLAB_JENKINS="true"
```
- PR builder - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75843/consoleFull
```
JENKINS_MASTER_HOSTNAME=amp-jenkins-master
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
```
Assuming from other logs in branch-2.1
- SBT & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
SPARK_BRANCH=branch-2.1
AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6
AMPLAB_JENKINS="true"
```
- Maven & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-maven-hadoop-2.6/lastBuild/consoleFull
```
JAVA_HOME=/usr/java/jdk1.8.0_60
JAVA_7_HOME=/usr/java/jdk1.7.0_79
HADOOP_PROFILE=hadoop-2.6
HADOOP_VERSION=
SPARK_BRANCH=branch-2.1
AMPLAB_JENKINS="true"
```
We have been using the same convention for those variables. These are actually being used in `run-tests.py` script - here https://github.com/apache/spark/blob/master/dev/run-tests.py#L519-L520
- Revert the previous try
After https://github.com/apache/spark/pull/17651, it seems the build still fails on SBT Hadoop 2.6 master.
I am unable to reproduce this - https://github.com/apache/spark/pull/17477#issuecomment-294094092 and the reviewer was too. So, this got merged as it looks the only way to verify this is to merge it currently (as no one seems able to reproduce this).
## How was this patch tested?
I only checked `is_hadoop_version_2_6 = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6"` is working fine as expected as below:
```python
>>> import collections
>>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.6"})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
False
>>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.7"})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
True
>>> os = collections.namedtuple('os', 'environ')(environ={})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
True
```
I tried many ways but I was unable to reproduce this in my local. Sean also tried the way I did but he was also unable to reproduce this.
Please refer the comments in https://github.com/apache/spark/pull/17477#issuecomment-294094092
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17669 from HyukjinKwon/revert-SPARK-20343.
## What changes were proposed in this pull request?
This PR proposes to force Avro's version to 1.7.7 in core to resolve the build failure as below:
```
[error] /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData
[error] writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema))
[error]
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
Note that this is a hack and should be removed in the future.
## How was this patch tested?
I only tested this actually overrides the dependency.
I tried many ways but I was unable to reproduce this in my local. Sean also tried the way I did but he was also unable to reproduce this.
Please refer the comments in https://github.com/apache/spark/pull/17477#issuecomment-294094092
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17651 from HyukjinKwon/SPARK-20343-sbt.
- Add dependency on aws-java-sdk-sts
- Replace SerializableAWSCredentials with new SerializableCredentialsProvider interface
- Make KinesisReceiver take SerializableCredentialsProvider as argument and
pass credential provider to KCL
- Add new implementations of KinesisUtils.createStream() that take STS
arguments
- Make JavaKinesisStreamSuite test the entire KinesisUtils Java API
- Update KCL/AWS SDK dependencies to 1.7.x/1.11.x
## What changes were proposed in this pull request?
[JIRA link with detailed description.](https://issues.apache.org/jira/browse/SPARK-19405)
* Replace SerializableAWSCredentials with new SerializableKCLAuthProvider class that takes 5 optional config params for configuring AWS auth and returns the appropriate credential provider object
* Add new public createStream() APIs for specifying these parameters in KinesisUtils
## How was this patch tested?
* Manually tested using explicit keypair and instance profile to read data from Kinesis stream in separate account (difficult to write a test orchestrating creation and assumption of IAM roles across separate accounts)
* Expanded JavaKinesisStreamSuite to test the entire Java API in KinesisUtils
## License acknowledgement
This contribution is my original work and that I license the work to the project under the project’s open source license.
Author: Budde <budde@amazon.com>
Closes#16744 from budde/master.
- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings
For the future:
- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#16871 from srowen/SPARK-19493.
## What changes were proposed in this pull request?
- Remove support for Hadoop 2.5 and earlier
- Remove reflection and code constructs only needed to support multiple versions at once
- Update docs to reflect newer versions
- Remove older versions' builds and profiles.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#16810 from srowen/SPARK-19464.
## What changes were proposed in this pull request?
According to the discussion on #16281 which tried to upgrade toward Apache Parquet 1.9.0, Apache Spark community prefer to upgrade to 1.8.2 instead of 1.9.0. Now, Apache Parquet 1.8.2 is released officially last week on 26 Jan. We can use 1.8.2 now.
https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142%3Cdev.parquet.apache.org%3E
This PR only aims to bump Parquet version to 1.8.2. It didn't touch any other codes.
## How was this patch tested?
Pass the existing tests and also manually by doing `./dev/test-dependencies.sh`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16751 from dongjoon-hyun/SPARK-19409.
**What changes were proposed in this pull request?**
Use Hadoop 2.6.5 for the Hadoop 2.6 profile, I see a bunch of fixes including security ones in the release notes that we should pick up
**How was this patch tested?**
Running the unit tests now with IBM's SDK for Java and let's see what happens with OpenJDK in the community builder - expecting no trouble as it is only a minor release.
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#16616 from a-roberts/Hadoop265Bumper.
## What changes were proposed in this pull request?
Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16568 from zsxwing/SPARK-18971.
## What changes were proposed in this pull request?
Updates to libthrift 0.9.3 to address a CVE.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#16530 from srowen/SPARK-18997.
Remove spark-tag's compile-scope dependency (and, indirectly, spark-core's compile-scope transitive-dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.
Alternative to #16303.
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes#16311 from ryan-williams/tt.
## What changes were proposed in this pull request?
I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes jackson fail to handle byte array defined in a case class. Then I find https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use com.thoughtworks.paranamer/paranamer 2.6, I suggests that we upgrade paranamer to 2.6.
Author: Yin Huai <yhuai@databricks.com>
Closes#16359 from yhuai/SPARK-18951.
## What changes were proposed in this pull request?
Upgrading KCL version from 1.6.1 to 1.6.2. Without this upgrade, Spark cannot consume
from a stream that includes aggregated records.
This change was already commited against an older version of Spark. We need to
apply the same thing to master.
## How was this patch tested?
Manual testing using dump.py:
https://gist.github.com/boneill42/020dde814346c6b4ad0ba28406c3ea10
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Brian O'Neill <bone@alumni.brown.edu>
Closes#16236 from boneill42/master.
## What changes were proposed in this pull request?
This PR is to upgrade:
```
sbt: 0.13.11 -> 0.13.13,
zinc: 0.3.9 -> 0.3.11,
maven-assembly-plugin: 2.6 -> 3.0.0
maven-compiler-plugin: 3.5.1 -> 3.6.
maven-jar-plugin: 2.6 -> 3.0.2
maven-javadoc-plugin: 2.10.3 -> 2.10.4
maven-source-plugin: 2.4 -> 3.0.1
org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12
org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0
```
The sbt release notes since the last version we used are: [v0.13.12](https://github.com/sbt/sbt/releases/tag/v0.13.12) and [v0.13.13 ](https://github.com/sbt/sbt/releases/tag/v0.13.13).
## How was this patch tested?
Pass build and the existing tests.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#16069 from weiqingy/SPARK-18638.
## What changes were proposed in this pull request?
Force update to latest Netty 3.9.x, for dependencies like Flume, to resolve two CVEs. 3.9.2 is the first version that resolves both, and, this is the latest in the 3.9.x line.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#16102 from srowen/SPARK-18586.
## What changes were proposed in this pull request?
This patch bumps master branch version to 2.2.0-SNAPSHOT.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#16126 from rxin/SPARK-18695.
## What changes were proposed in this pull request?
org.codehaus.janino:janino depends on org.codehaus.janino:commons-compiler and we have been upgraded to org.codehaus.janino:janino 3.0.0.
However, seems we are still pulling in org.codehaus.janino:commons-compiler 2.7.6 because of calcite. It looks like an accident because we exclude janino from calcite (see here https://github.com/apache/spark/blob/branch-2.1/pom.xml#L1759). So, this PR upgrades org.codehaus.janino:commons-compiler to 3.0.0.
## How was this patch tested?
jenkins
Author: Yin Huai <yhuai@databricks.com>
Closes#16025 from yhuai/janino-commons-compile.
## What changes were proposed in this pull request?
This PR only tries to fix things that looks pretty straightforward and were fixed in other previous PRs before.
This PR roughly fixes several things as below:
- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
[error] * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
```
- Fix an exception annotation and remove code backticks in `throws` annotation
Currently, sbt unidoc with Java 8 complains as below:
```
[error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
[error] * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
```
`throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
- Fix `[[http..]]` to `<a href="http..."></a>`.
```diff
- * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
- * blog page]].
+ * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
+ * Oracle blog page</a>.
```
`[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
- It seems class can't have `return` annotation. So, two cases of this were removed.
```
[error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
[error] * return New instance of IsotonicRegression.
```
- Fix < to `<` and > to `>` according to HTML rules.
- Fix `</p>` complaint
- Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.
## How was this patch tested?
Manually tested by `jekyll build` with Java 7 and 8
```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```
```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```
Note: this does not yet make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15999 from HyukjinKwon/SPARK-3359-errors.
## What changes were proposed in this pull request?
This PR proposes/fixes two things.
- Remove many errors to generate javadoc with Java8 from unrecognisable tags, `tparam` and `group`.
```
[error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:18: error: unknown tag: group
[error] /** group setParam */
[error] ^
[error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:8: error: unknown tag: tparam
[error] * tparam FeaturesType Type of input features. E.g., <code>Vector</code>
[error] ^
...
```
It does not fully resolve the problem but remove many errors. It seems both `group` and `tparam` are unrecognisable in javadoc. It seems we can't print them pretty in javadoc in a way of `example` here because they appear differently (both examples can be found in http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.ml.classification.Classifier).
- Print `example` in javadoc.
Currently, there are few `example` tag in several places.
```
./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala: * example This operation might be used to evaluate a graph
./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala: * example We might use this operation to change the vertex values
./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala: * example This function might be used to initialize edge
./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala: * example This function might be used to initialize edge
./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala: * example This function might be used to initialize edge
./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala: * example We can use this function to compute the in-degree of each
./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala: * example This function is used to update the vertices with new values based on external data.
./graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala: * example Loads a file in the following format:
./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala: * example This function is used to update the vertices with new
./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala: * example This function can be used to filter the graph based on some property, without
./graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala: * example We can use the Pregel abstraction to implement PageRank:
./graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala: * example Construct a `VertexRDD` from a plain RDD:
./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkCommandLine.scala: * example new SparkCommandLine(Nil).settings
./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala: * example addImports("org.apache.spark.SparkContext")
./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralGenerator.scala: * example {{{
```
**Before**
<img width="505" alt="2016-11-20 2 43 23" src="https://cloud.githubusercontent.com/assets/6477701/20457285/26f07e1c-aecb-11e6-9ae9-d9dee66845f4.png">
**After**
<img width="499" alt="2016-11-20 1 27 17" src="https://cloud.githubusercontent.com/assets/6477701/20457240/409124e4-aeca-11e6-9a91-0ba514148b52.png">
## How was this patch tested?
Maunally tested by `jekyll build` with Java 7 and 8
```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```
```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```
Note: this does not make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15939 from HyukjinKwon/SPARK-3359-javadoc.
## What changes were proposed in this pull request?
One of the important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport netty/netty#5825".
In 4.0.42.Final, `MessageWithHeader` can work properly when `spark.[shuffle|rpc].io.mode` is set to epoll
## How was this patch tested?
Existing tests
Author: Guoqiang Li <witgo@qq.com>
Closes#15830 from witgo/SPARK-18375_netty-4.0.42.
## What changes were proposed in this pull request?
Try excluding org.json:json from hive-exec dep as it's Cat X now. It may be the case that it's not used by the part of Hive Spark uses anyway.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#15798 from srowen/SPARK-18262.
## What changes were proposed in this pull request?
Adds a `snapshots-and-staging profile` so that RCs of projects like Hadoop and HBase can be used in developer-only build and test runs. There's a comment above the profile telling people not to use this in production.
There's no attempt to do the same for SBT, as Ivy is different.
## How was this patch tested?
Tested by building against the Hadoop 2.7.3 RC 1 JARs
without the profile (and without any local copy of the 2.7.3 artifacts), the build failed
```
mvn install -DskipTests -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.7.3
...
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Launcher 2.1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.pom
[WARNING] The POM for org.apache.hadoop:hadoop-client:jar:2.7.3 is missing, no dependency information available
Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 4.482 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 17.402 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 11.252 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 13.458 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 9.043 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 16.027 s]
[INFO] Spark Project Launcher ............................. FAILURE [ 1.653 s]
[INFO] Spark Project Core ................................. SKIPPED
...
```
With the profile, the build completed
```
mvn install -DskipTests -Pyarn,hadoop-2.7,hive,snapshots-and-staging -Dhadoop.version=2.7.3
```
Author: Steve Loughran <stevel@apache.org>
Closes#14646 from steveloughran/stevel/SPARK-17058-support-asf-snapshots.
## What changes were proposed in this pull request?
`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#15548 from ueshin/issues/SPARK-17985.
This reverts commit bfe7885aee.
The commit caused build failures on Hadoop 2.2 profile:
```
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils
[error] var numBytes = IOUtils.read(gzInputStream, buf)
[error] ^
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils
[error] numBytes = IOUtils.read(gzInputStream, buf)
[error] ^
```
## What changes were proposed in this pull request?
`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#15525 from ueshin/issues/SPARK-17985.
## What changes were proposed in this pull request?
This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.
It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
tdas did most of work and part of them was inspired by koeninger's work.
### Introduction
The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:
Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int
The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.
### Configuration
The user can use `DataStreamReader.option` to set the following configurations.
Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets
Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`
### Usage
* Subscribe to 1 topic
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1")
.load()
```
* Subscribe to multiple topics
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1,topic2")
.load()
```
* Subscribe to a pattern
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribePattern", "topic.*")
.load()
```
## How was this patch tested?
The new unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>
Closes#15102 from zsxwing/kafka-source.
This was missing, preventing code that uses javax.crypto to properly
compile in Spark.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#15204 from vanzin/SPARK-17639.
## What changes were proposed in this pull request?
Docker tests are using older version of jersey jars (1.19), which was used in older releases of spark. In 2.0 releases Spark was upgraded to use 2.x verison of Jersey. After upgrade to new versions, docker tests are failing with AbstractMethodError. Now that spark is upgraded to 2.x jersey version, using of shaded docker jars may not be required any more. Removed the exclusions/overrides of jersey related classes from pom file, and changed the docker-client to use regular jar instead of shaded one.
## How was this patch tested?
Tested using existing docker-integration-tests
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes#15114 from sureshthalamati/docker_testfix-spark-17473.
## What changes were proposed in this pull request?
This patch bumps the Hadoop version in hadoop-2.7 profile from 2.7.2 to 2.7.3, which was recently released and contained a number of bug fixes.
## How was this patch tested?
The change should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#15115 from rxin/SPARK-17558.
## What changes were proposed in this pull request?
Upgrade netty-all to latest in the 4.0.x line which is 4.0.41, mentions several bug fixes and performance improvements we may find useful, see netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html. Initially tried to use 4.1.5 but noticed it's not backwards compatible.
## How was this patch tested?
Existing unit tests against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z architectures
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#14961 from a-roberts/netty.
## What changes were proposed in this pull request?
This pull request adds the functionality to enable accessing worker and application UI through master UI itself. Thus helps in accessing SparkUI when running spark cluster in closed networks e.g. Kubernetes. Cluster admin needs to expose only spark master UI and rest of the UIs can be in the private network, master UI will reverse proxy the connection request to corresponding resource. It adds the path for workers/application UIs as
WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/
This makes it easy for users to easily protect the Spark master cluster access by putting some reverse proxy e.g. https://github.com/bitly/oauth2_proxy
## How was this patch tested?
The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address.
pwendell bomeng BryanCutler can you please review it, thanks.
Author: Gurvinder Singh <gurvinder.singh@uninett.no>
Closes#13950 from gurvindersingh/rproxy.
## What changes were proposed in this pull request?
Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)"
## How was this patch tested?
Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian)
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#14958 from a-roberts/master.
This patch is using Apache Commons Crypto library to enable shuffle encryption support.
Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>
Closes#8880 from winningsix/SPARK-10771.
## What changes were proposed in this pull request?
Move Mesos code into a mvn module
## How was this patch tested?
unit tests
manually submitting a client mode and cluster mode job
spark/mesos integration test suite
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14637 from mgummelt/mesos-module.
## What changes were proposed in this pull request?
As of Scala 2.11.x there is no longer a org.scala-lang:jline version aligned to the scala version itself. Scala console now uses the plain jline:jline module. Spark's dependency management did not reflect this change properly, causing Maven to pull in Jline via transitive dependency. Unfortunately Jline 2.12 contained a minor but very annoying bug rendering the shell almost useless for developers with german keyboard layout. This request contains the following chages:
- Exclude transitive dependency 'jline:jline' from hive-exec module
- Remove global properties 'jline.version' and 'jline.groupId'
- Add both properties and dependency to 'scala-2.11' profile
- Add explicit dependency on 'jline:jline' to module 'spark-repl'
## How was this patch tested?
- Running mvn dependency:tree and checking for correct Jline version 2.12.1
- Running full builds with assembly and checking for jline-2.12.1.jar in 'lib' folder of generated tarball
Author: Stefan Schulze <stefan.schulze@pentasys.de>
Closes#14429 from stsc-pentasys/SPARK-16770.
## What changes were proposed in this pull request?
New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)}
This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/
The benefit is losing the dependency on `dockerd`, and all the costs which it incurs.
I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs.
This is blocked on: https://github.com/apache/spark/pull/14167
## How was this patch tested?
- manually testing jobs submitted with both "mesos" and "docker" settings for the new config var.
- spark/mesos integration test suite
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#14275 from mgummelt/unified-containerizer.
## What changes were proposed in this pull request?
Version of derby upgraded based on important security info at VersionEye. Test scope added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1
The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs
## How was this patch tested?
Existing tests with the change making sure that we see no new failures. I checked derby 10.12.x and not derby 10.11.x is downloaded to our ~/.m2 folder.
I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present.
I don't know if this would also remove it from the assembly jar in our 1.x branches.
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#14379 from a-roberts/patch-4.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
(`spark.mesos.executor.docker.forcePullImage`). Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#14348 from philipphoffmann/force-pull-image.
## What changes were proposed in this pull request?
Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
`spark.mesos.executor.docker.forcePullImage`. Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).
## How was this patch tested?
I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set.
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#13051 from philipphoffmann/force-pull-image.
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
## How was this patch tested?
No new tests, should pass the existing ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#14150 from yanboliang/spark-16494.
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14130 from rxin/SPARK-16477.
## What changes were proposed in this pull request?
Fixed the maven build for #13983
## How was this patch tested?
The existing tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#14084 from zsxwing/fix-maven.
## What changes were proposed in this pull request?
New Kafka consumer api for the released 0.10 version of Kafka
## How was this patch tested?
Unit tests, manual tests
Author: cody koeninger <cody@koeninger.org>
Closes#11863 from koeninger/kafka-0.9.
## What changes were proposed in this pull request?
It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME` in the build: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/. It seems that passing `-javabootclasspath` to ScalaDoc using scala-maven-plugin ends up preventing the Scala library classes from being added to scalac's internal class path, causing compilation errors while building doc-jars.
There might be a principled fix to this inside of the scala-maven-plugin itself, but for now this patch configures the build to omit the `-javabootclasspath` option during Maven doc-jar generation.
## How was this patch tested?
Tested manually with `build/mvn clean install -DskipTests=true` when `JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify that the final POM changes were scoped correctly: https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653
/cc vanzin and yhuai for review.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#13573 from JoshRosen/SPARK-15839.
## What changes were proposed in this pull request?
Updating the Hadoop version from 2.7.0 to 2.7.2 if we use the Hadoop-2.7 build profile
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Existing tests
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating Hadoop 2.7.0 is not ready for production use
https://hadoop.apache.org/docs/r2.7.0/ states
"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should use 2.7.1 release and beyond."
Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x."
And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1."
I've tested this is OK with Intel hardware and IBM Java 8 so let's test it with OpenJDK, ideally this will be pushed to branch-2.0 and master.
Author: Adam Roberts <aroberts@uk.ibm.com>
Closes#13556 from a-roberts/patch-2.
## What changes were proposed in this pull request?
Change the way spark picks up version information. Also embed the build information to better identify the spark version running.
More context can be found here : https://github.com/apache/spark/pull/12152
## How was this patch tested?
Ran the mvn and sbt builds to verify the version information was being displayed correctly on executing <code>spark-submit --version </code>
![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#13061 from dhruve/impr/SPARK-14279.
This helps with preventing jdk8-specific calls being checked in,
because PR builders are running the compiler with the wrong settings.
If the JAVA_7_HOME env variable is set, assume it points at
a jdk7 and use its rt.jar when invoking javac. For zinc, just run
it with jdk7, and disable it when building jdk8-specific code.
A big note for sbt usage: adding the bootstrap options forces sbt
to fork the compiler, and that disables incremental compilation.
That means that it's really not convenient to use for normal
development, but should be ok for automated builds.
Tested with JAVA_HOME=jdk8 and JAVA_7_HOME=jdk7:
- mvn + zinc
- mvn sans zinc
- sbt
Verified that in all cases, jdk8-specific library calls fail to
compile.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13272 from vanzin/SPARK-15451.
## What changes were proposed in this pull request?
This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.
## How was this patch tested?
Jenkins unit tests, integration tests, manual tests)
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13417 from zsxwing/revert-SPARK-11753.
## What changes were proposed in this pull request?
This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.
## How was this patch tested?
This uses the existing Parquet tests.
Author: Ryan Blue <blue@apache.org>
Closes#13280 from rdblue/SPARK-9876-update-parquet.
## What changes were proposed in this pull request?
The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bin-tray repo).
This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.
## How was this patch tested?
Manually running SBT/Maven builds.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#13299 from hvanhovell/SPARK-15525.
## What changes were proposed in this pull request?
Jackson suppprts `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". Currently used Jackson version (2.5.3) doesn't support it all. This patch upgrades the library and make the two ignored tests in `JsonParsingOptionsSuite` passed.
## How was this patch tested?
`JsonParsingOptionsSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9759 from viirya/fix-json-nonnumric.
## What changes were proposed in this pull request?
I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class.
## How was this patch tested?
Tests were moved.
Author: Reynold Xin <rxin@databricks.com>
Closes#13207 from rxin/SPARK-15424.
## What changes were proposed in this pull request?
(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)
Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#13074 from srowen/SPARK-15290.
## What changes were proposed in this pull request?
This is sort of a hot-fix for https://github.com/apache/spark/pull/13117, but, the problem is limited to Hadoop 2.2. The change is to manage `commons-io` to 2.4 for all Hadoop builds, which is only a net change for Hadoop 2.2, which was using 2.1.
## How was this patch tested?
Jenkins tests -- normal PR builder, then the `[test-hadoop2.2] [test-maven]` if successful.
Author: Sean Owen <sowen@cloudera.com>
Closes#13132 from srowen/SPARK-12972.3.
## What changes were proposed in this pull request?
(Retry of https://github.com/apache/spark/pull/13049)
- update to httpclient 4.5 / httpcore 4.4
- remove some defunct exclusions
- manage httpmime version to match
- update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)
## How was this patch tested?
Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`
Author: Sean Owen <sowen@cloudera.com>
Closes#13117 from srowen/SPARK-12972.2.
## What changes were proposed in this pull request?
- update httpcore/httpclient to latest
- centralize version management
- remove excludes that are no longer relevant according to SBT/Maven dep graphs
- also manage httpmime to match httpclient
## How was this patch tested?
Jenkins tests, plus review of dependency graphs from SBT/Maven, and review of test-dependencies.sh output
Author: Sean Owen <sowen@cloudera.com>
Closes#13049 from srowen/SPARK-12972.
## What changes were proposed in this pull request?
Since Jetty 8 is EOL (end of life) and has critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], I think upgrading to 9 is necessary. I am using latest 9.2 since 9.3 requires Java 8+.
`javax.servlet` and `derby` were also upgraded since Jetty 9.2 needs corresponding version.
## How was this patch tested?
Manual test and current test cases should cover it.
Author: bomeng <bmeng@us.ibm.com>
Closes#12916 from bomeng/SPARK-14897.
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions
## How was this patch tested?
Unit tests
Author: cody koeninger <cody@koeninger.org>
Closes#12946 from koeninger/SPARK-15085.
## What changes were proposed in this pull request?
Create a maven profile for executing the docker integration tests using maven
Remove docker integration tests from main sbt build
Update documentation on how to run docker integration tests from sbt
## How was this patch tested?
Manual test of the docker integration tests as in :
mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test
## Other comments
Note that the the DB2 Docker Tests are still disabled as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them on the right level before enabling those tests. They do run ok locally with the updates from PR #12348
Author: Luciano Resende <lresende@apache.org>
Closes#12508 from lresende/docker.
## What changes were proposed in this pull request?
Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.
## How was this patch tested?
I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.
Author: mcheah <mcheah@palantir.com>
Closes#12715 from mccheah/feature/upgrade-jersey.
## What changes were proposed in this pull request?
We had the issue when using snowplow in our Spark applications. Snowplow requires json4s version 3.2.11 while Spark still use a few years old version 3.2.10. The change is to upgrade json4s jar to 3.2.11.
## How was this patch tested?
We built Spark jar and successfully ran our applications in local and cluster modes.
Author: Lining Sun <lining@gmail.com>
Closes#12901 from liningalex/master.
## What changes were proposed in this pull request?
Update Jetty 8.1 to the latest 2016/02 release, from a 2013/10 release, for security and bug fixes. This does not resolve the JIRA necessarily, as it's still worth considering an update to 9.3.
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#12842 from srowen/SPARK-14897.
## What changes were proposed in this pull request?
This PR copy the thrift-server from hive-service-1.2 (including TCLIService.thrift and generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12764 from davies/thrift_server.
## What changes were proposed in this pull request?
This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.
## How was this patch tested?
Scala-style checks passed.
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes#12416 from pravingadakh/SPARK-14613.
Spark's published POMs are supposed to be flattened and not contain variable substitution (see SPARK-3812), but the dummy dependency that was required for this was accidentally removed. We should re-introduce this dependency in order to fix an issue where the un-flattened POMs cause the wrong dependencies to be included in Scala 2.10 published POMs.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12706 from JoshRosen/SPARK-14925-published-poms-should-be-flattened.
First, make all dependencies in the examples module provided, and explicitly
list a couple of ones that somehow are promoted to compile by maven. This
means that to run streaming examples, the streaming connector package needs
to be provided to run-examples using --packages or --jars, just like regular
apps.
Also, remove a couple of outdated examples. HBase has had Spark bindings for
a while and is even including them in the HBase distribution in the next
version, making the examples obsolete. The same applies to Cassandra, which
seems to have a proper Spark binding library already.
I just tested the build, which passes, and ran SparkPi. The examples jars
directory now has only two jars:
```
$ ls -1 examples/target/scala-2.11/jars/
scopt_2.11-3.3.0.jar
spark-examples_2.11-2.0.0-SNAPSHOT.jar
```
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#12544 from vanzin/SPARK-14744.
## What changes were proposed in this pull request?
This PR creates a compatibility module in sql (called `hive-1-x-compatibility`), which will host HiveContext in Spark 2.0 (moving HiveContext to here will be done separately). This module is not included in assembly because only users who still want to access HiveContext need it.
## How was this patch tested?
I manually tested `sbt/sbt -Phive package` and `mvn -Phive package -DskipTests`.
Author: Yin Huai <yhuai@databricks.com>
Closes#12580 from yhuai/compatibility.
Spark SQL's POM hardcodes a dependency on `spark-sketch_2.11`, which causes Scala 2.10 builds to include the `_2.11` dependency. This is harmless since `spark-sketch` is a pure-Java module (see #12334 for a discussion of dropping the Scala version suffixes from these modules' artifactIds), but it's confusing to people looking at the published POMs.
This patch fixes this by using `${scala.binary.version}` to substitute the correct suffix, and also adds a set of Maven Enforcer rules to ensure that `_2.11` artifacts are not used in 2.10 builds (and vice-versa).
/cc ahirreddy, who spotted this issue.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12563 from JoshRosen/fix-sketch-scala-version.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14787
The possible problems are described in the JIRA above. Please refer this if you are wondering the purpose of this PR.
This PR upgrades Joda-Time library from 2.9 to 2.9.3.
## How was this patch tested?
`sbt scalastyle` and Jenkins tests in this PR.
closes#11847
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12552 from HyukjinKwon/SPARK-14787.
## What changes were proposed in this pull request?
Move json4s, breeze dependency declaration into parent
## How was this patch tested?
Should be no functional change, but Jenkins tests will test that.
Author: Sean Owen <sowen@cloudera.com>
Closes#12390 from srowen/SPARK-14612.
Add integration tests based on docker to test DB2 JDBC dialect support
Author: Luciano Resende <lresende@apache.org>
Closes#9893 from lresende/SPARK-10521.
## What changes were proposed in this pull request?
In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies.
The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.
Thanks.
## How was this patch tested?
Unit tests
mengxr tedyu holdenk
Author: DB Tsai <dbt@netflix.com>
Closes#12298 from dbtsai/dbtsai-mllib-local-build-fix.
## What changes were proposed in this pull request?
Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
```xml
-<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
+<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
```
## How was this patch tested?
After passing the Jenkins build and manually `dev/lint-java`. (Note that Jenkins does not run `lint-java`)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12242 from dongjoon-hyun/SPARK-14465.
## What changes were proposed in this pull request?
In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. Couple platform independent classes will be moved to this package to demonstrate how this work.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Closes#12241 from dbtsai/dbtsai-mllib-local-build.
This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12076 from JoshRosen/kryo3.
## What changes were proposed in this pull request?
This splits commons.httpclient.version from commons.httpcore.version, since these two versions do not necessarily have to be the same. This change may follow up with an up-to-date version of the httpclient/httpcore libraries.
The latest 4.3.x httpclient version as of writing is 4.3.6 and the latest 4.3.x httpcore version as of writing is 4.3.3. This change would be a prerequisite for potentially moving to this new bugfix version.
## How was this patch tested?
no version change was made for httpclient/httpcore versions
mvn package
Author: Aaron Tokhy <tokaaron@amazon.com>
Closes#12245 from atokhy/pull-request.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.
Most changes are just noise to fix the logging configs.
For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11941 from vanzin/SPARK-14134.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).
I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).
Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11796 from vanzin/SPARK-13579.
## What changes were proposed in this pull request?
Upgrade to 2.11.8 (from the current 2.11.7)
## How was this patch tested?
A manual build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#11681 from jaceklaskowski/SPARK-13825-scala-2_11_8.
## What changes were proposed in this pull request?
Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.
## How was this patch tested?
Tested by running a job on the cluster and saw 7.5% cpu savings after this change.
Author: Sital Kedia <skedia@fb.com>
Closes#12096 from sitalkedia/snappyRelease.
This patch fixes a compilation / build break in Spark's `java8-tests` and refactors their POM to simplify the build. See individual commit messages for more details.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12073 from JoshRosen/fix-java8-tests.
## What changes were proposed in this pull request?
Exclude jline from curator-recipes since it conflicts with scala 2.11 when running spark-shell. Should not affect scala 2.10 since it is builtin.
## How was this patch tested?
Ran spark-shell manually.
Author: Michel Lemay <mlemay@gmail.com>
Closes#12043 from michellemay/spark-13710-fix-jline-on-windows.
### What changes were proposed in this pull request?
This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.
### How was this patch tested?
Existing unit tests.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12071 from hvanhovell/SPARK-14211.
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs.
This PR is a work in progress, and work needs to be done in the following area's:
- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.
### How was this patch tested?
Catalyst and SQL unit tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11557 from hvanhovell/ngParser.
## What changes were proposed in this pull request?
This PR moves flume back to Spark as per the discussion in the dev mail-list.
## How was this patch tested?
Existing Jenkins tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11895 from zsxwing/move-flume-back.
As part of the goal to stop creating assemblies in Spark, this change
modifies the mvn and sbt builds to not create an assembly for examples.
Instead, dependencies are copied to the build directory (under
target/scala-xx/jars), and in the final archive, into the "examples/jars"
directory.
To avoid having to deal too much with Windows batch files, I made examples
run through the launcher library; the spark-submit launcher now has a
special mode to run examples, which adds all the necessary jars to the
spark-submit command line, and replaces the bash and batch scripts that
were used to run examples. The scripts are now just a thin wrapper around
spark-submit; another advantage is that now all spark-submit options are
supported.
There are a few glitches; in the mvn build, a lot of duplicated dependencies
get copied, because they are promoted to "compile" scope due to extra
dependencies in the examples module (such as HBase). In the sbt build,
all dependencies are copied, because there doesn't seem to be an easy
way to filter things.
I plan to clean some of this up when the rest of the tasks are finished.
When the main assembly is replaced with jars, we can remove duplicate jars
from the examples directory during packaging.
Tested by running SparkPi in: maven build, sbt build, dist created by
make-distribution.sh.
Finally: note that running the "assembly" target in sbt doesn't build
the examples anymore. You need to run "package" for that.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11452 from vanzin/SPARK-13576.
## What changes were proposed in this pull request?
Currently there are a few sub-projects, each for integrating with different external sources for Streaming. Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages
- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter
They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster.
I have already copied these projects to https://github.com/spark-packages
## How was this patch tested?
Jenkins tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11672 from zsxwing/remove-external-pkg.
## What changes were proposed in this pull request?
Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
Supersedes https://github.com/apache/spark/pull/11524
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11631 from srowen/SPARK-13663.
## What changes were proposed in this pull request?
Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
## How was this patch tested?
This is tested with Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11523 from srowen/SPARK-13595.
## What changes were proposed in this pull request?
Remove last usage of jblas, in tests
## How was this patch tested?
Jenkins tests -- the same ones that are being modified.
Author: Sean Owen <sowen@cloudera.com>
Closes#11560 from srowen/SPARK-13715.
## What changes were proposed in this pull request?
This PR fixes `dev/lint-java` and `mvn checkstyle:check` failures due the recent file location change.
The following is the error message of current master.
```
Checkstyle checks failed at following occurrences:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (default-cli) on project spark-parent_2.11: Failed during checkstyle configuration: cannot initialize module SuppressionFilter - Cannot set property 'file' to 'checkstyle-suppressions.xml' in module SuppressionFilter: InvocationTargetException: Unable to find: checkstyle-suppressions.xml -> [Help 1]
```
## How was this patch tested?
Manual. The following command should run correctly.
```
./dev/lint-java
mvn checkstyle:check
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11567 from dongjoon-hyun/hotfix_checkstyle_suppression.
## What changes were proposed in this pull request?
Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`.
I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope.
## How was this patch tested?
`./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest.
Author: Sean Owen <sowen@cloudera.com>
Closes#11522 from srowen/SPARK-13596.
## What changes were proposed in this pull request?
Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR.
This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.
## How was this patch tested?
1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver -Dverbose > target/dependencies.txt`
1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
1. Patch applied
1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
1. Examined created spark-assembly, verified no org.codehaus packages
1. Verified that the maven dependency tree no longer references groovy
Note also that the size of the assembly JAR was 181628646 bytes before this patch, 166318515 after —15MB smaller. That's a good metric of things being excluded
Author: Steve Loughran <stevel@hortonworks.com>
Closes#11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
## What changes were proposed in this pull request?
This patch moves tags and unsafe modules into common directory to remove 2 top level non-user-facing directories.
## How was this patch tested?
Jenkins should suffice.
Author: Reynold Xin <rxin@databricks.com>
Closes#11426 from rxin/SPARK-13548.
## What changes were proposed in this pull request?
As the title says, this moves the three modules currently in network/ into common/network-*. This removes one top level, non-user-facing folder.
## How was this patch tested?
Compilation and existing tests. We should run both SBT and Maven.
Author: Reynold Xin <rxin@databricks.com>
Closes#11409 from rxin/SPARK-13529.
It registers more Scala classes, including ListBuffer to support Kryo with FPGrowth.
See https://github.com/twitter/chill/releases for Chill's change log.
Author: mark800 <yky800@126.com>
Closes#11041 from mark800/master.
Phase 1: update plugin versions, test dependencies, some example and third-party versions
Author: Sean Owen <sowen@cloudera.com>
Closes#11206 from srowen/SPARK-13324.
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10608 from JoshRosen/SPARK-6363.
This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1].
As required by the [design doc][2], spark-sketch should have no external dependency.
Two classes, `Murmur3_x86_32` and `Platform` are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation.
The following features will be added in future follow-up PRs:
- Serialization support
- DataFrame API integration
[1]: aac6b4d23a/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
[2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf
Author: Cheng Lian <lian@databricks.com>
Closes#10851 from liancheng/count-min-sketch.
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10854 from zsxwing/remove-akka.
Include the following changes:
1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
3. Update the ActorWordCount example and add the JavaActorWordCount example
4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10744 from zsxwing/streaming-akka-2.
This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
/cc rxin srowen
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10775 from JoshRosen/add-hadoop-2.7-profile.
The current Spark Streaming kinesis connector references a quite old version 1.9.40 of the AWS Java SDK (1.10.40 is current). Numerous AWS features including Kinesis Firehose are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively) such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL.
Author: BrianLondon <brian@seatgeek.com>
Closes#10256 from BrianLondon/master.
This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's not a direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact.
While the diff here looks somewhat large, note that this is only a revert of a some of the changes from #10672. You can see the net changes in pom.xml at 3119206b71...5211ab8 (diff-600376dffeb79835ede4a0b285078036)
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10693 from JoshRosen/netty-hotfix.
Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse).
This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds.
/cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath.
Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10672 from JoshRosen/enforce-netty-exclusions.
This patch removes all non-Maven-central repositories from Spark's build, thereby avoiding any risk of future build-breaks due to us accidentally depending on an artifact which is not present in an immutable public Maven repository.
I tested this by running
```
build/mvn \
-Phive \
-Phive-thriftserver \
-Pkinesis-asl \
-Pspark-ganglia-lgpl \
-Pyarn \
dependency:go-offline
```
inside of a fresh Ubuntu Docker container with no Ivy or Maven caches (I did a similar test for SBT).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10659 from JoshRosen/SPARK-4628.
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)
See also https://github.com/apache/spark/pull/10512
Author: Sean Owen <sowen@cloudera.com>
Closes#10513 from srowen/SPARK-4819.
This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made:
The ANTLR Parser & Supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project, I have added aknowledgements whenever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean-up the ```ASTNode``` class, and to improve the error handling.
The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
- ```CatalystQl```: This implements Query and Expression parsing functionality.
- ```SparkQl```: This is a subclass of CatalystQL and provides SQL/Core only functionality such as Explain and Describe.
- ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.
cc rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#10583 from hvanhovell/SPARK-12575.
Successfully ran kinesis demo on a live, aws hosted kinesis stream against master and 1.6 branches. For reasons I don't entirely understand it required a manual merge to 1.5 which I did as shown here: 075c22e89b
The demo ran successfully on the 1.5 branch as well.
According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the kinesis regression in 1.5.2.
Author: BrianLondon <brian@seatgeek.com>
Closes#10492 from BrianLondon/remove-only.
This PR inlines the Hive SQL parser in Spark SQL.
The previous (merged) incarnation of this PR passed all tests, but had and still has problems with the build. These problems are caused by a the fact that - for some reason - in some cases the ANTLR generated code is not included in the compilation fase.
This PR is a WIP and should not be merged until we have sorted out the build issues.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Closes#10525 from hvanhovell/SPARK-12362.
This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether pull request which modify the build have introduced new dependencies or changed transitive dependencies in a way that affects the final classpath.
This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs.
This patch is based on pwendell's work in #8531.
Closes#8531.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>
Closes#10461 from JoshRosen/SPARK-10359.
This is a WIP. The PR has been taken over from nongli (see https://github.com/apache/spark/pull/10420). I have removed some additional dead code, and fixed a few issues which were caused by the fact that the inlined Hive parser is newer than the Hive parser we currently use in Spark.
I am submitting this PR in order to get some feedback and testing done. There is quite a bit of work to do:
- [ ] Get it to pass jenkins build/test.
- [ ] Aknowledge Hive-project for using their parser.
- [ ] Refactorings between HiveQl and the java classes.
- [ ] Create our own ASTNode and integrate the current implicit extentions.
- [ ] Move remaining ```SemanticAnalyzer``` and ```ParseUtils``` functionality to ```HiveQl```.
- [ ] Removing Hive dependencies from the parser. This will require some edits in the grammar files.
- [ ] Introduce our own context which needs to contain a ```TokenRewriteStream```.
- [ ] Add ```useSQL11ReservedKeywordsForIdentifier``` and ```allowQuotedId``` to the catalyst or sql configuration.
- [ ] Remove ```HiveConf``` from grammar files &HiveQl, and pass in our own configuration.
- [ ] Moving the parser into sql/core.
cc nongli rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Closes#10509 from hvanhovell/SPARK-12362.
This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build.
Author: Mark Grover <mgrover@cloudera.com>
Closes#9876 from markgrover/master_docker.
Fix commons-collection group ID to commons-collections for version 3.x
Patches earlier PR at https://github.com/apache/spark/pull/9731
Author: Sean Owen <sowen@cloudera.com>
Closes#10198 from srowen/SPARK-11652.2.
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).
I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10112 from JoshRosen/upgrade-to-sbt-0.13.9.
This replaces https://github.com/apache/spark/pull/9696
Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.
Suggest fixing those TODOs in a separate PR(s).
More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles):
> Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
Also fix some of the minor violations that didn't require sweeping changes.
Apologies for the previous botched PRs - I finally figured out the issue.
cr: JoshRosen, pwendell
> I state that the contribution is my original work, and I license the work to the project under the project's open source license.
Author: Dmitry Erastov <derastov@gmail.com>
Closes#9867 from dskrvk/master.
This patch removes `spark.driver.allowMultipleContexts=true` from our test configuration. The multiple SparkContexts check was originally disabled because certain tests suites in SQL needed to create multiple contexts. As far as I know, this configuration change is no longer necessary, so we should remove it in order to make it easier to find test cleanup bugs.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9865 from JoshRosen/SPARK-4424.
Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability
Author: Sean Owen <sowen@cloudera.com>
Closes#9731 from srowen/SPARK-11652.
This PR upgrade the version of RoaringBitmap to 0.5.10, to optimize the memory layout, will be much smaller when most of blocks are empty.
This PR is based on #9661 (fix conflicts), see all of the comments at https://github.com/apache/spark/pull/9661 .
Author: Kent Yao <yaooqinn@hotmail.com>
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>
Closes#9746 from davies/roaring_mapstatus.
This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8.
In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml.
http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes.
I also added a new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code which was compiled targeting Java 8.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9512 from JoshRosen/SPARK-6152.
This patch re-enables tests for the Docker JDBC data source. These tests were reverted in #4872 due to transitive dependency conflicts introduced by the `docker-client` library. This patch should avoid those problems by using a version of `docker-client` which shades its transitive dependencies and by performing some build-magic to work around problems with that shaded JAR.
In addition, I significantly refactored the tests to simplify the setup and teardown code and to fix several Docker networking issues which caused problems when running in `boot2docker`.
Closes#8101.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#9503 from JoshRosen/docker-jdbc-tests.
While the KCL handles de-aggregation during the regular operation, during recovery we use the lower level api, and therefore need to de-aggregate the records.
tdas Testing is an issue, we need protobuf magic to do the aggregated records. Maybe we could depend on KPL for tests?
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#9403 from brkyvz/kinesis-deaggregation.
Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9450 from JoshRosen/upgrade-to-scala-2.10.5.
This is an updated version of #8995 by a-roberts. Original description follows:
Snappy now supports concatenation of serialized streams, this patch contains a version number change and the "does not support" test is now a "supports" test.
Snappy 1.1.2 changelog mentions:
> snappy-java-1.1.2 (22 September 2015)
> This is a backward compatible release for 1.1.x.
> Add AIX (32-bit) support.
> There is no upgrade for the native libraries of the other platforms.
> A major change since 1.1.1 is a support for reading concatenated results of SnappyOutputStream(s)
> snappy-java-1.1.2-RC2 (18 May 2015)
> Fix#107: SnappyOutputStream.close() is not idempotent
> snappy-java-1.1.2-RC1 (13 May 2015)
> SnappyInputStream now supports reading concatenated compressed results of SnappyOutputStream
> There has been no compressed format change since 1.0.5.x. So You can read the compressed results > interchangeablly between these versions.
> Fixes a problem when java.io.tmpdir does not exist.
Closes#8995.
Author: Adam Roberts <aroberts@uk.ibm.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9439 from JoshRosen/update-snappy.
It's a known issue that joda-time before 2.8.1 is incompatible with java 1.8u60 or later, which causes s3 request to fail. This affects Spark when using s3 as data source.
https://github.com/aws/aws-sdk-java/issues/444
Author: Yongjia Wang <yongjiaw@gmail.com>
Closes#9379 from yongjiaw/SPARK-11413.
JIRA: https://issues.apache.org/jira/browse/SPARK-11271
As reported in the JIRA ticket, when there are too many tasks, the memory usage of MapStatus will cause problem. Use BitSet instead of RoaringBitMap should be more efficient in memory usage.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9243 from viirya/mapstatus-bitset.
AWS SDK 1.9.40 is the latest 1.9.x release. KCL 1.5.1 is the latest release that using AWS SDK 1.9.x. The main goal is to have Kinesis consumer be able to read messages generated from Kinesis Producer Library (KPL). The API should be compatible with old versions.
tdas brkyvz
Author: Xiangrui Meng <meng@databricks.com>
Closes#9153 from mengxr/SPARK-11127.
This change does two things:
- tag a few tests and adds the mechanism in the build to be able to disable those tags,
both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
changed; that's used to disable expensive tests when a module hasn't explicitly
been changed, to speed up testing for changes that don't directly affect those
modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8437 from vanzin/test-tags.
This change aims at speeding up the dev cycle a little bit, by making
sure that all tests behave the same w.r.t. where the code to be tested
is loaded from. Namely, that means that tests don't rely on the assembly
anymore, rather loading all needed classes from the build directories.
The main change is to make sure all build directories (classes and test-classes)
are added to the classpath of child processes when running tests.
YarnClusterSuite required some custom code since the executors are run
differently (i.e. not through the launcher library, like standalone and
Mesos do).
I also found a couple of tests that could leak a SparkContext on failure,
and added code to handle those.
With this patch, it's possible to run the following command from a clean
source directory and have all tests pass:
mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7629 from vanzin/SPARK-9284.
Follow up to https://github.com/apache/spark/pull/7047
pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence the action seems to be to remove the profiles, which are now not used.
CC trystanleftwich
Author: Sean Owen <sowen@cloudera.com>
Closes#8338 from srowen/SPARK-6196.
https://issues.apache.org/jira/browse/SPARK-9439
In general, Yarn apps should be robust to NodeManager restarts. However, if you run spark with the external shuffle service on, after a NM restart all shuffles fail, b/c the shuffle service has lost some state with info on each executor. (Note the shuffle data is perfectly fine on disk across a NM restart, the problem is we've lost the small bit of state that lets us *find* those files.)
The solution proposed here is that the external shuffle service can write out its state to leveldb (backed by a local file) every time an executor is added. When running with yarn, that file is in the NM's local dir. Whenever the service is started, it looks for that file, and if it exists, it reads the file and re-registers all executors there.
Nothing is changed in non-yarn modes with this patch. The service is not given a place to save the state to, so it operates the same as before. This should make it easy to update other cluster managers as well, by just supplying the right file & the equivalent of yarn's `initializeApplication` -- I'm not familiar enough with those modes to know how to do that.
Author: Imran Rashid <irashid@cloudera.com>
Closes#7943 from squito/leveldb_external_shuffle_service_NM_restart and squashes the following commits:
0d285d3 [Imran Rashid] review feedback
70951d6 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart
5c71c8c [Imran Rashid] save executor to db before registering; style
2499c8c [Imran Rashid] explicit dependency on jackson-annotations
795d28f [Imran Rashid] review feedback
81f80e2 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart
594d520 [Imran Rashid] use json to serialize application executor info
1a7980b [Imran Rashid] version
8267d2a [Imran Rashid] style
e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little
9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files
acedb62 [Imran Rashid] switch to writing out one record per executor
79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup
12b6a35 [Imran Rashid] save registered executors when apps are removed; add tests
c878fbe [Imran Rashid] better explanation of shuffle service port handling
694934c [Imran Rashid] only open leveldb connection once per service
d596410 [Imran Rashid] store executor data in leveldb
59800b7 [Imran Rashid] Files.move in case renaming is unsupported
32fe5ae [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart
d7450f0 [Imran Rashid] style
f729e2b [Imran Rashid] debugging
4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks
0a39b98 [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart
55f49fc [Imran Rashid] make sure the service doesnt die if the registered executor file is corrupt; add tests
245db19 [Imran Rashid] style
62586a6 [Imran Rashid] just serialize the whole executors map
bdbbf0d [Imran Rashid] comments, remove some unnecessary changes
857331a [Imran Rashid] better tests & comments
bb9d1e6 [Imran Rashid] formatting
bdc4b32 [Imran Rashid] rename
86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port
23994ff [Imran Rashid] style
7504de8 [Imran Rashid] style
a36729c [Imran Rashid] cleanup
efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during NM restart
dd93dc0 [Imran Rashid] test for shuffle service w/ NM restarts
d596969 [Imran Rashid] cleanup imports
0e9d69b [Imran Rashid] better names
9eae119 [Imran Rashid] cleanup lots of duplication
1136f44 [Imran Rashid] test needs to have an actual shuffle
0b588bd [Imran Rashid] more fixes ...
ad122ef [Imran Rashid] more fixes
5e5a7c3 [Imran Rashid] fix build
c69f46b [Imran Rashid] maybe working version, needs tests & cleanup ...
bb3ba49 [Imran Rashid] minor cleanup
36127d3 [Imran Rashid] wip
b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests
Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars.
Author: zsxwing <zsxwing@gmail.com>
Closes#8069 from zsxwing/SPARK-9574.
PR #7967 enables Spark SQL to persist Parquet tables in Hive compatible format when possible. One of the consequence is that, we have to set input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1.
When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", and is not packaged into Spark assembly jar. This results in a `ClassNotFoundException`.
This issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering Maven build is immune to this problem, I feel it can be confusing and inconvenient for users.
So this PR fixes this issue by changing scope of parquet-hadoop 1.6.0 to "compile".
Author: Cheng Lian <lian@databricks.com>
Closes#8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
The REST server is not actually used in most tests and so we can disable it. It is a source of flakiness because it tries to bind to a specific port in vain. There was also some code that avoided the shuffle service in tests. This is actually not necessary because the shuffle service is already off by default.
Author: Andrew Or <andrew@databricks.com>
Closes#8084 from andrewor14/fix-master-suite-again.
This PR is based on #4229, thanks prabeesh.
Closes#4229
Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>
Closes#7833 from zsxwing/pr4229 and squashes the following commits:
9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
This PR removes the dependency reduced POM hack brought back by #7191
Author: tedyu <yuzhihong@gmail.com>
Closes#7919 from tedyu/master and squashes the following commits:
1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
I'll explain several of the changes inline in comments.
Author: Sean Owen <sowen@cloudera.com>
Closes#7862 from srowen/SPARK-9534 and squashes the following commits:
ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork.
Tests not run yet: that's what the machines are for
Author: Steve Loughran <stevel@hortonworks.com>
Author: Cheng Lian <lian@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>
Closes#7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:
7556d85 [Cheng Lian] Updates .q files and corresponding golden files
ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
6a92bb0 [Cheng Lian] Overrides HiveConf time vars
dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down
cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically.
6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
da310dc [Michael Armbrust] Fixes for Hive tests.
a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
7404f34 [Patrick Wendell] Add spark-hive staging repo
832c164 [Steve Loughran] SPARK-8064 try to supress compiler warnings on Complex.java pasted-thrift-code
312c0d4 [Steve Loughran] SPARK-8064 maven/ivy dependency purge; calcite declaration needed
fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand"
c188048 [Steve Loughran] SPARK-8064 manage the Hive depencencies to that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
314eb3c [Steve Loughran] SPARK-8064 deprecation warning noise in one of the tests
17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides
fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
dc73ece [Steve Loughran] SPARK-8064 revert to master's determinstic pushdown strategy
d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
e6121e5 [Steve Loughran] SPARK-8064 address review comments
aa43dc6 [Steve Loughran] SPARK-8064 more robust teardown on JavaMetastoreDatasourcesSuite
f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised.
335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log.
3ed872f [Steve Loughran] SPARK-8064 rename field double to dbl
bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex
bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded
c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6
0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread
13abaf1 [Steve Loughran] SPARK-8064 Hive compatibilty tests sin sync with explain/show output from Hive 1.2.1
d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops
26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT
3d64523 [Steve Loughran] SPARK-8064 improve diagns on uknown token; fix scalastyle failure
d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1
1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text
8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions
dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause.
463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output
2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec
1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec
75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port"
3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants
c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression?
27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings
00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now)
cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite
f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package
6c310b4 [Steve Loughran] SPARK-8064 subclass Hive ServerOptionsProcessor to make it public again
f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere
4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1
Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
Author: Sean Owen <sowen@cloudera.com>
Closes#7852 from srowen/SPARK-9521 and squashes the following commits:
3093039 [Sean Owen] Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here
See https://issues.apache.org/jira/browse/SPARK-8819
I verified that `mvn clean package -DskipTests` works with Maven 3.3.3.
pwendell are you up for trying this for the 1.5.0 release?
Author: Sean Owen <sowen@cloudera.com>
Closes#7826 from srowen/SPARK-9507 and squashes the following commits:
e0b0fd2 [Sean Owen] Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here
This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
Author: zsxwing <zsxwing@gmail.com>
Closes#6955 from zsxwing/kinesis-python and squashes the following commits:
e42e471 [zsxwing] Merge branch 'master' into kinesis-python
455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
5082d28 [zsxwing] Fix the syntax error for Python 2.6
fca416b [zsxwing] Fix wrong comparison
96670ff [zsxwing] Fix the compilation error after merging master
756a128 [zsxwing] Merge branch 'master' into kinesis-python
6c37395 [zsxwing] Print stack trace for debug
7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
cc9d071 [zsxwing] Fix the python test errors
466b425 [zsxwing] Add python tests for Kinesis
e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
3da2601 [zsxwing] Fix the kinesis folder
687446b [zsxwing] Fix the error message and the maven output path
add2beb [zsxwing] Merge branch 'master' into kinesis-python
4957c0b [zsxwing] Add the Python API for Kinesis
These commits address a few minor issues in the Scala cross-version support in the build:
1. Correct two missing `${scala.binary.version}` pom file substitutions.
2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.
This is my original work and I license this work to the Spark project under the Apache License.
Author: Michael Allman <michael@videoamp.com>
Closes#6832 from mallman/scala-versions and squashes the following commits:
cde2f17 [Michael Allman] Delete dev/change-version-to-*.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument
02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed
ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version
475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}
We are running Spark 1.4.0 in production and ran into problems because after a network hiccup (which happens often in our current environment) no more metrics were reported to graphite leaving us blindfolded about the current state of our spark applications. [This problem](70559816f1) was fixed in the current version of the metrics library. We run spark with this change in production now and have seen no problems. We also had a look at the commit history since 3.1.0 and did not detect any potentially incompatible changes but many fixes which could potentially help other users as well.
Author: Carl Anders Düvel <c.a.duevel@gmail.com>
Closes#7493 from hackbert/bump-metrics-lib-version and squashes the following commits:
6677565 [Carl Anders Düvel] [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 in order to get this fix 70559816f1
Cleanup maven for a clean import in scala-ide / eclipse.
* remove groovy plugin which is really not needed at all
* add-source from build-helper-maven-plugin is not needed as recent version of scala-maven-plugin do it automatically
* add lifecycle-mapping plugin to hide a few useless warnings from ide
Author: Jan Prach <jendap@gmail.com>
Closes#7375 from jendap/clean-project-import-in-scala-ide and squashes the following commits:
c4b4c0f [Jan Prach] fix whitespaces
5a83e07 [Jan Prach] Revert "remove java compiler warnings from java tests"
312007e [Jan Prach] scala-maven-plugin itself add scala sources by default
f47d856 [Jan Prach] remove spark-1.4-staging repository
c8a54db [Jan Prach] remove java compiler warnings from java tests
999a068 [Jan Prach] remove some maven warnings in scala ide
80fbdc5 [Jan Prach] remove groovy and gmavenplus plugin
Replace Akka Serialization with Spark Serializer and add unit tests.
Author: zsxwing <zsxwing@gmail.com>
Closes#7159 from zsxwing/remove-akka-serialization and squashes the following commits:
fc0fca3 [zsxwing] Merge branch 'master' into remove-akka-serialization
cf81a58 [zsxwing] Fix the code style
73251c6 [zsxwing] Add test scope
9ef4af9 [zsxwing] Add AkkaRpcEndpointRef.hashCode
433115c [zsxwing] Remove final
be3edb0 [zsxwing] Support deserializing RpcEndpointRef
ecec410 [zsxwing] Replace Akka Serialization with Spark Serializer
Author: Hari Shreedharan <hshreedharan@apache.org>
Closes#6939 from harishreedharan/upgrade-flume-1.6.0 and squashes the following commits:
94b80ae [Hari Shreedharan] [SPARK-8533][Streaming] Upgrade Flume to 1.6.0
This PR removes most of the code in the Spark REPL for Scala 2.11 and leaves just a couple of overridden methods in `SparkILoop` in order to:
- change welcome message
- restrict available commands (like `:power`)
- initialize Spark context
The two codebases have diverged and it's extremely hard to backport fixes from the upstream REPL. This somewhat radical step is absolutely necessary in order to fix other REPL tickets (like SPARK-8013 - Hive Thrift server for 2.11). BTW, the Scala REPL has fixed the serialization-unfriendly wrappers thanks to ScrapCodes's work in [#4522](https://github.com/scala/scala/pull/4522)
All tests pass and I tried the `spark-shell` on our Mesos cluster with some simple jobs (including with additional jars), everything looked good.
As soon as Scala 2.11.7 is out we need to upgrade and get a shaded `jline` dependency, clearing the way for SPARK-8013.
/cc pwendell
Author: Iulian Dragos <jaguarul@gmail.com>
Closes#6903 from dragos/issue/no-spark-repl-fork and squashes the following commits:
c596c6f [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork
2b1a305 [Iulian Dragos] Removed spaces around multiple imports.
0ce67a6 [Iulian Dragos] Remove -verbose flag for java compiler (added by mistake in an earlier commit).
10edaf9 [Iulian Dragos] Keep the jline dependency only in the 2.10 build.
529293b [Iulian Dragos] Add back Spark REPL files to rat-excludes, since they are part of the 2.10 real.
d85370d [Iulian Dragos] Remove jline dependency from the Spark REPL.
b541930 [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork
2b15962 [Iulian Dragos] Change jline dependency and bump Scala version.
b300183 [Iulian Dragos] Rename package and add license on top of the file, remove files from rat-excludes and removed `-Yrepl-sync` per reviewer’s request.
9d46d85 [Iulian Dragos] Fix SPARK-7944.
abcc7cb [Iulian Dragos] Remove the REPL forked code.
Also, add support for the *-provided profiles. This avoids repackaging
things that are already in the Spark assembly, or, in the case of the
*-provided profiles, are provided by the distribution.
The flume-ng-auth dependency was also excluded since it's not really
used by Spark.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7247 from vanzin/SPARK-8852 and squashes the following commits:
298a7d5 [Marcelo Vanzin] Feedback.
c962082 [Marcelo Vanzin] [SPARK-8852] [flume] Trim dependencies in flume assembly.
These two dependencies were introduced in #7231 to help testing Parquet compatibility with `parquet-thrift`. However, they somehow crash the Scala compiler in Maven builds.
This PR fixes this issue by:
1. Removing these two dependencies, and
2. Instead of generating the testing Parquet file programmatically, checking in an actual testing Parquet file generated by `parquet-thrift` as a test resource.
This is just a quick fix to bring back Maven builds. Need to figure out the root case as binary Parquet files are harder to maintain.
Author: Cheng Lian <lian@databricks.com>
Closes#7330 from liancheng/spark-8959 and squashes the following commits:
cf69512 [Cheng Lian] Brings back Maven builds
`spark.unsafe.exceptionOnMemoryLeak` is present in the config of surefire.
```
<!-- Surefire runs all Java tests -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<!-- Note config is repeated in scalatest config -->
...
<spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak>
</systemProperties>
...
```
but is absent in the config ScalaTest.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7308 from sarutak/add-setting-for-memory-leak and squashes the following commits:
95644e7 [Kousuke Saruta] Added a setting for memory leak
This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. And this one fixes the read path. Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
### Major changes
1. `CatalystConverter` class hierarchy refactoring
- Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`.
This simplifies the design since converters don't need to care about details of their parent converters anymore.
- Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
- Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
`CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
- Implements backwards-compatibility rules in `CatalystArrayConverter`
When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
2. Requested columns handling
When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration. Because the actual Parquet file may have different physical structure from the converted schema.
In this PR, the schema for requested columns is constructed using the following method:
- For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
- For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
- Unions all single-field `MessageType`s into a full schema containing all requested fields
With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files.
### Testing
This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.
[1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
[2]: https://issues.apache.org/jira/browse/SPARK-6774
[3]: https://issues.apache.org/jira/browse/SPARK-6123
[4]: https://issues.apache.org/jira/browse/SPARK-8848
Author: Cheng Lian <lian@databricks.com>
Closes#7231 from liancheng/spark-6776 and squashes the following commits:
360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
7946ee1 [Cheng Lian] Fixes Scala styling issues
3d7ab36 [Cheng Lian] Fixes .rat-excludes
a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
06cfe9d [Cheng Lian] Adds comments about TimestampType handling
a099d3e [Cheng Lian] More comments
0cc1b37 [Cheng Lian] Fixes MiMa checks
884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
38fe1e7 [Cheng Lian] Adds explicit return type
7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
1781dff [Cheng Lian] Adds test case for SPARK-8811
6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
a74fb2c [Cheng Lian] More comments
0525346 [Cheng Lian] Removes old Parquet record converters
03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
(This finishes the job by removing the version overridden by Hadoop profiles.)
See discussion at https://github.com/apache/spark/pull/6994#issuecomment-119113167
Author: Sean Owen <sowen@cloudera.com>
Closes#7261 from srowen/SPARK-6731.2 and squashes the following commits:
5a3f59e [Sean Owen] Finish updating Commons Math3 to 3.4.1 from 3.1.1
when publishing releases. We named it as 'release-profile' because that is
the Maven convention. However, it turns out this special name causes several
other things to kick-in when we are creating releases that are not desirable.
For instance, it triggers the javadoc plugin to run, which actually fails
in our current build set-up.
The fix is just to rename this to a different profile to have no
collateral damage associated with its use.
This is a workaround for MSHADE-148, which leads to an infinite loop when building Spark with maven 3.3.x. This was originally caused by #6441, which added a bunch of test dependencies on the spark-core test module. Recently, it was revealed by #7193.
This patch adds a `-Prelease` profile. If present, it will set `createDependencyReducedPom` to true. The consequences are:
- If you are releasing Spark with this profile, you are fine as long as you use maven 3.2.x or before.
- If you are releasing Spark without this profile, you will run into SPARK-8781.
- If you are not releasing Spark but you are using this profile, you may run into SPARK-8819.
- If you are not releasing Spark and you did not include this profile, you are fine.
This is all documented in `pom.xml` and tested locally with both versions of maven.
Author: Andrew Or <andrew@databricks.com>
Closes#7219 from andrewor14/fix-maven-build and squashes the following commits:
1d37e87 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-maven-build
3574ae4 [Andrew Or] Review comments
f39199c [Andrew Or] Create a -Prelease profile that flags `createDependencyReducedPom`
The issue is summarized in the JIRA and is caused by this commit: 984ad60147.
This patch reverts that commit and fixes the maven build in a different way. We limit the dependencies of `KinesisReceiverSuite` to avoid having to deal with the complexities in how maven deals with transitive test dependencies.
Author: Andrew Or <andrew@databricks.com>
Closes#7193 from andrewor14/fix-kinesis-pom and squashes the following commits:
ca3d5d4 [Andrew Or] Limit kinesis test dependencies
f24e09c [Andrew Or] Revert "[BUILD] Fix Maven build for Kinesis"
Author: zsxwing <zsxwing@gmail.com>
Closes#6830 from zsxwing/flume-python and squashes the following commits:
78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
This patch excludes `hadoop-client`'s dependency on `mockito-all`. As of #7061, Spark depends on `mockito-core` instead of `mockito-all`, so the dependency from Hadoop was leading to test compilation failures for some of the Hadoop 2 SBT builds.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7090 from JoshRosen/SPARK-8709 and squashes the following commits:
e190122 [Josh Rosen] [SPARK-8709] Exclude hadoop-client's mockito-all dependency.
PR #5694 reverted PR #6384 while refactoring `dev/run-tests` to `dev/run-tests.py`. Also, PR #6384 didn't bump Hadoop 1 version defined in POM.
Author: Cheng Lian <lian@databricks.com>
Closes#7062 from liancheng/spark-7845 and squashes the following commits:
c088b72 [Cheng Lian] Bumping default Hadoop version used in profile hadoop-1 to 1.2.1
The previous commiter on this part was pwendell
The previous url gives 404, the new one seems to be OK.
This patch is added under the Apache License 2.0.
The JIRA link: https://issues.apache.org/jira/browse/SPARK-8649
Author: Thomas Szymanski <develop@tszymanski.com>
Closes#7054 from tszym/SPARK-8649 and squashes the following commits:
bfda9c4 [Thomas Szymanski] [SPARK-8649] [BUILD] Mapr repository is not defined properly
Spark's tests currently depend on `mockito-all`, which bundles Hamcrest and Objenesis classes. Instead, it should depend on `mockito-core`, which declares those libraries as Maven dependencies. This is necessary in order to fix a dependency conflict that leads to a NoSuchMethodError when using certain Hamcrest matchers.
See https://github.com/mockito/mockito/wiki/Declaring-mockito-dependency for more details.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7061 from JoshRosen/mockito-core-instead-of-all and squashes the following commits:
70eccbe [Josh Rosen] Depend on mockito-core instead of mockito-all.
This change is a simple one and specifies a stack size of 4096k instead of the vendor default for Java tests (the defaults vary between Java vendors). This remedies test failures observed with JavaALSSuite with IBM and Oracle Java owing to a lower default size in comparison to the size with OpenJDK. 4096k is a suitable default where the tests pass with each Java vendor tested. The alternative is to reduce the number of iterations in the test (no observed failures with 5 iterations instead of 15).
-Xss works with Oracle's HotSpot VM, IBM's J9 VM and OpenJDK (IcedTea).
I have ensured this does not have any negative implications for other tests.
Author: Adam Roberts <aroberts@uk.ibm.com>
Author: a-roberts <aroberts@uk.ibm.com>
Closes#6727 from a-roberts/IncJavaStackSize and squashes the following commits:
ab40aea [Adam Roberts] Specify stack size for SBT builds
5032d8d [a-roberts] Update pom.xml
Update to Netty 4.0.28-Final
Author: Sean Owen <sowen@cloudera.com>
Closes#6701 from srowen/SPARK-8101 and squashes the following commits:
f3b6369 [Sean Owen] Update to Netty 4.0.28-Final
Even with all the efforts to cleanup the temp directories created by
unit tests, Spark leaves a lot of garbage in /tmp after a test run.
This change overrides java.io.tmpdir to place those files under the
build directory instead.
After an sbt full unit test run, I was left with > 400 MB of temp
files. Since they're now under the build dir, it's much easier to
clean them up.
Also make a slight change to a unit test to make it not pollute the
source directory with test data.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#6674 from vanzin/SPARK-8126 and squashes the following commits:
0f8ad41 [Marcelo Vanzin] Make sure tmp dir exists when tests run.
643e916 [Marcelo Vanzin] [MINOR] [BUILD] Use custom temp directory during build.
Update build to use Java 7, and remove some comments and special-case support for Java 6.
Author: Sean Owen <sowen@cloudera.com>
Closes#6265 from srowen/SPARK-7733 and squashes the following commits:
59bda4e [Sean Owen] Update build to use Java 7, and remove some comments and special-case support for Java 6
Both akka 2.3.x and hadoop-2.x use protobuf 2.5 so only hadoop-1 build needs
custom 2.3.4-spark akka version that shades protobuf-2.5
This change also updates akka version (for hadoop-2.x profiles only) to the
latest 2.3.11 as akka-zeromq_2.11 is not available for akka 2.3.4.
This partially fixes SPARK-7042 (for hadoop-2.x builds)
Author: Konstantin Shaposhnikov <Konstantin.Shaposhnikov@sc.com>
Closes#6492 from kostya-sh/SPARK-7042 and squashes the following commits:
dc195b0 [Konstantin Shaposhnikov] [SPARK-7042] [BUILD] use the standard akka artifacts with hadoop-2.x