Commit graph

570 commits

Author SHA1 Message Date
Josh Rosen f44c8a843c [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17753 from JoshRosen/SPARK-20453.
2017-04-24 21:48:04 -07:00
Yash Sharma ec68d8f8cf [SPARK-20189][DSTREAM] Fix spark kinesis testcases to remove deprecated createStream and use Builders
## What changes were proposed in this pull request?

The spark-kinesis testcases use the KinesisUtils.createStream which are deprecated now. Modify the testcases to use the recommended KinesisInputDStream.builder instead.
This change will also enable the testcases to automatically use the session tokens automatically.

## How was this patch tested?

All the existing testcases work fine as expected with the changes.

https://issues.apache.org/jira/browse/SPARK-20189

Author: Yash Sharma <ysharma@atlassian.com>

Closes #17506 from yssharma/ysharma/cleanup_kinesis_testcases.
2017-04-13 08:49:19 +01:00
Sean Owen a26e3ed5e4 [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems
## What changes were proposed in this pull request?

Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").

The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #17527 from srowen/SPARK-20156.
2017-04-10 20:11:56 +01:00
Tathagata Das dad499f324 [SPARK-20209][SS] Execute next trigger immediately if previous batch took longer than trigger interval
## What changes were proposed in this pull request?

For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, then it will wait for 9 mins before starting the next batch. This does not make sense. The processing time based trigger policy should be to do process batches as fast as possible, but no faster than 1 in every trigger interval. If batches are taking longer than trigger interval anyways, then no point waiting extra trigger interval.

In this PR, I modified the ProcessingTimeExecutor to do so. Another minor change I did was to extract our StreamManualClock into a separate class so that it can be used outside subclasses of StreamTest. For example, ProcessingTimeExecutorSuite does not need to create any context for testing, just needs the StreamManualClock.

## How was this patch tested?
Added new unit tests to comprehensively test this behavior.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17525 from tdas/SPARK-20209.
2017-04-04 23:20:17 -07:00
Adam Budde 707e501832 [SPARK-19911][STREAMING] Add builder interface for Kinesis DStreams
## What changes were proposed in this pull request?

- Add new KinesisDStream.scala containing KinesisDStream.Builder class
- Add KinesisDStreamBuilderSuite test suite
- Make KinesisInputDStream ctor args package private for testing
- Add JavaKinesisDStreamBuilderSuite test suite
- Add args to KinesisInputDStream and KinesisReceiver for optional
  service-specific auth (Kinesis, DynamoDB and CloudWatch)
## How was this patch tested?

Added ```KinesisDStreamBuilderSuite``` to verify builder class works as expected

Author: Adam Budde <budde@amazon.com>

Closes #17250 from budde/KinesisStreamBuilder.
2017-03-24 12:40:29 -07:00
Tyson Condie 746a558de2 [SPARK-19876][SS][WIP] OneTime Trigger Executor
## What changes were proposed in this pull request?

An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.

In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log will be used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself to determine what batch to process next after restart; using the offset log to determine this would process the previously logged batch, always, thus not permitting a OneTime trigger feature.

## How was this patch tested?

A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.

In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
- The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
- The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
- A OneTime trigger execution that results in an exception being thrown.

marmbrus tdas zsxwing

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17219 from tcondie/stream-commit.
2017-03-23 14:32:05 -07:00
Liwei Lin 2ea214dd05 [SPARK-19721][SS] Good error message for version mismatch in log files
## Problem

There are several places where we write out version identifiers in various logs for structured streaming (usually `v1`). However, in the places where we check for this, we throw a confusing error message.

## What changes were proposed in this pull request?

This patch made two major changes:
1. added a `parseVersion(...)` method, and based on this method, fixed the following places the way they did version checking (no other place needed to do this checking):
```
HDFSMetadataLog
  - CompactibleFileStreamLog  ------------> fixed with this patch
    - FileStreamSourceLog  ---------------> inherited the fix of `CompactibleFileStreamLog`
    - FileStreamSinkLog  -----------------> inherited the fix of `CompactibleFileStreamLog`
  - OffsetSeqLog  ------------------------> fixed with this patch
  - anonymous subclass in KafkaSource  ---> fixed with this patch
```

2. changed the type of `FileStreamSinkLog.VERSION`, `FileStreamSourceLog.VERSION` etc. from `String` to `Int`, so that we can identify newer versions via `version > 1` instead of `version != "v1"`
    - note this didn't break any backwards compatibility -- we are still writing out `"v1"` and reading back `"v1"`

## Exception message with this patch
```
java.lang.IllegalStateException: Failed to read log file /private/var/folders/nn/82rmvkk568sd8p3p8tb33trw0000gn/T/spark-86867b65-0069-4ef1-b0eb-d8bd258ff5b8/0. UnsupportedLogVersion: maximum supported log version is v1, but encountered v99. The log file was produced by a newer version of Spark and cannot be read by this version. Please upgrade.
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:202)
	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:78)
	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:75)
	at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:133)
	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite.withTempDir(OffsetSeqLogSuite.scala:26)
	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply$mcV$sp(OffsetSeqLogSuite.scala:75)
	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
```

## How was this patch tested?

unit tests

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17070 from lw-lin/better-msg.
2017-03-16 13:05:36 -07:00
uncleGen 0a4d06a7c3 [SPARK-19853][SS] uppercase kafka topics fail when startingOffsets are SpecificOffsets
When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase JSON is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer.

KafkaSourceProvider.scala:
```
val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match {
    case Some("latest") => LatestOffsets
    case Some("earliest") => EarliestOffsets
    case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
    case None => LatestOffsets
  }
```

Thank cbowden for reporting.

Jenkins

Author: uncleGen <hustyugm@gmail.com>

Closes #17209 from uncleGen/SPARK-19853.
2017-03-12 17:52:20 -07:00
Burak Yavuz 82138e09b9 [SPARK-19886] Fix reportDataLoss if statement in SS KafkaSource
## What changes were proposed in this pull request?

Fix the `throw new IllegalStateException` if statement part.

## How is this patch tested

Regression test

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #17228 from brkyvz/kafka-cause-fix.
2017-03-09 17:42:10 -08:00
Tyson Condie b0a5cd8909 [SPARK-19719][SS] Kafka writer for both structured streaming and batch queires
## What changes were proposed in this pull request?

Add a new Kafka Sink and Kafka Relation for writing streaming and batch queries, respectively, to Apache Kafka.
### Streaming Kafka Sink
- When addBatch is called
-- If batchId is great than the last written batch
--- Write batch to Kafka
---- Topic will be taken from the record, if present, or from a topic option, which overrides topic in record.
-- Else ignore

### Batch Kafka Sink
- KafkaSourceProvider will implement CreatableRelationProvider
- CreatableRelationProvider#createRelation will write the passed in Dataframe to a Kafka
- Topic will be taken from the record, if present, or from topic option, which overrides topic in record.
- Save modes Append and ErrorIfExist supported under identical semantics. Other save modes result in an AnalysisException

tdas zsxwing

## How was this patch tested?

### The following unit tests will be included
- write to stream with topic field: valid stream write with data that includes an existing topic in the schema
- write structured streaming aggregation w/o topic field, with default topic: valid stream write with data that does not include a topic field, but the configuration includes a default topic
- write data with bad schema: various cases of writing data that does not conform to a proper schema e.g., 1. no topic field or default topic, and 2. no value field
- write data with valid schema but wrong types: data with a complete schema but wrong types e.g., key and value types are integers.
- write to non-existing topic: write a stream to a topic that does not exist in Kafka, which has been configured to not auto-create topics.
- write batch to kafka: simple write batch to Kafka, which goes through the same code path as streaming scenario, so validity checks will not be redone here.

### Examples
```scala
// Structured Streaming
val writer = inputStringStream.map(s => s.get(0).toString.getBytes()).toDF("value")
 .selectExpr("value as key", "value as value")
 .writeStream
 .format("kafka")
 .option("checkpointLocation", checkpointDir)
 .outputMode(OutputMode.Append)
 .option("kafka.bootstrap.servers", brokerAddress)
 .option("topic", topic)
 .queryName("kafkaStream")
 .start()

// Batch
val df = spark
 .sparkContext
 .parallelize(Seq("1", "2", "3", "4", "5"))
 .map(v => (topic, v))
 .toDF("topic", "value")

df.write
 .format("kafka")
 .option("kafka.bootstrap.servers",brokerAddress)
 .option("topic", topic)
 .save()
```
Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>

Closes #17043 from tcondie/kafka-writer.
2017-03-06 16:39:05 -08:00
Gaurav 46a64d1e0a [SPARK-19304][STREAMING][KINESIS] fix kinesis slow checkpoint recovery
## What changes were proposed in this pull request?
added a limit to getRecords api call call in KinesisBackedBlockRdd. This helps reduce the amount of data returned by kinesis api call making the recovery considerably faster

As we are storing the `fromSeqNum` & `toSeqNum` in checkpoint metadata, we can also store the number of records. Which can later be used for api call.

## How was this patch tested?
The patch was manually tested

Apologies for any silly mistakes, opening first pull request

Author: Gaurav <gaurav@techtinium.com>

Closes #16842 from Gauravshah/kinesis_checkpoint_recovery_fix_2_1_0.
2017-03-06 10:41:49 -08:00
hyukjinkwon 4ba9c6c453 [MINOR][BUILD] Fix lint-java breaks in Java
## What changes were proposed in this pull request?

This PR proposes to fix the lint-breaks as below:

```
[ERROR] src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.network.buffer.ManagedBuffer.
[ERROR] src/main/java/org/apache/spark/unsafe/types/UTF8String.java:[156,10] (modifier) ModifierOrder: 'Nonnull' annotation modifier does not precede non-annotation modifiers.
[ERROR] src/main/java/org/apache/spark/SparkFirehoseListener.java:[122] (sizes) LineLength: Line is longer than 100 characters (found 105).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[164,78] (coding) OneStatementPerLine: Only one statement per line allowed.
[ERROR] src/test/java/test/org/apache/spark/JavaAPISuite.java:[1157] (sizes) LineLength: Line is longer than 100 characters (found 121).
[ERROR] src/test/java/org/apache/spark/streaming/JavaMapWithStateSuite.java:[149] (sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] src/test/java/test/org/apache/spark/streaming/Java8APISuite.java:[146] (sizes) LineLength: Line is longer than 100 characters (found 122).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[32,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.Time.
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[611] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[1317] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java:[91] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[113] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[164] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[212] (sizes) LineLength: Line is longer than 100 characters (found 114).
[ERROR] src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java:[36] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java:[26,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[20,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[94] (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.api.java.UDF1.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[72] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java:[121] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaRDD.
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaSparkContext.
```

## How was this patch tested?

Manually via

```bash
./dev/lint-java
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17072 from HyukjinKwon/java-lint.
2017-02-27 08:44:26 +00:00
Adam Budde e4065376d2 [SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS
- Add dependency on aws-java-sdk-sts
- Replace SerializableAWSCredentials with new SerializableCredentialsProvider interface
- Make KinesisReceiver take SerializableCredentialsProvider as argument and
  pass credential provider to KCL
- Add new implementations of KinesisUtils.createStream() that take STS
  arguments
- Make JavaKinesisStreamSuite test the entire KinesisUtils Java API
- Update KCL/AWS SDK dependencies to 1.7.x/1.11.x

## What changes were proposed in this pull request?

[JIRA link with detailed description.](https://issues.apache.org/jira/browse/SPARK-19405)

* Replace SerializableAWSCredentials with new SerializableKCLAuthProvider class that takes 5 optional config params for configuring AWS auth and returns the appropriate credential provider object
* Add new public createStream() APIs for specifying these parameters in KinesisUtils

## How was this patch tested?

* Manually tested using explicit keypair and instance profile to read data from Kinesis stream in separate account (difficult to write a test orchestrating creation and assumption of IAM roles across separate accounts)
* Expanded JavaKinesisStreamSuite to test the entire Java API in KinesisUtils

## License acknowledgement
This contribution is my original work and that I license the work to the project under the project’s open source license.

Author: Budde <budde@amazon.com>

Closes #16744 from budde/master.
2017-02-22 11:32:36 -05:00
hyukjinkwon 17b93b5feb
[SPARK-18922][TESTS] Fix new test failures on Windows due to path and resource not closed
## What changes were proposed in this pull request?

This PR proposes to fix new test failures on WIndows as below:

**Before**

```
KafkaRelationSuite:
 - test late binding start offsets *** FAILED *** (7 seconds, 679 milliseconds)
   Cause: java.nio.file.FileSystemException: C:\projects\spark\target\tmp\spark-4c4b0cd1-4cb7-4908-949d-1b0cc8addb50\topic-4-0\00000000000000000000.log -> C:\projects\spark\target\tmp\spark-4c4b0cd1-4cb7-4908-949d-1b0cc8addb50\topic-4-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.

KafkaSourceSuite:
 - deserialization of initial offset with Spark 2.1.0 *** FAILED *** (3 seconds, 542 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-97ef64fc-ae61-4ce3-ac59-287fd38bd824

 - deserialization of initial offset written by Spark 2.1.0 *** FAILED *** (60 milliseconds)
   java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/projects/spark/external/kafka-0-10-sql/target/scala-2.11/test-classes/kafka-source-initial-offset-version-2.1.0.b

HiveDDLSuite:
 - partitioned table should always put partition columns at the end of table schema *** FAILED *** (657 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-f1b83d09-850a-4bba-8e43-a2a28dfaa757;

DDLSuite:
 - create a data source table without schema *** FAILED *** (94 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-a3f3c161-afae-4d6f-9182-e8642f77062b;

 - SET LOCATION for managed table *** FAILED *** (219 milliseconds)
   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
 Exchange SinglePartit
 +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#99367L])
    +- *FileScan parquet default.tbl[] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:projectsspark	arget	mpspark-15be2f2f-4ea9-4c47-bfee-1b7b49363033], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

 - insert data to a data source table which has a not existed location should succeed *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-34987671-e8d1-4624-ba5b-db1012e1246b;

 - insert into a data source table with no existed partition location should succeed *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4c6ccfbf-4091-4032-9fbc-3d40c58267d5;

 - read data from a data source table which has a not existed location should succeed *** FAILED *** (0 milliseconds)

 - read data from a data source table with no existed partition location should succeed *** FAILED *** (0 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-6af39e37-abd1-44e8-ac68-e2dfcf67a2f3;

InputOutputMetricsSuite:
 - output metrics on records written *** FAILED *** (0 milliseconds)
   java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-cd69ee77-88f2-4202-bed6-19c0ee05ef55\InputOutputMetricsSuite, expected: file:///

 - output metrics on records written - new Hadoop API *** FAILED *** (16 milliseconds)
   java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-b69e8fcb-047b-4de8-9cdf-5f026efb6762\InputOutputMetricsSuite, expected: file:///
```

**After**

```
KafkaRelationSuite:
 - test late binding start offsets !!! CANCELED !!! (62 milliseconds)

KafkaSourceSuite:
 - deserialization of initial offset with Spark 2.1.0 (5 seconds, 341 milliseconds)
 - deserialization of initial offset written by Spark 2.1.0 (910 milliseconds)

HiveDDLSuite:
 - partitioned table should always put partition columns at the end of table schema (2 seconds)

DDLSuite:
 - create a data source table without schema (828 milliseconds)
 - SET LOCATION for managed table (406 milliseconds)
 - insert data to a data source table which has a not existed location should succeed (406 milliseconds)
 - insert into a data source table with no existed partition location should succeed (453 milliseconds)
 - read data from a data source table which has a not existed location should succeed (94 milliseconds)
 - read data from a data source table with no existed partition location should succeed (265 milliseconds)

InputOutputMetricsSuite:
 - output metrics on records written (172 milliseconds)
 - output metrics on records written - new Hadoop API (297 milliseconds)
```

## How was this patch tested?

Fixed tests in `InputOutputMetricsSuite`, `KafkaRelationSuite`,  `KafkaSourceSuite`, `DDLSuite.scala` and `HiveDDLSuite`.

Manually tested via AppVeyor as below:

`InputOutputMetricsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/633-20170219-windows-test/job/ex8nvwa6tsh7rmto
`KafkaRelationSuite`: https://ci.appveyor.com/project/spark-test/spark/build/633-20170219-windows-test/job/h8dlcowew52y8ncw
`KafkaSourceSuite`: https://ci.appveyor.com/project/spark-test/spark/build/634-20170219-windows-test/job/9ybgjl7yeubxcre4
`DDLSuite`: https://ci.appveyor.com/project/spark-test/spark/build/635-20170219-windows-test
`HiveDDLSuite`: https://ci.appveyor.com/project/spark-test/spark/build/633-20170219-windows-test/job/up6o9n47er087ltb

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16999 from HyukjinKwon/windows-fix.
2017-02-20 21:26:54 -08:00
Sean Owen 1487c9af20
[SPARK-19534][TESTS] Convert Java tests to use lambdas, Java 8 features
## What changes were proposed in this pull request?

Convert tests to use Java 8 lambdas, and modest related fixes to surrounding code.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16964 from srowen/SPARK-19534.
2017-02-19 09:42:50 -08:00
Shixiong Zhu 15b144d2bf [SPARK-19617][SS] Fix the race condition when starting and stopping a query quickly
## What changes were proposed in this pull request?

The streaming thread in StreamExecution uses the following ways to check if it should exit:
- Catch an InterruptException.
- `StreamExecution.state` is TERMINATED.

When starting and stopping a query quickly, the above two checks may both fail:
- Hit [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084) and swallow InterruptException
- StreamExecution.stop is called before `state` becomes `ACTIVE`. Then [runBatches](dcc2d540a5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala (L252)) changes the state from `TERMINATED` to `ACTIVE`.

If the above cases both happen, the query will hang forever.

This PR changes `state` to `AtomicReference` and uses`compareAndSet` to make sure we only change the state from `INITIALIZING` to `ACTIVE`. It also removes the `runUninterruptibly` hack from ``HDFSMetadata`, because HADOOP-14084 won't cause any problem after we fix the race condition.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16947 from zsxwing/SPARK-19617.
2017-02-17 19:04:45 -08:00
Roberto Agostino Vitillo 1a3f5f8c55 [SPARK-19517][SS] KafkaSource fails to initialize partition offsets
## What changes were proposed in this pull request?

This patch fixes a bug in `KafkaSource` with the (de)serialization of the length of the JSON string that contains the initial partition offsets.

## How was this patch tested?

I ran the test suite for spark-sql-kafka-0-10.

Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>

Closes #16857 from vitillo/kafka_source_fix.
2017-02-17 11:44:18 -08:00
Sean Owen 0e2405490f
[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support
- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.
2017-02-16 12:32:45 +00:00
sureshthalamati f48c5a57d6 [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner.
## What changes were proposed in this pull request?
The reason for test failure is that the property “oracle.jdbc.mapDateToTimestamp” set by the test was getting converted into all lower case. Oracle database expects this property in case-sensitive manner.

This test was passing in previous releases because connection properties were sent as user specified for the test case scenario. Fixes to handle all option uniformly in case-insensitive manner, converted the JDBC connection properties also to lower case.

This PR  enhances CaseInsensitiveMap to keep track of input case-sensitive keys , and uses those when creating connection properties that are passed to the JDBC connection.

Alternative approach PR https://github.com/apache/spark/pull/16847  is to pass original input keys to JDBC data source by adding check in the  Data source class and handle case-insensitivity in the JDBC source code.

## How was this patch tested?
Added new test cases to JdbcSuite , and OracleIntegrationSuite. Ran docker integration tests passed on my laptop, all tests passed successfully.

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
2017-02-14 15:34:12 -08:00
Liwei Lin 2bdbc87052 [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group
## What changes were proposed in this pull request?

In `KafkaOffsetReader`, when error occurs, we abort the existing consumer and create a new consumer. In our current implementation, the first consumer and the second consumer would be in the same group (which leads to SPARK-19559), **_violating our intention of the two consumers not being in the same group._**

The cause is that, in our current implementation, the first consumer is created before `groupId` and `nextId` are initialized in the constructor. Then even if `groupId` and `nextId` are increased during the creation of that first consumer, `groupId` and `nextId` would still be initialized to default values in the constructor for the second consumer.

We should make sure that `groupId` and `nextId` are initialized before any consumer is created.

## How was this patch tested?

Ran 100 times of `KafkaSourceSuite`; all passed

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16902 from lw-lin/SPARK-19564-.
2017-02-12 23:00:22 -08:00
Tyson Condie 8df4444034 [SPARK-18682][SS] Batch Source for Kafka
## What changes were proposed in this pull request?

Today, you can start a stream that reads from kafka. However, given kafka's configurable retention period, it seems like sometimes you might just want to read all of the data that is available now. As such we should add a version that works with spark.read as well.
The options should be the same as the streaming kafka source, with the following differences:
startingOffsets should default to earliest, and should not allow latest (which would always be empty).
endingOffsets should also be allowed and should default to latest. the same assign json format as startingOffsets should also be accepted.
It would be really good, if things like .limit(n) were enough to prevent all the data from being read (this might just work).

## How was this patch tested?

KafkaRelationSuite was added for testing batch queries via KafkaUtils.

Author: Tyson Condie <tcondie@gmail.com>

Closes #16686 from tcondie/SPARK-18682.
2017-02-07 14:31:23 -08:00
Takeshi YAMAMURO 256a3a8013 [SPARK-18020][STREAMING][KINESIS] Checkpoint SHARD_END to finish reading closed shards
## What changes were proposed in this pull request?
This pr is to fix an issue occurred when resharding Kinesis streams; the resharding makes the KCL throw an exception because Spark does not checkpoint `SHARD_END` when finishing reading closed shards in `KinesisRecordProcessor#shutdown`. This bug finally leads to stopping subscribing new split (or merged) shards.

## How was this patch tested?
Added a test in `KinesisStreamSuite` to check if it works well when splitting/merging shards.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #16213 from maropu/SPARK-18020.
2017-01-25 17:38:48 -08:00
sureshthalamati f174cdc747 [SPARK-14536][SQL] fix to handle null value in array type column for postgres.
## What changes were proposed in this pull request?

JDBC  read  is failing with  NPE due to missing null value check for array data type if the source table has null values in the array type column.  For null values Resultset.getArray()  returns null.
This PR adds null safe check to the Resultset.getArray() value before invoking method on the Array object.
## How was this patch tested?

Updated the PostgresIntegration test suite to test null values. Ran docker integration tests on my laptop.

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #15192 from sureshthalamati/jdbc_array_null_fix-SPARK-14536.
2017-01-20 19:23:20 -08:00
uncleGen eefdf9f9dd
[SPARK-19227][SPARK-19251] remove unused imports and outdated comments
## What changes were proposed in this pull request?
remove ununsed imports and outdated comments, and fix some minor code style issue.

## How was this patch tested?
existing ut

Author: uncleGen <hustyugm@gmail.com>

Closes #16591 from uncleGen/SPARK-19227.
2017-01-18 09:44:32 +00:00
uncleGen a5e651f4c6
[SPARK-19206][DOC][DSTREAM] Fix outdated parameter descriptions in kafka010
## What changes were proposed in this pull request?

Fix outdated parameter descriptions in kafka010

## How was this patch tested?

cc koeninger  zsxwing

Author: uncleGen <hustyugm@gmail.com>

Closes #16569 from uncleGen/SPARK-19206.
2017-01-15 11:16:49 +00:00
hyukjinkwon 4e27578faa
[SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all identified tests failed due to path and resource-not-closed problems on Windows
## What changes were proposed in this pull request?

This PR proposes to fix all the test failures identified by testing with AppVeyor.

**Scala - aborted tests**

```
WindowQuerySuite:
  Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.execution.WindowQuerySuite *** ABORTED *** (156 milliseconds)
   org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive   argetscala-2.11   est-classesdatafilespart_tiny.txt;

OrcSourceSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.orc.OrcSourceSuite *** ABORTED *** (62 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

ParquetMetastoreSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetMetastoreSuite *** ABORTED *** (4 seconds, 703 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

ParquetSourceSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetSourceSuite *** ABORTED *** (3 seconds, 907 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-581a6575-454f-4f21-a516-a07f95266143;

KafkaRDDSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaRDDSuite *** ABORTED *** (5 seconds, 212 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-4722304d-213e-4296-b556-951df1a46807

DirectKafkaStreamSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite *** ABORTED *** (7 seconds, 127 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d0d3eba7-4215-4e10-b40e-bb797e89338e
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)

ReliableKafkaStreamSuite
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.ReliableKafkaStreamSuite *** ABORTED *** (5 seconds, 498 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d33e45a0-287e-4bed-acae-ca809a89d888

KafkaStreamSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaStreamSuite *** ABORTED *** (2 seconds, 892 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-59c9d169-5a56-4519-9ef0-cefdbd3f2e6c

KafkaClusterSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaClusterSuite *** ABORTED *** (1 second, 690 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-3ef402b0-8689-4a60-85ae-e41e274f179d

DirectKafkaStreamSuite:
 Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite *** ABORTED *** (59 seconds, 626 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-426107da-68cf-4d94-b0d6-1f428f1c53f6

KafkaRDDSuite:
Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.KafkaRDDSuite *** ABORTED *** (2 minutes, 6 seconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-b9ce7929-5dae-46ab-a0c4-9ef6f58fbc2
```

**Java - failed tests**

```
Test org.apache.spark.streaming.kafka.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-1cee32f4-4390-4321-82c9-e8616b3f0fb0, took 9.61 sec

Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-f42695dd-242e-4b07-847c-f299b8e4676e, took 11.797 sec

Test org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-85c0d062-78cf-459c-a2dd-7973572101ce, took 1.581 sec

Test org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-49eb6b5c-8366-47a6-83f2-80c443c48280, took 17.895 sec

org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-898cf826-d636-4b1c-a61a-c12a364c02e7, took 8.858 sec
```

**Scala - failed tests**

```
PartitionProviderCompatibilitySuite:
 - insert overwrite partition of new datasource table overwrites just partition *** FAILED *** (828 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-bb6337b9-4f99-45ab-ad2c-a787ab965c09

 - SPARK-18635 special chars in partition values - partition management true *** FAILED *** (5 seconds, 360 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - SPARK-18635 special chars in partition values - partition management false *** FAILED *** (141 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```

```
UtilsSuite:
 - reading offset bytes of a file (compressed) *** FAILED *** (0 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ecb2b7d5-db8b-43a7-b268-1bf242b5a491

 - reading offset bytes across multiple files (compressed) *** FAILED *** (0 milliseconds)
   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-25cc47a8-1faa-4da5-8862-cf174df63ce0
```

```
StatisticsSuite:
 - MetastoreRelations fallback to HDFS for size estimation *** FAILED *** (110 milliseconds)
   org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'csv_table' not found in database 'default';
```

```
SQLQuerySuite:
 - permanent UDTF *** FAILED *** (125 milliseconds)
   org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count_temp'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 24

 - describe functions - user defined functions *** FAILED *** (125 milliseconds)
   org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7

 - CTAS without serde with location *** FAILED *** (16 milliseconds)
   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-ed673d73-edfc-404e-829e-2e2b9725d94e/c1

 - derived from Hive query file: drop_database_removes_partition_dirs.q *** FAILED *** (47 milliseconds)
   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_database_removes_partition_dirs_table

 - derived from Hive query file: drop_table_removes_partition_dirs.q *** FAILED *** (0 milliseconds)
   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_table_removes_partition_dirs_table2

 - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH *** FAILED *** (109 milliseconds)
   java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/projects/spark/sql/hive/projectsspark	arget	mpspark-1a122f8c-dfb3-46c4-bab1-f30764baee0e/*part-r*
```

```
HiveDDLSuite:
 - drop external tables in default database *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - add/drop partitions - external table *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - create/drop database - location without pre-created directory *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - create/drop database - location with pre-created directory *** FAILED *** (32 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - drop database containing tables - CASCADE *** FAILED *** (94 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - drop an empty database - CASCADE *** FAILED *** (63 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - drop database containing tables - RESTRICT *** FAILED *** (47 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - drop an empty database - RESTRICT *** FAILED *** (47 milliseconds)
   CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)

 - CREATE TABLE LIKE an external data source table *** FAILED *** (140 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-c5eba16d-07ae-4186-95bb-21c5811cf888;

 - CREATE TABLE LIKE an external Hive serde table *** FAILED *** (16 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - desc table for data source table - no user-defined schema *** FAILED *** (125 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e8bf5bf5-721a-4cbe-9d6	at scala.collection.immutable.List.foreach(List.scala:381)d-5543a8301c1d;
```

```
MetastoreDataSourcesSuite
 - CTAS: persisted bucketed data source table *** FAILED *** (16 milliseconds)
   java.lang.IllegalArgumentException: Can not create a Path from an empty string
```

```
ShowCreateTableSuite:
 - simple external hive table *** FAILED *** (0 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```

```
PartitionedTablePerfStatsSuite:
 - hive table: partitioned pruned table reports only selected files *** FAILED *** (313 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: partitioned pruned table reports only selected files *** FAILED *** (219 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-311f45f8-d064-4023-a4bb-e28235bff64d;

 - hive table: lazy partition pruning reads only necessary partition data *** FAILED *** (203 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: lazy partition pruning reads only necessary partition data *** FAILED *** (187 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-fde874ca-66bd-4d0b-a40f-a043b65bf957;

 - hive table: lazy partition pruning with file status caching enabled *** FAILED *** (188 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: lazy partition pruning with file status caching enabled *** FAILED *** (187 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e6d20183-dd68-4145-acbe-4a509849accd;

 - hive table: file status caching respects refresh table and refreshByPath *** FAILED *** (172 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: file status caching respects refresh table and refreshByPath *** FAILED *** (203 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-8b2c9651-2adf-4d58-874f-659007e21463;

 - hive table: file status cache respects size limit *** FAILED *** (219 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: file status cache respects size limit *** FAILED *** (171 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-7835ab57-cb48-4d2c-bb1d-b46d5a4c47e4;

 - datasource table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-20598d76-c004-42a7-8061-6c56f0eda5e2;

 - hive table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - hive table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 281 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 422 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4cfed321-4d1d-4b48-8d34-5c169afff383;

 - hive table: files read and cached when filesource partition management is off *** FAILED *** (234 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);

 - datasource table: all partition data cached in memory when partition management is off *** FAILED *** (203 milliseconds)
   org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4bcc0398-15c9-4f6a-811e-12d40f3eec12;

 - SPARK-18700: table loaded only once even when resolved concurrently *** FAILED *** (1 second, 266 milliseconds)
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```

```
HiveSparkSubmitSuite:
 - temporary Hive UDF: define a UDF and use it *** FAILED *** (2 seconds, 94 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - permanent Hive UDF: define a UDF and use it *** FAILED *** (281 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - permanent Hive UDF: use a already defined permanent function *** FAILED *** (718 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-8368: includes jars passed in through --jars *** FAILED *** (3 seconds, 521 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-8020: set sql conf in spark conf *** FAILED *** (0 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-8489: MissingRequirementError during reflection *** FAILED *** (94 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-9757 Persist Parquet relation with decimal column *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-11009 fix wrong result of Window function in cluster mode *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-14244 fix window partition size attribute binding failure *** FAILED *** (78 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - set spark.sql.warehouse.dir *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - set hive.metastore.warehouse.dir *** FAILED *** (15 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-16901: set javax.jdo.option.ConnectionURL *** FAILED *** (16 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified

 - SPARK-18360: default table path of tables in default database should depend on the location of default database *** FAILED *** (15 milliseconds)
   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
```

```
UtilsSuite:
 - resolveURIs with multiple paths *** FAILED *** (0 milliseconds)
   ".../jar3,file:/C:/pi.py[%23]py.pi,file:/C:/path%..." did not equal ".../jar3,file:/C:/pi.py[#]py.pi,file:/C:/path%..." (UtilsSuite.scala:468)
```

```
CheckpointSuite:
 - recovery with file input stream *** FAILED *** (10 seconds, 205 milliseconds)
   The code passed to eventually never returned normally. Attempted 660 times over 10.014272499999999 seconds. Last failure message: Unexpected internal error near index 1
   \
    ^. (CheckpointSuite.scala:680)
```

## How was this patch tested?

Manually via AppVeyor as below:

**Scala - aborted tests**

```
WindowQuerySuite - all passed
OrcSourceSuite:
- SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
ParquetMetastoreSuite - all passed
ParquetSourceSuite - all passed
KafkaRDDSuite - all passed
DirectKafkaStreamSuite - all passed
ReliableKafkaStreamSuite - all passed
KafkaStreamSuite - all passed
KafkaClusterSuite - all passed
DirectKafkaStreamSuite - all passed
KafkaRDDSuite - all passed
```

**Java - failed tests**

```
org.apache.spark.streaming.kafka.JavaKafkaRDDSuite - all passed
org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite - all passed
org.apache.spark.streaming.kafka.JavaKafkaStreamSuite - all passed
org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite - all passed
org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite - all passed
```

**Scala - failed tests**

```
PartitionProviderCompatibilitySuite:
- insert overwrite partition of new datasource table overwrites just partition (1 second, 953 milliseconds)
- SPARK-18635 special chars in partition values - partition management true (6 seconds, 31 milliseconds)
- SPARK-18635 special chars in partition values - partition management false (4 seconds, 578 milliseconds)
```

```
UtilsSuite:
- reading offset bytes of a file (compressed) (203 milliseconds)
- reading offset bytes across multiple files (compressed) (0 milliseconds)
```

```
StatisticsSuite:
- MetastoreRelations fallback to HDFS for size estimation (94 milliseconds)
```

```
SQLQuerySuite:
 - permanent UDTF (407 milliseconds)
 - describe functions - user defined functions (441 milliseconds)
 - CTAS without serde with location (2 seconds, 831 milliseconds)
 - derived from Hive query file: drop_database_removes_partition_dirs.q (734 milliseconds)
 - derived from Hive query file: drop_table_removes_partition_dirs.q (563 milliseconds)
 - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH (453 milliseconds)
```

```
HiveDDLSuite:
 - drop external tables in default database (3 seconds, 5 milliseconds)
 - add/drop partitions - external table (2 seconds, 750 milliseconds)
 - create/drop database - location without pre-created directory (500 milliseconds)
 - create/drop database - location with pre-created directory (407 milliseconds)
 - drop database containing tables - CASCADE (453 milliseconds)
 - drop an empty database - CASCADE (375 milliseconds)
 - drop database containing tables - RESTRICT (328 milliseconds)
 - drop an empty database - RESTRICT (391 milliseconds)
 - CREATE TABLE LIKE an external data source table (953 milliseconds)
 - CREATE TABLE LIKE an external Hive serde table (3 seconds, 782 milliseconds)
 - desc table for data source table - no user-defined schema (1 second, 150 milliseconds)
```

```
MetastoreDataSourcesSuite
 - CTAS: persisted bucketed data source table (875 milliseconds)
```

```
ShowCreateTableSuite:
 - simple external hive table (78 milliseconds)
```

```
PartitionedTablePerfStatsSuite:
 - hive table: partitioned pruned table reports only selected files (1 second, 109 milliseconds)
- datasource table: partitioned pruned table reports only selected files (860 milliseconds)
 - hive table: lazy partition pruning reads only necessary partition data (859 milliseconds)
 - datasource table: lazy partition pruning reads only necessary partition data (1 second, 219 milliseconds)
 - hive table: lazy partition pruning with file status caching enabled (875 milliseconds)
 - datasource table: lazy partition pruning with file status caching enabled (890 milliseconds)
 - hive table: file status caching respects refresh table and refreshByPath (922 milliseconds)
 - datasource table: file status caching respects refresh table and refreshByPath (640 milliseconds)
 - hive table: file status cache respects size limit (469 milliseconds)
 - datasource table: file status cache respects size limit (453 milliseconds)
 - datasource table: table setup does not scan filesystem (328 milliseconds)
 - hive table: table setup does not scan filesystem (313 milliseconds)
 - hive table: num hive client calls does not scale with partition count (5 seconds, 431 milliseconds)
 - datasource table: num hive client calls does not scale with partition count (4 seconds, 79 milliseconds)
 - hive table: files read and cached when filesource partition management is off (656 milliseconds)
 - datasource table: all partition data cached in memory when partition management is off (484 milliseconds)
 - SPARK-18700: table loaded only once even when resolved concurrently (2 seconds, 578 milliseconds)
```

```
HiveSparkSubmitSuite:
 - temporary Hive UDF: define a UDF and use it (1 second, 745 milliseconds)
 - permanent Hive UDF: define a UDF and use it (406 milliseconds)
 - permanent Hive UDF: use a already defined permanent function (375 milliseconds)
 - SPARK-8368: includes jars passed in through --jars (391 milliseconds)
 - SPARK-8020: set sql conf in spark conf (156 milliseconds)
 - SPARK-8489: MissingRequirementError during reflection (187 milliseconds)
 - SPARK-9757 Persist Parquet relation with decimal column (157 milliseconds)
 - SPARK-11009 fix wrong result of Window function in cluster mode (156 milliseconds)
 - SPARK-14244 fix window partition size attribute binding failure (156 milliseconds)
 - set spark.sql.warehouse.dir (172 milliseconds)
 - set hive.metastore.warehouse.dir (156 milliseconds)
 - SPARK-16901: set javax.jdo.option.ConnectionURL (157 milliseconds)
 - SPARK-18360: default table path of tables in default database should depend on the location of default database (172 milliseconds)
```

```
UtilsSuite:
 - resolveURIs with multiple paths (0 milliseconds)
```

```
CheckpointSuite:
 - recovery with file input stream (4 seconds, 452 milliseconds)
```

Note: after resolving the aborted tests, there is a test failure identified as below:

```
OrcSourceSuite:
- SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
```

This does not look due to this problem so this PR does not fix it here.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16451 from HyukjinKwon/all-path-resource-fixes.
2017-01-10 13:19:21 +00:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Ryan Williams afd9bc1d8a [SPARK-17807][CORE] split test-tags into test-JAR
Remove spark-tag's compile-scope dependency (and, indirectly, spark-core's compile-scope transitive-dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.

Alternative to #16303.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #16311 from ryan-williams/tt.
2016-12-21 16:37:20 -08:00
Shixiong Zhu 95efc895e9 [SPARK-18588][SS][KAFKA] Create a new KafkaConsumer when error happens to fix the flaky test
## What changes were proposed in this pull request?

When KafkaSource fails on Kafka errors, we should create a new consumer to retry rather than using the existing broken one because it's possible that the broken one will fail again.

This PR also assigns a new group id to the new created consumer for a possible race condition:  the broken consumer cannot talk with the Kafka cluster in `close` but the new consumer can talk to Kafka cluster. I'm not sure if this will happen or not. Just for safety to avoid that the Kafka cluster thinks there are two consumers with the same group id in a short time window. (Note: CachedKafkaConsumer doesn't need this fix since `assign` never uses the group id.)

## How was this patch tested?

In https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70370/console , it ran this flaky test 120 times and all passed.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16282 from zsxwing/kafka-fix.
2016-12-21 15:39:36 -08:00
Shixiong Zhu e104e55c16 [SPARK-18588][TESTS] Ignore KafkaSourceStressForDontFailOnDataLossSuite
## What changes were proposed in this pull request?

Disable KafkaSourceStressForDontFailOnDataLossSuite for now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16275 from zsxwing/ignore-flaky-test.
2016-12-13 18:36:36 -08:00
Takeshi YAMAMURO b08b500456
[SPARK-18620][STREAMING][KINESIS] Flatten input rates in timeline for streaming + kinesis
## What changes were proposed in this pull request?
This pr is to make input rates in timeline more flat for spark streaming + kinesis.
Since kinesis workers fetch records and push them into block generators in bulk, timeline in web UI has many spikes when `maxRates` applied (See a Figure.1 below). This fix splits fetched input records into multiple `adRecords` calls.

Figure.1 Apply `maxRates=500` in vanilla Spark
<img width="1084" alt="apply_limit in_vanilla_spark" src="https://cloud.githubusercontent.com/assets/692303/20823861/4602f300-b89b-11e6-95f3-164a37061305.png">

Figure.2 Apply `maxRates=500` in Spark with my patch
<img width="1056" alt="apply_limit in_spark_with_my_patch" src="https://cloud.githubusercontent.com/assets/692303/20823882/6c46352c-b89b-11e6-81ab-afd8abfe0cfe.png">

## How was this patch tested?
Add tests to check to split input records into multiple `addRecords` calls.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #16114 from maropu/SPARK-18620.
2016-12-10 05:32:04 +08:00
Tathagata Das 458fa3325e [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json
## What changes were proposed in this pull request?

- Changed FileStreamSource to use new FileStreamSourceOffset rather than LongOffset. The field is named as `logOffset` to make it more clear that this is a offset in the file stream log.
- Fixed bug in FileStreamSourceLog, the field endId in the FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier. Only my updated tests caught it.

Other minor changes
- Dont use batchId in the FileStreamSource, as calling it batch id is extremely miss leading. With multiple sources, it may happen that a new batch has no new data from a file source. So offset of FileStreamSource != batchId after that batch.

## How was this patch tested?

Updated unit test.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16205 from tdas/SPARK-18776.
2016-12-08 17:53:34 -08:00
Michael Armbrust 70b2bf717d [SPARK-18754][SS] Rename recentProgresses to recentProgress
Based on an informal survey, users find this option easier to understand / remember.

Author: Michael Armbrust <michael@databricks.com>

Closes #16182 from marmbrus/renameRecentProgress.
2016-12-07 15:36:29 -08:00
Shixiong Zhu edc87e1892 [SPARK-18588][TESTS] Fix flaky test: KafkaSourceStressForDontFailOnDataLossSuite
## What changes were proposed in this pull request?

Fixed the following failures:

```
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 3745 times over 1.0000790851666665 minutes. Last failure message: assertion failed: failOnDataLoss-0 not deleted after timeout.
```

```
sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 terminated with exception: null
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146)
Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
	at java.util.ArrayList.addAll(ArrayList.java:577)
	at org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257)
	at org.apache.kafka.clients.Metadata.update(Metadata.java:177)
	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605)
	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582)
	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260)
	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
	at
...
```

## How was this patch tested?

Tested in #16048 by running many times.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16109 from zsxwing/fix-kafka-flaky-test.
2016-12-07 13:47:44 -08:00
Tathagata Das 5c6bcdbda4 [SPARK-18671][SS][TEST-MAVEN] Follow up PR to fix test for Maven
## What changes were proposed in this pull request?

Maven compilation seem to not allow resource is sql/test to be easily referred to in kafka-0-10-sql tests. So moved the kafka-source-offset-version-2.1.0 from sql test resources to kafka-0-10-sql test resources.

## How was this patch tested?

Manually ran maven test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16183 from tdas/SPARK-18671-1.
2016-12-06 21:51:38 -08:00
Tathagata Das 1ef6b296d7 [SPARK-18671][SS][TEST] Added tests to ensure stability of that all Structured Streaming log formats
## What changes were proposed in this pull request?

To be able to restart StreamingQueries across Spark version, we have already made the logs (offset log, file source log, file sink log) use json. We should added tests with actual json files in the Spark such that any incompatible changes in reading the logs is immediately caught. This PR add tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.

## How was this patch tested?
new unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16128 from tdas/SPARK-18671.
2016-12-06 13:05:22 -08:00
Reynold Xin c7c7265950 [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT
## What changes were proposed in this pull request?
This patch bumps master branch version to 2.2.0-SNAPSHOT.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #16126 from rxin/SPARK-18695.
2016-12-02 21:09:37 -08:00
Tathagata Das c3d08e2f29 [SPARK-18516][SQL] Split state and progress in streaming
This PR separates the status of a `StreamingQuery` into two separate APIs:
 - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
 - `recentProgress` - an array of statistics about the most recent microbatches that have executed.

A recent progress contains the following information:
```
{
  "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
  "name" : "query-29",
  "timestamp" : 1479705392724,
  "inputRowsPerSecond" : 230.76923076923077,
  "processedRowsPerSecond" : 10.869565217391303,
  "durationMs" : {
    "triggerExecution" : 276,
    "queryPlanning" : 3,
    "getBatch" : 5,
    "getOffset" : 3,
    "addBatch" : 234,
    "walCommit" : 30
  },
  "currentWatermark" : 0,
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topic-14]]",
    "startOffset" : {
      "topic-14" : {
        "2" : 0,
        "4" : 1,
        "1" : 0,
        "3" : 0,
        "0" : 0
      }
    },
    "endOffset" : {
      "topic-14" : {
        "2" : 1,
        "4" : 2,
        "1" : 0,
        "3" : 0,
        "0" : 1
      }
    },
    "numRecords" : 3,
    "inputRowsPerSecond" : 230.76923076923077,
    "processedRowsPerSecond" : 10.869565217391303
  } ]
}
```

Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique with in the JVM to a `UUID` that is globally unique.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #15954 from marmbrus/queryProgress.
2016-11-29 17:24:17 -08:00
hyukjinkwon f830bb9170
[SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility in Java API documentation
## What changes were proposed in this pull request?

This PR make `sbt unidoc` complete with Java 8.

This PR roughly includes several fixes as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```diff
  - * A column that will be computed based on the data in a [[DataFrame]].
  + * A column that will be computed based on the data in a `DataFrame`.
  ```

- Fix throws annotations so that they are recognisable in javadoc

- Fix URL links to `<a href="http..."></a>`.

  ```diff
  - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
  + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
  + * Decision tree (Wikipedia)</a> model for regression.
  ```

  ```diff
  -   * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
  +   * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
  +   * Receiver operating characteristic (Wikipedia)</a>
  ```

- Fix < to > to

  - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable.

  - Wrap it with `{{{...}}}` to print them in javadoc or use `{code ...}` or `{literal ..}`. Please refer https://github.com/apache/spark/pull/16013#discussion_r89665558

- Fix `</p>` complaint

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
2016-11-29 09:41:32 +00:00
Shixiong Zhu 1633ff3b6c [SPARK-18588][SS][KAFKA] Ignore the flaky kafka test
## What changes were proposed in this pull request?

Ignore the flaky test to unblock other PRs while I'm debugging it.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16051 from zsxwing/ignore-flaky-kafka-test.
2016-11-28 21:04:20 -08:00
Shixiong Zhu d0212eb0f2 [SPARK-18530][SS][KAFKA] Change Kafka timestamp column type to TimestampType
## What changes were proposed in this pull request?

Changed Kafka timestamp column type to TimestampType.

## How was this patch tested?

`test("Kafka column types")`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15969 from zsxwing/SPARK-18530.
2016-11-22 16:49:15 -08:00
Shixiong Zhu 2fd101b2f0 [SPARK-18373][SPARK-18529][SS][KAFKA] Make failOnDataLoss=false work with Spark jobs
## What changes were proposed in this pull request?

This PR adds `CachedKafkaConsumer.getAndIgnoreLostData` to handle corner cases of `failOnDataLoss=false`.

It also resolves [SPARK-18529](https://issues.apache.org/jira/browse/SPARK-18529) after refactoring codes: Timeout will throw a TimeoutException.

## How was this patch tested?

Because I cannot find any way to manually control the Kafka server to clean up logs, it's impossible to write unit tests for each corner case. Therefore, I just created `test("stress test for failOnDataLoss=false")` which should cover most of corner cases.

I also modified some existing tests to test for both `failOnDataLoss=false` and `failOnDataLoss=true` to make sure it doesn't break existing logic.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15820 from zsxwing/failOnDataLoss.
2016-11-22 14:15:57 -08:00
hyukjinkwon d5b1d5fc80
[SPARK-18445][BUILD][DOCS] Fix the markdown for Note:/NOTE:/Note that/'''Note:''' across Scala/Java API documentation
## What changes were proposed in this pull request?

It seems in Scala/Java,

- `Note:`
- `NOTE:`
- `Note that`
- `'''Note:'''`
- `note`

This PR proposes to fix those to `note` to be consistent.

**Before**

- Scala
  ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)

- Java
  ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)

**After**

- Scala
  ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)

- Java
  ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)

## How was this patch tested?

The notes were found via

```bash
grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note that " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

And then fixed one by one comparing with API documentation/access modifiers.

After that, manually tested via `jekyll build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15889 from HyukjinKwon/SPARK-18437.
2016-11-19 11:24:15 +00:00
Sean Owen 43a26899e5
[SPARK-18400][STREAMING] NPE when resharding Kinesis Stream
## What changes were proposed in this pull request?

Avoid NPE in KinesisRecordProcessor when shutdown happens without successful init

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15882 from srowen/SPARK-18400.
2016-11-16 10:16:36 +00:00
cody koeninger 89d1fa58db [SPARK-17510][STREAMING][KAFKA] config max rate on a per-partition basis
## What changes were proposed in this pull request?

Allow configuration of max rate on a per-topicpartition basis.
## How was this patch tested?

Unit tests.

The reporter (Jeff Nadler) said he could test on his workload, so let's wait on that report.

Author: cody koeninger <cody@koeninger.org>

Closes #15132 from koeninger/SPARK-17510.
2016-11-14 11:10:37 -08:00
wm624@hotmail.com 22a9d064e9
[SPARK-14914][CORE] Fix Resource not closed after using, for unit tests and example
## What changes were proposed in this pull request?

This is a follow-up work of #15618.

Close file source;
For any newly created streaming context outside the withContext, explicitly close the context.

## How was this patch tested?

Existing unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15818 from wangmiao1981/rtest.
2016-11-10 10:54:36 +00:00
Tyson Condie 3f62e1b5d9 [SPARK-17829][SQL] Stable format for offset log
## What changes were proposed in this pull request?

Currently we use java serialization for the WAL that stores the offsets contained in each batch. This has two main issues:
It can break across spark releases (though this is not the only thing preventing us from upgrading a running query)
It is unnecessarily opaque to the user.
I'd propose we require offsets to provide a user readable serialization and use that instead. JSON is probably a good option.
## How was this patch tested?

Tests were added for KafkaSourceOffset in [KafkaSourceOffsetSuite](external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala) and for LongOffset in [OffsetSuite](sql/core/src/test/scala/org/apache/spark/sql/streaming/OffsetSuite.scala)

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

zsxwing marmbrus

Author: Tyson Condie <tcondie@gmail.com>
Author: Tyson Condie <tcondie@clash.local>

Closes #15626 from tcondie/spark-8360.
2016-11-09 15:03:22 -08:00
Tathagata Das b06c23db9a [SPARK-18283][STRUCTURED STREAMING][KAFKA] Added test to check whether default starting offset in latest
## What changes were proposed in this pull request?

Added test to check whether default starting offset in latest

## How was this patch tested?
new unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15778 from tdas/SPARK-18283.
2016-11-07 10:43:36 -08:00
cody koeninger 67659c9afa [SPARK-18212][SS][KAFKA] increase executor poll timeout
## What changes were proposed in this pull request?

Increase poll timeout to try and address flaky test

## How was this patch tested?

Ran existing unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #15737 from koeninger/SPARK-18212.
2016-11-03 14:43:25 -07:00
cody koeninger 1042325805 [SPARK-17813][SQL][KAFKA] Maximum data per trigger
## What changes were proposed in this pull request?

maxOffsetsPerTrigger option for rate limiting, proportionally based on volume of different topicpartitions.

## How was this patch tested?

Added unit test

Author: cody koeninger <cody@koeninger.org>

Closes #15527 from koeninger/SPARK-17813.
2016-10-27 10:30:59 -07:00
cody koeninger 268ccb9a48 [SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream
## What changes were proposed in this pull request?

startingOffsets takes specific per-topicpartition offsets as a json argument, usable with any consumer strategy

assign with specific topicpartitions as a consumer strategy

## How was this patch tested?

Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #15504 from koeninger/SPARK-17812.
2016-10-21 15:55:04 -07:00
jerryshao 947f4f2527 [SPARK-17999][KAFKA][SQL] Add getPreferredLocations for KafkaSourceRDD
## What changes were proposed in this pull request?

The newly implemented Structured Streaming `KafkaSource` did calculate the preferred locations for each topic partition, but didn't offer this information through RDD's `getPreferredLocations` method. So here propose to add this method in `KafkaSourceRDD`.

## How was this patch tested?

Manual verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #15545 from jerryshao/SPARK-17999.
2016-10-20 10:50:34 -07:00
cody koeninger cd106b050f [SPARK-17841][STREAMING][KAFKA] drain commitQueue
## What changes were proposed in this pull request?

Actually drain commit queue rather than just iterating it.
iterator() on a concurrent linked queue won't remove items from the queue, poll() will.

## How was this patch tested?
Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #15407 from koeninger/SPARK-17841.
2016-10-18 14:01:49 -07:00
Tathagata Das 7106866c22 [SPARK-17731][SQL][STREAMING] Metrics for structured streaming
## What changes were proposed in this pull request?

Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing

Specifically, this PR adds the following public APIs changes.

### New APIs
- `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)

- `StreamingQueryStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
  - processingRate - Current rate (rows/sec) at which the query is processing data from
                                  all the sources
  - ~~outputRate~~ - *Does not work with wholestage codegen*
  - latency - Current average latency between the data being available in source and the sink writing the corresponding output
  - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
  - sinkStatus: SinkStatus - Current status of the sink
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger
    - latencies - getOffset, getBatch, full trigger, wal writes
    - timestamps - trigger start, finish, after getOffset, after getBatch
    - numRows - input, output, state total/updated rows for aggregations

- `SourceStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by the source
  - processingRate - Current rate (rows/sec) at which the query is processing data from the source
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger

- Python API for `StreamingQuery.status()`

### Breaking changes to existing APIs
**Existing direct public facing APIs**
- Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
  - Branch 2.0 should have it deprecated, master should have it removed.

**Existing advanced listener APIs**
- `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
   - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in direct public-facing API (StreamingQuery.status)

- Field `queryInfo` in listener events `QueryStarted`, `QueryProgress`, `QueryTerminated` changed have name `queryStatus` and return type `StreamingQueryStatus`.

- Field `offsetDesc` in `SourceStatus` was Option[String], converted it to `String`.

- For `SourceStatus` and `SinkStatus` made constructor private instead of private[sql] to make them more java-safe. Instead added `private[sql] object SourceStatus/SinkStatus.apply()` which are harder to accidentally use in Java.

## How was this patch tested?

Old and new unit tests.
- Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
- New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
- New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
- Source-specific tests for making sure input rows are counted are is source-specific test suites.
- Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.

Metrics also manually tested using Ganglia sink

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15307 from tdas/SPARK-17731.
2016-10-13 13:36:26 -07:00
Shixiong Zhu 08eac35609 [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
## What changes were proposed in this pull request?

Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets.

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15397 from zsxwing/SPARK-17834.
2016-10-13 13:31:50 -07:00
cody koeninger f9a56a153e [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of poll twice
## What changes were proposed in this pull request?

Alternative approach to https://github.com/apache/spark/pull/15387

Author: cody koeninger <cody@koeninger.org>

Closes #15401 from koeninger/SPARK-17782-alt.
2016-10-12 15:22:06 -07:00
Shixiong Zhu 75b9e35141 [SPARK-17346][SQL][TESTS] Fix the flaky topic deletion in KafkaSourceStressSuite
## What changes were proposed in this pull request?

A follow up Pr for SPARK-17346 to fix flaky `org.apache.spark.sql.kafka010.KafkaSourceStressSuite`.

Test log: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1855/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/

Looks like deleting the Kafka internal topic `__consumer_offsets` is flaky. This PR just simply ignores internal topics.

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15384 from zsxwing/SPARK-17346-flaky-test.
2016-10-11 10:53:07 -07:00
Christian Kadner 49d11d4998 [SPARK-17803][TESTS] Upgrade docker-client dependency
[SPARK-17803: Docker integration tests don't run with "Docker for Mac"](https://issues.apache.org/jira/browse/SPARK-17803)

## What changes were proposed in this pull request?

This PR upgrades the [docker-client](https://mvnrepository.com/artifact/com.spotify/docker-client) dependency from [3.6.6](https://mvnrepository.com/artifact/com.spotify/docker-client/3.6.6) to [5.0.2](https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2) to enable _Docker for Mac_ users to run the `docker-integration-tests` out of the box.

The very latest docker-client version is [6.0.0](https://mvnrepository.com/artifact/com.spotify/docker-client/6.0.0) but that has one additional dependency and no usage yet.

## How was this patch tested?

The code change was tested on Mac OS X Yosemite with both _Docker Toolbox_ as well as _Docker for Mac_ and on Linux Ubuntu 14.04.

```
$ build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package

$ build/mvn -Pdocker-integration-tests -Pscala-2.11 -pl :spark-docker-integration-tests_2.11 clean compile test
```

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #15378 from ckadner/SPARK-17803_Docker_for_Mac.
2016-10-06 14:28:49 -07:00
Shixiong Zhu b678e465af [SPARK-17346][SQL][TEST-MAVEN] Generate the sql test jar to fix the maven build
## What changes were proposed in this pull request?

Generate the sql test jar to fix the maven build

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15368 from zsxwing/sql-test-jar.
2016-10-05 18:11:31 -07:00
Shixiong Zhu 9293734d35 [SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of work and part of them was inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
2016-10-05 16:45:45 -07:00
Josh Rosen d7ee12211a [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published
This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds.

The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly.
2016-09-21 11:38:10 -07:00
sureshthalamati cdea1d1343 [SPARK-17473][SQL] fixing docker integration tests error due to different versions of jars.
## What changes were proposed in this pull request?
Docker tests are using older version  of jersey jars (1.19),  which was used in older releases of spark.  In 2.0 releases Spark was upgraded to use 2.x verison of Jersey. After  upgrade to new versions, docker tests  are  failing with AbstractMethodError.  Now that spark is upgraded  to 2.x jersey version, using of  shaded docker jars  may not be required any more.  Removed the exclusions/overrides of jersey related classes from pom file, and changed the docker-client to use regular jar instead of shaded one.

## How was this patch tested?

Tested  using existing  docker-integration-tests

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #15114 from sureshthalamati/docker_testfix-spark-17473.
2016-09-19 09:56:16 -07:00
Adam Roberts fc1efb720c [SPARK-17534][TESTS] Increase timeouts for DirectKafkaStreamSuite tests
**## What changes were proposed in this pull request?**
There are two tests in this suite that are particularly flaky on the following hardware:

2x Intel(R) Xeon(R) CPU E5-2697 v2  2.70GHz and 16 GB of RAM, 1 TB HDD

This simple PR increases the timeout times and batch duration so they can reliably pass

**## How was this patch tested?**
Existing unit tests with the two core box where I was seeing the failures often

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #15094 from a-roberts/patch-6.
2016-09-16 10:20:50 +01:00
Liwei Lin 3ce3a282c8 [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14914 from lw-lin/append_to_plus_eq_v2.
2016-09-07 10:04:00 +01:00
Josh Rosen a133057ce5 [SPARK-17229][SQL] PostgresDialect shouldn't widen float and short types during reads
## What changes were proposed in this pull request?

When reading float4 and smallint columns from PostgreSQL, Spark's `PostgresDialect` widens these types to Decimal and Integer rather than using the narrower Float and Short types. According to https://www.postgresql.org/docs/7.1/static/datatype.html#DATATYPE-TABLE, Postgres maps the `smallint` type to a signed two-byte integer and the `real` / `float4` types to single precision floating point numbers.

This patch fixes this by adding more special-cases to `getCatalystType`, similar to what was done for the Derby JDBC dialect. I also fixed a similar problem in the write path which causes Spark to create integer columns in Postgres for what should have been ShortType columns.

## How was this patch tested?

New test cases in `PostgresIntegrationSuite` (which I ran manually because Jenkins can't run it right now).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14796 from JoshRosen/postgres-jdbc-type-fixes.
2016-08-25 23:22:40 +02:00
Luciano Resende 67f025d90e [SPARK-17023][BUILD] Upgrade to Kafka 0.10.0.1 release
## What changes were proposed in this pull request?
Update Kafka streaming connector to use Kafka 0.10.0.1 release

## How was this patch tested?
Tested via Spark unit and integration tests

Author: Luciano Resende <lresende@apache.org>

Closes #14606 from lresende/kafka-upgrade.
2016-08-13 11:42:38 +01:00
Mariusz Strzelecki 29081b587f [SPARK-16950] [PYSPARK] fromOffsets parameter support in KafkaUtils.createDirectStream for python3
## What changes were proposed in this pull request?

Ability to use KafkaUtils.createDirectStream with starting offsets in python 3 by using java.lang.Number instead of Long during param mapping in scala helper. This allows py4j to pass Integer or Long to the map and resolves ClassCastException problems.

## How was this patch tested?

unit tests

jerryshao  - could you please look at this PR?

Author: Mariusz Strzelecki <mariusz.strzelecki@allegrogroup.com>

Closes #14540 from szczeles/kafka_pyspark.
2016-08-09 09:44:43 -07:00
Holden Karau 9216901d52 [SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add much and remove whitelisting
## What changes were proposed in this pull request?

Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability.

## How was this patch tested?

Existing tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #14407 from holdenk/SPARK-16779.
2016-08-08 15:54:03 -07:00
Ekasit Kijsipongse 1f96c97f23 [SPARK-13238][CORE] Add ganglia dmax parameter
The current ganglia reporter doesn't set metric expiration time (dmax). The metrics of all finished applications are indefinitely left displayed in ganglia web. The dmax parameter allows user to set the lifetime of the metrics. The default value is 0 for compatibility with previous versions.

Author: Ekasit Kijsipongse <ekasitk@gmail.com>

Closes #11127 from ekasitk/ganglia-dmax.
2016-08-05 13:07:52 -07:00
Yuming Wang 39a2b2ea74 [SPARK-16625][SQL] General data types to be mapped to Oracle
## What changes were proposed in this pull request?

Spark will convert **BooleanType** to **BIT(1)**, **LongType** to **BIGINT**, **ByteType**  to **BYTE** when saving DataFrame to Oracle, but Oracle does not support BIT, BIGINT and BYTE types.

This PR is convert following _Spark Types_ to _Oracle types_ refer to [Oracle Developer's Guide](https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/)

Spark Type | Oracle
----|----
BooleanType | NUMBER(1)
IntegerType | NUMBER(10)
LongType | NUMBER(19)
FloatType | NUMBER(19, 4)
DoubleType | NUMBER(19, 4)
ByteType | NUMBER(3)
ShortType | NUMBER(5)

## How was this patch tested?

Add new tests in [JDBCSuite.scala](22b0c2a422 (diff-dc4b58851b084b274df6fe6b189db84d)) and [OracleDialect.scala](22b0c2a422 (diff-5e0cadf526662f9281aa26315b3750ad))

Author: Yuming Wang <wgyumg@gmail.com>

Closes #14377 from wangyum/SPARK-16625.
2016-08-05 16:11:54 +01:00
hyukjinkwon f93ad4fe7c [SPARK-16776][STREAMING] Replace deprecated API in KafkaTestUtils for 0.10.0.
## What changes were proposed in this pull request?

This PR replaces the old Kafka API to 0.10.0 ones in `KafkaTestUtils`.

The change include:

 - `Producer` to `KafkaProducer`
 - Change configurations to equalvant ones. (I referred [here](http://kafka.apache.org/documentation.html#producerconfigs) for 0.10.0 and [here](http://kafka.apache.org/082/documentation.html#producerconfigs
) for old, 0.8.2).

This PR will remove the build warning as below:

```scala
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:71: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]   private var producer: Producer[String, String] = _
[WARNING]                         ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:181: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                    ^
[WARNING] .../spark/streaming/kafka010/KafkaTestUtils.scala:181: class ProducerConfig in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerConfig instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                                                 ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:182: class KeyedMessage in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerRecord instead.
[WARNING]     producer.send(messages.map { new KeyedMessage[String, String](topic, _ ) }: _*)
[WARNING]                                      ^
[WARNING] four warnings found
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 1 warning
```

## How was this patch tested?

Existing tests that use `KafkaTestUtils` should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14416 from HyukjinKwon/SPARK-16776.
2016-08-01 06:56:52 -07:00
Tathagata Das 03c27435ae [TEST][STREAMING] Fix flaky Kafka rate controlling test
## What changes were proposed in this pull request?

The current test is incorrect, because
- The expected number of messages does not take into account that the topic has 2 partitions, and rate is set per partition.
- Also in some cases, the test ran out of data in Kafka while waiting for the right amount of data per batch.

The PR
- Reduces the number of partitions to 1
- Adds more data to Kafka
- Runs with 0.5 second so that batches are created slowly

## How was this patch tested?
Ran many times locally, going to run it many times in Jenkins

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14361 from tdas/kafka-rate-test-fix.
2016-07-26 00:41:46 -07:00
Xin Ren 21a6dd2aef [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
https://issues.apache.org/jira/browse/SPARK-16535

## What changes were proposed in this pull request?

When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)

I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
```
<groupId>org.apache.spark</groupId>
```
As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1).

ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762

## How was this patch tested?

I've tested by re-building the project, and build succeeded.

Author: Xin Ren <iamshrek@126.com>

Closes #14189 from keypointt/SPARK-16535.
2016-07-19 11:59:46 +01:00
Reynold Xin ffcb6e055a [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14130 from rxin/SPARK-16477.
2016-07-11 09:42:56 -07:00
cody koeninger fd6e8f0e22 [SPARK-13569][STREAMING][KAFKA] pattern based topic subscription
## What changes were proposed in this pull request?
Allow for kafka topic subscriptions based on a regex pattern.

## How was this patch tested?
Unit tests, manual tests

Author: cody koeninger <cody@koeninger.org>

Closes #14026 from koeninger/SPARK-13569.
2016-07-08 17:47:58 -07:00
cody koeninger b8ebf63c1e [SPARK-16212][STREAMING][KAFKA] apply test tweaks from 0-10 to 0-8 as well
## What changes were proposed in this pull request?

Bring the kafka-0-8 subproject up to date with some test modifications from development on 0-10.

Main changes are
- eliminating waits on concurrent queue in favor of an assert on received results,
- atomics instead of volatile (although this probably doesn't matter)
- increasing uniqueness of topic names

## How was this patch tested?

Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #14073 from koeninger/kafka-0-8-test-direct-cleanup.
2016-07-06 16:21:41 -07:00
cody koeninger 1fca9da95d [SPARK-16212][STREAMING][KAFKA] use random port for embedded kafka
## What changes were proposed in this pull request?

Testing for 0.10 uncovered an issue with a fixed port number being used in KafkaTestUtils.  This is making a roughly equivalent fix for the 0.8 connector

## How was this patch tested?

Unit tests, manual tests

Author: cody koeninger <cody@koeninger.org>

Closes #14018 from koeninger/kafka-0-8-test-port.
2016-07-05 11:45:54 -07:00
cody koeninger fbfd0ab9d7 [SPARK-12177][STREAMING][KAFKA] limit api surface area
## What changes were proposed in this pull request?
This is an alternative to the refactoring proposed by https://github.com/apache/spark/pull/13996

## How was this patch tested?

unit tests
also tested under scala 2.10 via
mvn -Dscala-2.10

Author: cody koeninger <cody@koeninger.org>

Closes #13998 from koeninger/kafka-0-10-refactor.
2016-07-01 00:53:36 -07:00
cody koeninger c62263340e [SPARK-16212][STREAMING][KAFKA] code cleanup from review feedback
## What changes were proposed in this pull request?
code cleanup in kafka-0-8 to match suggested changes for kafka-0-10 branch

## How was this patch tested?
unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #13908 from koeninger/kafka-0-8-cleanup.
2016-06-30 13:16:58 -07:00
Tathagata Das de8ab313e1 [SPARK-12177][TEST] Removed test to avoid compilation issue in scala 2.10
## What changes were proposed in this pull request?

The commented lines failed scala 2.10 build. This is because of change in behavior of case classes between 2.10 and 2.11. In scala 2.10, if companion object of a case class has explicitly defined apply(), then the implicit apply method is not generated. In scala 2.11 it is generated. Hence, the lines compile fine in 2.11 but not in 2.10.

This simply comments the tests to fix broken build. Correct solution is pending.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13992 from tdas/SPARK-12177.
2016-06-30 18:06:04 +08:00
cody koeninger dedbceec1e [SPARK-12177][STREAMING][KAFKA] Update KafkaDStreams to new Kafka 0.10 Consumer API
## What changes were proposed in this pull request?

New Kafka consumer api for the released 0.10 version of Kafka

## How was this patch tested?

Unit tests, manual tests

Author: cody koeninger <cody@koeninger.org>

Closes #11863 from koeninger/kafka-0.9.
2016-06-29 23:21:03 -07:00
Sean Owen f51dfe616b [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API
## What changes were proposed in this pull request?

- Deprecate old Java accumulator API; should use Scala now
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13606 from srowen/SPARK-15086.
2016-06-12 11:44:33 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Marcelo Vanzin 57adb77e6b [SPARK-15451][BUILD] Use jdk7's rt.jar when available.
This helps with preventing jdk8-specific calls being checked in,
because PR builders are running the compiler with the wrong settings.

If the JAVA_7_HOME env variable is set, assume it points at
a jdk7 and use its rt.jar when invoking javac. For zinc, just run
it with jdk7, and disable it when building jdk8-specific code.

A big note for sbt usage: adding the bootstrap options forces sbt
to fork the compiler, and that disables incremental compilation.
That means that it's really not convenient to use for normal
development, but should be ok for automated builds.

Tested with JAVA_HOME=jdk8 and JAVA_7_HOME=jdk7:
- mvn + zinc
- mvn sans zinc
- sbt

Verified that in all cases, jdk8-specific library calls fail to
compile.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13272 from vanzin/SPARK-15451.
2016-05-31 16:54:34 -07:00
Reynold Xin 73178c7556 [SPARK-15633][MINOR] Make package name for Java tests consistent
## What changes were proposed in this pull request?
This is a simple patch that makes package names for Java 8 test suites consistent. I moved everything to test.org.apache.spark to we can test package private APIs properly. Also added "java8" as the package name so we can easily run all the tests related to Java 8.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13364 from rxin/SPARK-15633.
2016-05-27 21:20:02 -07:00
Shixiong Zhu c9c1c0e54d [SPARK-15508][STREAMING][TESTS] Fix flaky test: JavaKafkaStreamSuite.testKafkaStream
## What changes were proposed in this pull request?

`JavaKafkaStreamSuite.testKafkaStream` assumes when `sent.size == result.size`, the contents of `sent` and `result` should be same. However, that's not true. The content of `result` may not be the final content.

This PR modified the test to always retry the assertions even if the contents of `sent` and `result` are not same.

Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13281 from zsxwing/flaky-kafka-test.
2016-05-24 22:01:40 -07:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Sean Owen f5576a052d [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

(Retry of https://github.com/apache/spark/pull/13049)

- update to httpclient 4.5 / httpcore 4.4
- remove some defunct exclusions
- manage httpmime version to match
- update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)

## How was this patch tested?

Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`

Author: Sean Owen <sowen@cloudera.com>

Closes #13117 from srowen/SPARK-12972.2.
2016-05-15 15:56:46 +01:00
Sean Owen 10a8389674 Revert "[SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient"
This reverts commit c74a6c3f23.
2016-05-13 13:50:26 +01:00
Sean Owen c74a6c3f23 [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

- update httpcore/httpclient to latest
- centralize version management
- remove excludes that are no longer relevant according to SBT/Maven dep graphs
- also manage httpmime to match httpclient

## How was this patch tested?

Jenkins tests, plus review of dependency graphs from SBT/Maven, and review of test-dependencies.sh  output

Author: Sean Owen <sowen@cloudera.com>

Closes #13049 from srowen/SPARK-12972.
2016-05-13 09:00:50 +01:00
Brian O'Neill 81e3bfc16c [SPARK-14421] Upgrades protobuf dependency to 2.6.1 for the new version of KCL, and…
## What changes were proposed in this pull request?

When running with Kinesis Consumer Library (KCL), against a stream that contains aggregated data, the KCL needs access to protobuf to de-aggregate the records.   Without this patch, that results in the following error message:

```
   Caused by: java.lang.ClassNotFoundException: com.google.protobuf.ProtocolStringList
```

This PR upgrades the protobuf dependency within the kinesis-asl-assembly, and relocates that package (as not to conflict with Spark's use of 2.5.0), which fixes the above CNFE.

## How was this patch tested?

Used kinesis word count example against a stream containing aggregated data.

See: SPARK-14421

Author: Brian O'Neill <bone@alumni.brown.edu>

Closes #13054 from boneill42/protobuf-relocation-for-kcl.
2016-05-12 20:10:33 +01:00
cody koeninger 89e67d6667 [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions

## How was this patch tested?
Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #12946 from koeninger/SPARK-15085.
2016-05-11 12:15:41 -07:00
Xin Ren 86475520f8 [SPARK-14936][BUILD][TESTS] FlumePollingStreamSuite is slow
https://issues.apache.org/jira/browse/SPARK-14936

## What changes were proposed in this pull request?

FlumePollingStreamSuite contains two tests which run for a minute each. This seems excessively slow and we should speed it up if possible.

In this PR, instead of creating `StreamingContext` directly from `conf`, here an underlying `SparkContext` is created before all and it is used to create  each`StreamingContext`.

Running time is reduced by avoiding multiple `SparkContext` creations and destroys.

## How was this patch tested?

Tested on my local machine running `testOnly *.FlumePollingStreamSuite`

Author: Xin Ren <iamshrek@126.com>

Closes #12845 from keypointt/SPARK-14936.
2016-05-10 15:12:47 -07:00
Shixiong Zhu 9533f5390a [SPARK-6005][TESTS] Fix flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery
## What changes were proposed in this pull request?

Because this test extracts data from `DStream.generatedRDDs` before stopping, it may get data before checkpointing. Then after recovering from the checkpoint, `recoveredOffsetRanges` may contain something not in `offsetRangesBeforeStop`, which will fail the test. Adding `Thread.sleep(1000)` before `ssc.stop()` will reproduce this failure.

This PR just moves the logic of `offsetRangesBeforeStop` (also renamed to `offsetRangesAfterStop`) after `ssc.stop()` to fix the flaky test.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12903 from zsxwing/SPARK-6005.
2016-05-10 13:26:53 -07:00
Subhobrata Dey 89f73f6741 [SPARK-14642][SQL] import org.apache.spark.sql.expressions._ breaks udf under functions
## What changes were proposed in this pull request?

PR fixes the import issue which breaks udf functions.

The following code snippet throws an error

```
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

scala> udf((v: String) => v.stripSuffix("-abc"))
<console>:30: error: No TypeTag available for String
       udf((v: String) => v.stripSuffix("-abc"))
```

This PR resolves the issue.

## How was this patch tested?

patch tested with unit tests.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12458 from sbcd90/udfFuncBreak.
2016-05-10 12:32:56 -07:00
Luciano Resende a03c5e68ab [SPARK-14738][BUILD] Separate docker integration tests from main build
## What changes were proposed in this pull request?

Create a maven profile for executing the docker integration tests using maven
Remove docker integration tests from main sbt build
Update documentation on how to run docker integration tests from sbt

## How was this patch tested?

Manual test of the docker integration tests as in :
mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test

## Other comments

Note that the the DB2 Docker Tests are still disabled as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them on the right level before enabling those tests. They do run ok locally with the updates from PR #12348

Author: Luciano Resende <lresende@apache.org>

Closes #12508 from lresende/docker.
2016-05-06 12:25:45 +01:00
Luciano Resende 104430223e [SPARK-14589][SQL] Enhance DB2 JDBC Dialect docker tests
## What changes were proposed in this pull request?

Enhance the DB2 JDBC Dialect docker tests as they seemed to have had some issues on previous merge causing some tests to fail.

## How was this patch tested?

By running the integration tests locally.

Author: Luciano Resende <lresende@apache.org>

Closes #12348 from lresende/SPARK-14589.
2016-05-05 10:54:48 +01:00
mcheah b7fdc23ccc [SPARK-12154] Upgrade to Jersey 2
## What changes were proposed in this pull request?

Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.

## How was this patch tested?

I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.

Author: mcheah <mcheah@palantir.com>

Closes #12715 from mccheah/feature/upgrade-jersey.
2016-05-05 10:51:03 +01:00
Yin Huai 9c7c42bc6a Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local"
This reverts commit dae538a4d7.
2016-04-28 19:57:41 -07:00
Pravin Gadakh dae538a4d7 [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.

## How was this patch tested?

Scala-style checks passed.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12416 from pravingadakh/SPARK-14613.
2016-04-28 15:59:18 -07:00
Dongjoon Hyun d34d650378 [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors
## What changes were proposed in this pull request?

Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.

- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)

## How was this patch tested?

After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12632 from dongjoon-hyun/SPARK-14868.
2016-04-24 20:40:03 -07:00
Sean Owen 8bd05c9db2 [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
## What changes were proposed in this pull request?

`JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12418 from srowen/SPARK-8393.
2016-04-21 11:03:16 +01:00
Josh Rosen 90933e2afa [HOTFIX] Ignore all Docker integration tests
The Docker integration tests are failing very often (https://spark-tests.appspot.com/failed-tests) so I think we should disable these suites for now until we have time to improve them.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12549 from JoshRosen/ignore-all-docker-tests.
2016-04-20 20:30:43 -07:00
Luciano Resende 68450c8c6e [SPARK-14504][SQL] Enable Oracle docker tests
## What changes were proposed in this pull request?

Enable Oracle docker tests

## How was this patch tested?

Existing tests

Author: Luciano Resende <lresende@apache.org>

Closes #12270 from lresende/oracle.
2016-04-18 14:35:10 -07:00
hyukjinkwon 6fc3dc8839 [MINOR][SQL] Remove extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes extra anonymous closure within functional transformations.

For example,

```scala
.map(item => {
  ...
})
```

which can be just simply as below:

```scala
.map { item =>
  ...
}
```

## How was this patch tested?

Related unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12382 from HyukjinKwon/minor-extra-closers.
2016-04-14 09:43:41 +01:00
Dongjoon Hyun b0f5497e95 [SPARK-14508][BUILD] Add a new ScalaStyle Rule OmitBracesInCase
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
  ```
  case: Always omit braces in case clauses.
  ```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.

## How was this patch tested?

Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
2016-04-12 00:43:28 -07:00
Luciano Resende 94de63053e [SPARK-10521][SQL] Utilize Docker for test DB2 JDBC Dialect support
Add integration tests based on docker to test DB2 JDBC dialect support

Author: Luciano Resende <lresende@apache.org>

Closes #9893 from lresende/SPARK-10521.
2016-04-11 16:40:45 -07:00
Dongjoon Hyun aea30a1a9b [SPARK-14465][BUILD] Checkstyle should check all Java files
## What changes were proposed in this pull request?

Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
```xml
-<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
+<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
```

## How was this patch tested?

After passing the Jenkins build and manually `dev/lint-java`. (Note that Jenkins does not run `lint-java`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12242 from dongjoon-hyun/SPARK-14465.
2016-04-09 21:31:20 -07:00
Marcelo Vanzin 21d5ca128b [SPARK-14134][CORE] Change the package name used for shading classes.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.

Most changes are just noise to fix the logging configs.

For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11941 from vanzin/SPARK-14134.
2016-04-06 19:33:51 -07:00
Dongjoon Hyun d717ae1fd7 [SPARK-14444][BUILD] Add a new scalastyle NoScalaDoc to prevent ScalaDoc-style multiline comments
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
```
/** In Spark, we don't use the ScalaDoc style so this
  * is not correct.
  */
```

## How was this patch tested?

Pass the Jenkins tests (including `lint-scala`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12221 from dongjoon-hyun/SPARK-14444.
2016-04-06 16:02:55 -07:00
Eric Liang 7d29c72f64 [SPARK-14359] Unit tests for java 8 lambda syntax with typed aggregates
## What changes were proposed in this pull request?

Adds unit tests for java 8 lambda syntax with typed aggregates as a follow-up to #12168

## How was this patch tested?

Unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #12181 from ericl/sc-2794-2.
2016-04-05 21:22:20 -05:00
Dongjoon Hyun 3f749f7ed4 [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results
## What changes were proposed in this pull request?

This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.

## How was this patch tested?

Manual and pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12139 from dongjoon-hyun/SPARK-14355.
2016-04-03 18:14:16 -07:00
Dongjoon Hyun 4a6e78abd9 [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.
## What changes were proposed in this pull request?

This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes.
(All comment-only changes over 77 files: +786 lines, −747 lines)

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
2016-04-02 17:50:40 -07:00
Josh Rosen a7af6cd2ea [SPARK-14281][TESTS] Fix java8-tests and simplify their build
This patch fixes a compilation / build break in Spark's `java8-tests` and refactors their POM to simplify the build. See individual commit messages for more details.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12073 from JoshRosen/fix-java8-tests.
2016-03-31 13:52:59 -07:00
Shixiong Zhu 24587ce433 [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark
## What changes were proposed in this pull request?

This PR moves flume back to Spark as per the discussion in the dev mail-list.

## How was this patch tested?

Existing Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11895 from zsxwing/move-flume-back.
2016-03-25 17:37:16 -07:00
proflin c35c60fa91 [SPARK-14028][STREAMING][KINESIS][TESTS] Remove deprecated methods; fix two other warnings
## What changes were proposed in this pull request?

- Removed two methods that has been deprecated since 1.4
- Fixed two other compilation warnings

## How was this patch tested?

existing test suits

Author: proflin <proflin.me@gmail.com>

Closes #11850 from lw-lin/streaming-kinesis-deprecates-warnings.
2016-03-21 08:02:06 +00:00
Dongjoon Hyun 20fd254101 [SPARK-14011][CORE][SQL] Enable LineLength Java checkstyle rule
## What changes were proposed in this pull request?

[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.

```xml
-        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
-        <!--
         <module name="LineLength">
             <property name="max" value="100"/>
             <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
         </module>
-        -->
         <module name="NoLineWrap"/>
         <module name="EmptyBlock">
             <property name="option" value="TEXT"/>
 -167,5 +164,7
         </module>
         <module name="CommentsIndentation"/>
         <module name="UnusedImports"/>
+        <module name="RedundantImport"/>
+        <module name="RedundantModifier"/>
```

## How was this patch tested?

Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11831 from dongjoon-hyun/SPARK-14011.
2016-03-21 07:58:57 +00:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
Shixiong Zhu 06dec37455 [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
## What changes were proposed in this pull request?

Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages

- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter

They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster.

I have already copied these projects to https://github.com/spark-packages

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11672 from zsxwing/remove-external-pkg.
2016-03-14 16:56:04 -07:00
Josh Rosen 07cb323e7a [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
2016-03-14 12:22:02 -07:00
Wilson Wu 31d069d4c2 [SPARK-13746][TESTS] stop using deprecated SynchronizedSet
trait SynchronizedSet in package mutable is deprecated

Author: Wilson Wu <wilson888888888@gmail.com>

Closes #11580 from wilson888888888/spark-synchronizedset.
2016-03-14 09:13:29 +00:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Sean Owen 256704c771 [SPARK-13595][BUILD] Move docker, extras modules into external
## What changes were proposed in this pull request?

Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`

## How was this patch tested?

This is tested with Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11523 from srowen/SPARK-13595.
2016-03-09 18:27:44 +00:00
Jason White f19228eed8 [SPARK-12073][STREAMING] backpressure rate controller consumes events preferentially from lagg…
…ing partitions

I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure.

`maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues), but the rate estimator calculates only 100k total messages can be retrieved, each partition (out of say 32) only retrieves max 100k/32=3125 messages.

This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%.

Assuming all of 100k the messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch, until it reaches a new stable point or the backlog is finished processed.

We're going to try deploying this internally at Shopify to see if this resolves our issue.

tdas koeninger holdenk

Author: Jason White <jason.white@shopify.com>

Closes #10089 from JasonMWhite/rate_controller_offsets.
2016-03-04 16:04:56 -08:00
Dongjoon Hyun b5f02d6743 [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule
## What changes were proposed in this pull request?

After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.

## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11438 from dongjoon-hyun/SPARK-13583.
2016-03-03 10:12:32 +00:00
Sean Owen e97fc7f176 [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x
## What changes were proposed in this pull request?

Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:

- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map

The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.

## How was the this patch tested?

Existing Jenkins unit tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11292 from srowen/SPARK-13423.
2016-03-03 09:54:09 +00:00
Lin Zhao fb8bb04766 [SPARK-13069][STREAMING] Add "ask" style store() to ActorReciever
Introduces a "ask" style ```store``` in ```ActorReceiver``` as a way to allow actor receiver blocked by back pressure or maxRate.

Author: Lin Zhao <lin@exabeam.com>

Closes #11176 from lin-zhao/SPARK-13069.
2016-02-25 12:32:24 -08:00
Huaxin Gao 8f35d3eac9 [SPARK-13186][STREAMING] migrate away from SynchronizedMap
trait SynchronizedMap in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead.

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #11250 from huaxingao/spark__13186.
2016-02-22 09:44:32 +00:00
Holden Karau 159198eff6 [SPARK-13165][STREAMING] Replace deprecated synchronizedBuffer in streaming
Building with Scala 2.11 results in the warning trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative - we already use ConcurrentLinkedQueue elsewhere so lets replace it.

Some notes about how behaviour is different for reviewers:
The Seq from a SynchronizedBuffer that was implicitly converted would continue to receive updates - however when we do the same conversion explicitly on the ConcurrentLinkedQueue this isn't the case. Hence changing some of the (internal & test) APIs to pass an Iterable. toSeq is safe to use if there are no more updates.

Author: Holden Karau <holden@us.ibm.com>
Author: tedyu <yuzhihong@gmail.com>

Closes #11067 from holdenk/SPARK-13165-replace-deprecated-synchronizedBuffer-in-streaming.
2016-02-09 08:44:56 +00:00
cody koeninger 140ddef373 [SPARK-10963][STREAMING][KAFKA] make KafkaCluster public
Author: cody koeninger <cody@koeninger.org>

Closes #9007 from koeninger/SPARK-10963.
2016-02-07 12:52:00 +00:00
Josh Rosen 289373b28c [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).

The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).

After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10608 from JoshRosen/SPARK-6363.
2016-01-30 00:20:28 -08:00
Shixiong Zhu d8fefab4d8 [HOTFIX][BUILD][TEST-MAVEN] Remove duplicate dependency
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10868 from zsxwing/hotfix-akka-pom.
2016-01-22 12:33:18 -08:00
Shixiong Zhu b7d74a602f [SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project
Include the following changes:

1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
3. Update the ActorWordCount example and add the JavaActorWordCount example
4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10744 from zsxwing/streaming-akka-2.
2016-01-20 13:55:41 -08:00
Kousuke Saruta 39ae04e6b7 [SPARK-12692][BUILD][STREAMING] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10685 from sarutak/SPARK-12692-followup-streaming.
2016-01-11 21:06:22 -08:00
Jacek Laskowski b313badaa0 [STREAMING][MINOR] Typo fixes
Author: Jacek Laskowski <jacek@japila.pl>

Closes #10698 from jaceklaskowski/streaming-kafka-typo-fixes.
2016-01-11 11:29:15 -08:00
Josh Rosen 090d691323 [SPARK-4628][BUILD] Remove all non-Maven-Central repositories from build
This patch removes all non-Maven-central repositories from Spark's build, thereby avoiding any risk of future build-breaks due to us accidentally depending on an artifact which is not present in an immutable public Maven repository.

I tested this by running

```
build/mvn \
        -Phive \
        -Phive-thriftserver \
        -Pkinesis-asl \
        -Pspark-ganglia-lgpl \
        -Pyarn \
        dependency:go-offline
```

inside of a fresh Ubuntu Docker container with no Ivy or Maven caches (I did a similar test for SBT).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10659 from JoshRosen/SPARK-4628.
2016-01-08 20:58:53 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Shixiong Zhu c0c397509b [SPARK-12510][STREAMING] Refactor ActorReceiver to support Java
This PR includes the following changes:

1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
2. Remove `ActorHelper`
3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
4. Add `JavaActorWordCount` example

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10457 from zsxwing/java-actor-stream.
2016-01-07 15:26:55 -08:00
Marcelo Vanzin b3ba1be3b7 [SPARK-3873][TESTS] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10582 from vanzin/SPARK-3873-tests.
2016-01-05 19:07:39 -08:00
Marcelo Vanzin efb10cc9ad [SPARK-3873][STREAMING] Import order fixes for streaming.
Also included a few miscelaneous other modules that had very few violations.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10532 from vanzin/SPARK-3873-streaming.
2015-12-31 01:34:13 -08:00
Reynold Xin f496031bd2 Bump master version to 2.0.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #10387 from rxin/version-bump.
2015-12-19 15:13:05 -08:00
cody koeninger 48a9804b2a [SPARK-12103][STREAMING][KAFKA][DOC] document that K means Key and V …
…means Value

Author: cody koeninger <cody@koeninger.org>

Closes #10132 from koeninger/SPARK-12103.
2015-12-08 11:02:35 +00:00
Shixiong Zhu 3af53e61fd [SPARK-12084][CORE] Fix codes that uses ByteBuffer.array incorrectly
`ByteBuffer` doesn't guarantee all contents in `ByteBuffer.array` are valid. E.g, a ByteBuffer returned by `ByteBuffer.slice`. We should not use the whole content of `ByteBuffer` unless we know that's correct.

This patch fixed all places that use `ByteBuffer.array` incorrectly.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10083 from zsxwing/bytebuffer-array.
2015-12-04 17:02:04 -08:00
Josh Rosen b7204e1d41 [SPARK-12112][BUILD] Upgrade to SBT 0.13.9
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
2015-12-05 08:15:30 +08:00
microwishing 95b3cf125b [DOCUMENTATION][KAFKA] fix typo in kafka/OffsetRange.scala
this is to fix some typo in external/kafka/src/main/scala/org/apache/spark/streaming/kafka/OffsetRange.scala

Author: microwishing <wei.zhu@kaiyuandao.com>

Closes #10121 from microwishing/master.
2015-12-03 16:09:05 +00:00
Prashant Sharma bf0e85a70a [SPARK-12023][BUILD] Fix warnings while packaging spark with maven.
this is a trivial fix, discussed [here](http://stackoverflow.com/questions/28500401/maven-assembly-plugin-warning-the-assembly-descriptor-contains-a-filesystem-roo/).

Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #10014 from ScrapCodes/assembly-warning.
2015-11-30 10:11:27 +00:00
jerryshao 75a2922910 [SPARK-9065][STREAMING][PYSPARK] Add MessageHandler for Kafka Python API
Fixed the merge conflicts in #7410

Closes #7410

Author: Shixiong Zhu <shixiong@databricks.com>
Author: jerryshao <saisai.shao@intel.com>
Author: jerryshao <sshao@hortonworks.com>

Closes #9742 from zsxwing/pr7410.
2015-11-17 16:57:52 -08:00
Shixiong Zhu 3720b1480c [SPARK-11790][STREAMING][TESTS] Increase the connection timeout
Sometimes, EmbeddedZookeeper may need more than 6 seconds to setup up in a slow Jenkins worker. So just increase the timeout, it won't increase the test time if the test passes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9778 from zsxwing/SPARK-11790.
2015-11-17 15:47:39 -08:00
Shixiong Zhu ec80c0c2fc [SPARK-11706][STREAMING] Fix the bug that Streaming Python tests cannot report failures
This PR just checks the test results and returns 1 if the test fails, so that `run-tests.py` can mark it fail.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9669 from zsxwing/streaming-python-tests.
2015-11-13 00:30:27 -08:00
Tathagata Das 6600786ddd [SPARK-11361][STREAMING] Show scopes of RDD operations inside DStream.foreachRDD and DStream.transform in DAG viz
Currently, when a DStream sets the scope for RDD generated by it, that scope is not allowed to be overridden by the RDD operations. So in case of `DStream.foreachRDD`, all the RDDs generated inside the foreachRDD get the same scope - `foreachRDD  <time>`, as set by the `ForeachDStream`. So it is hard to debug generated RDDs in the RDD DAG viz in the Spark UI.

This patch allows the RDD operations inside `DStream.transform` and `DStream.foreachRDD` to append their own scopes to the earlier DStream scope.

I have also slightly tweaked how callsites are set such that the short callsite reflects the RDD operation name and line number. This tweak is necessary as callsites are not managed through scopes (which support nesting and overriding) and I didnt want to add another local property to control nesting and overriding of callsites.

## Before:
![image](https://cloud.githubusercontent.com/assets/663212/10808548/fa71c0c4-7da9-11e5-9af0-5737793a146f.png)

## After:
![image](https://cloud.githubusercontent.com/assets/663212/10808659/37bc45b6-7dab-11e5-8041-c20be6a9bc26.png)

The code that was used to generate this is:
```
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.foreachRDD { rdd =>
      val temp = rdd.map { _ -> 1 }.reduceByKey( _ + _)
      val temp2 = temp.map { _ -> 1}.reduceByKey(_ + _)
      val count = temp2.count
      println(count)
    }
```

Note
- The inner scopes of the RDD operations map/reduceByKey inside foreachRDD is visible
- The short callsites of stages refers to the line number of the RDD ops rather than the same line number of foreachRDD in all three cases.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9315 from tdas/SPARK-11361.
2015-11-10 16:54:06 -08:00
dima e5bc8c2757 [SPARK-11245] update twitter4j to 4.0.4 version
update twitter4j to 4.0.4 version
https://issues.apache.org/jira/browse/SPARK-11245

Author: dima <pronix.service@gmail.com>

Closes #9221 from pronix/twitter4j_update.
2015-10-24 18:16:45 +01:00
Hari Shreedharan fa3e4d8f52 [SPARK-11019] [STREAMING] [FLUME] Gracefully shutdown Flume receiver th…
…reads.

Wait for a minute for the receiver threads to shutdown before interrupting them.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #9041 from harishreedharan/flume-graceful-shutdown.
2015-10-08 18:50:27 -07:00
Marcelo Vanzin 94fc57afdf [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8775 from vanzin/SPARK-10300.
2015-10-07 14:11:21 -07:00
Marcelo Vanzin b42059d2ef Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py."
This reverts commit 8abef21dac.
2015-09-15 13:03:38 -07:00
Marcelo Vanzin 8abef21dac [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.
This change does two things:

- tag a few tests and adds the mechanism in the build to be able to disable those tags,
  both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
  changed; that's used to disable expensive tests when a module hasn't explicitly
  been changed, to speed up testing for changes that don't directly affect those
  modules.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8437 from vanzin/test-tags.
2015-09-15 10:45:02 -07:00
Reynold Xin 09b7e7c198 Update version to 1.6.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #8350 from rxin/1.6.
2015-09-15 00:54:20 -07:00
Sean Owen 22730ad54d [SPARK-10547] [TEST] Streamline / improve style of Java API tests
Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order

Author: Sean Owen <sowen@cloudera.com>

Closes #8706 from srowen/SPARK-10547.
2015-09-12 10:40:10 +01:00
Luc Bourlier c1bc4f439f [SPARK-10227] fatal warnings with sbt on Scala 2.11
The bulk of the changes are on `transient` annotation on class parameter. Often the compiler doesn't generate a field for this parameters, so the the transient annotation would be unnecessary.
But if the class parameter are used in methods, then fields are created. So it is safer to keep the annotations.

The remainder are some potential bugs, and deprecated syntax.

Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #8433 from skyluc/issue/sbt-2.11.
2015-09-09 09:57:58 +01:00
Sean Owen 69c9c17716 [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`

Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.

Author: Sean Owen <sowen@cloudera.com>

Closes #8033 from srowen/SPARK-9613.
2015-08-25 12:33:13 +01:00
cody koeninger d9c25dec87 [SPARK-9786] [STREAMING] [KAFKA] fix backpressure so it works with defa…
…ult maxRatePerPartition setting of 0

Author: cody koeninger <cody@koeninger.org>

Closes #8413 from koeninger/backpressure-testing-master.
2015-08-24 23:26:14 -07:00
Tathagata Das 7478c8b66d [SPARK-9791] [PACKAGE] Change private class to private class to prevent unnecessary classes from showing up in the docs
In addition, some random cleanup of import ordering

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8387 from tdas/SPARK-9791 and squashes the following commits:

67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs
2015-08-24 12:40:09 -07:00
zsxwing 4e0395ddb7 [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars
This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build.

I ran ` mvn -Pkinesis-asl -DskipTests clean install` locally, and verified the jars in my local repository were correct. I also checked Python tests for maven build, and it passed all tests.

Author: zsxwing <zsxwing@gmail.com>

Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:

e0b5818 [zsxwing] Fix the sbt build
c697627 [zsxwing] Add the jar pathes to the exception message
be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
2015-08-24 12:38:01 -07:00
zsxwing bf1d6614dc [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars
Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars.

Author: zsxwing <zsxwing@gmail.com>

Closes #8069 from zsxwing/SPARK-9574.
2015-08-18 13:35:45 -07:00
cody koeninger 8ce60963cb [SPARK-9780] [STREAMING] [KAFKA] prevent NPE if KafkaRDD instantiation …
…fails

Author: cody koeninger <cody@koeninger.org>

Closes #8133 from koeninger/SPARK-9780 and squashes the following commits:

406259d [cody koeninger] [SPARK-9780][Streaming][Kafka] prevent NPE if KafkaRDD instantiation fails
2015-08-12 17:44:16 -07:00
Prabeesh K 853809e948 [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python
This PR is based on #4229, thanks prabeesh.

Closes #4229

Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>

Closes #7833 from zsxwing/pr4229 and squashes the following commits:

9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
2015-08-10 16:33:23 -07:00
cody koeninger 1723e34893 [DOCS] [STREAMING] make the existing parameter docs for OffsetRange ac…
…tually visible

Author: cody koeninger <cody@koeninger.org>

Closes #7995 from koeninger/doc-fixes and squashes the following commits:

87af9ea [cody koeninger] [Docs][Streaming] make the existing parameter docs for OffsetRange actually visible
2015-08-06 14:37:25 -07:00
Tathagata Das 0a078303d0 [SPARK-9556] [SPARK-9619] [SPARK-9624] [STREAMING] Make BlockGenerator more robust and make all BlockGenerators subscribe to rate limit updates
In some receivers, instead of using the default `BlockGenerator` in `ReceiverSupervisorImpl`, custom generator with their custom listeners are used for reliability (see [`ReliableKafkaReceiver`](https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L99) and [updated `KinesisReceiver`](https://github.com/apache/spark/pull/7825/files)). These custom generators do not receive rate updates. This PR modifies the code to allow custom `BlockGenerator`s to be created through the `ReceiverSupervisorImpl` so that they can be kept track and rate updates can be applied.

In the process, I did some simplification, and de-flaki-fication of some rate controller related tests. In particular.
- Renamed `Receiver.executor` to `Receiver.supervisor` (to match `ReceiverSupervisor`)
- Made `RateControllerSuite` faster (by increasing batch interval) and less flaky
- Changed a few internal API to return the current rate of block generators as Long instead of Option\[Long\] (was inconsistent at places).
- Updated existing `ReceiverTrackerSuite` to test that custom block generators get rate updates as well.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #7913 from tdas/SPARK-9556 and squashes the following commits:

41d4461 [Tathagata Das] fix scala style
eb9fd59 [Tathagata Das] Updated kinesis receiver
d24994d [Tathagata Das] Updated BlockGeneratorSuite to use manual clock in BlockGenerator
d70608b [Tathagata Das] Updated BlockGenerator with states and proper synchronization
f6bd47e [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9556
31da173 [Tathagata Das] Fix bug
12116df [Tathagata Das] Add BlockGeneratorSuite
74bd069 [Tathagata Das] Fix style
989bb5c [Tathagata Das] Made BlockGenerator fail is used after stop, and added better unit tests for it
3ff618c [Tathagata Das] Fix test
b40eff8 [Tathagata Das] slight refactoring
f0df0f1 [Tathagata Das] Scala style fixes
51759cb [Tathagata Das] Refactored rate controller tests and added the ability to update rate of any custom block generator
2015-08-06 14:35:30 -07:00
Nilanjan Raychaudhuri a1bbf1bc5c [SPARK-8978] [STREAMING] Implements the DirectKafkaRateController
Author: Dean Wampler <dean@concurrentthought.com>
Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
Author: François Garillot <francois@garillot.net>

Closes #7796 from dragos/topic/streaming-bp/kafka-direct and squashes the following commits:

50d1f21 [Nilanjan Raychaudhuri] Taking care of the remaining nits
648c8b1 [Dean Wampler] Refactored rate controller test to be more predictable and run faster.
e43f678 [Nilanjan Raychaudhuri] fixing doc and nits
ce19d2a [Dean Wampler] Removing an unreliable assertion.
9615320 [Dean Wampler] Give me a break...
6372478 [Dean Wampler] Found a few ways to make this test more robust...
9e69e37 [Dean Wampler] Attempt to fix flakey test that fails in CI, but not locally :(
d3db1ea [Dean Wampler] Fixing stylecheck errors.
d04a288 [Nilanjan Raychaudhuri] adding test to make sure rate controller is used to calculate maxMessagesPerPartition
b6ecb67 [Nilanjan Raychaudhuri] Fixed styling issue
3110267 [Nilanjan Raychaudhuri] [SPARK-8978][Streaming] Implements the DirectKafkaRateController
393c580 [François Garillot] [SPARK-8978][Streaming] Implements the DirectKafkaRateController
51e78c6 [Nilanjan Raychaudhuri] Rename and fix build failure
2795509 [Nilanjan Raychaudhuri] Added missing RateController
19200f5 [Dean Wampler] Removed usage of infix notation. Changed a private variable name to be more consistent with usage.
aa4a70b [François Garillot] [SPARK-8978][Streaming] Implements the DirectKafkaController
2015-08-06 12:50:08 -07:00
Sean Owen 76d74090d6 [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.

I'll explain several of the changes inline in comments.

Author: Sean Owen <sowen@cloudera.com>

Closes #7862 from srowen/SPARK-9534 and squashes the following commits:

ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
2015-08-04 12:02:26 +01:00
Josh Rosen b217230f2a [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
Spark has an option called spark.localExecution.enabled; according to the docs:

> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.

This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.

This pull request simply brings #7484 up to date.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7585 from rxin/remove-local-exec and squashes the following commits:

84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
2015-07-22 21:04:04 -07:00
Josh Rosen 11e5c37286 [SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses
This pull request adds a Scalastyle regex rule which fails the style check if `Class.forName` is used directly.  `Class.forName` always loads classes from the default / system classloader, but in a majority of cases, we should be using Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader which loaded Spark when the context classloader is not defined.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7350)
<!-- Reviewable:end -->

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7350 from JoshRosen/ban-Class.forName and squashes the following commits:

e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
c0b7885 [Josh Rosen] Hopefully fix the last two cases
d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass
046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion.
d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName
2015-07-14 16:08:17 -07:00
Jonathan Alter e14b545d2d [SPARK-7977] [BUILD] Disallowing println
Author: Jonathan Alter <jonalter@users.noreply.github.com>

Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:

ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
10724b6 [Jonathan Alter] Changing some printlns to logs in tests
eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0b1dcb4 [Jonathan Alter] More println cleanup
aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0c16fa3 [Jonathan Alter] Replacing some printlns with logs
45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
5c8e283 [Jonathan Alter] Allowing println in audit-release examples
5b50da1 [Jonathan Alter] Allowing printlns in example files
ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
83ab635 [Jonathan Alter] Fixing new printlns
54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
b837c3a [Jonathan Alter] Disallowing println
2015-07-10 11:34:01 +01:00
Marcelo Vanzin 0e78e40c0b [SPARK-8852] [FLUME] Trim dependencies in flume assembly.
Also, add support for the *-provided profiles. This avoids repackaging
things that are already in the Spark assembly, or, in the case of the
*-provided profiles, are provided by the distribution.

The flume-ng-auth dependency was also excluded since it's not really
used by Spark.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7247 from vanzin/SPARK-8852 and squashes the following commits:

298a7d5 [Marcelo Vanzin] Feedback.
c962082 [Marcelo Vanzin] [SPARK-8852] [flume] Trim dependencies in flume assembly.
2015-07-09 18:23:06 -07:00
guowei2 897700369f [SPARK-8865] [STREAMING] FIX BUG: check key in kafka params
Author: guowei2 <guowei@growingio.com>

Closes #7254 from guowei2/spark-8865 and squashes the following commits:

48ca17a [guowei2] fix contains key
2015-07-09 15:01:53 -07:00
jerryshao 3ccebf36c5 [SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Python
This PR propose a simple way to expose OffsetRange in Python code, also the usage of offsetRanges is similar to Scala/Java way, here in Python we could get OffsetRange like:

```
dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
```

Reason I didn't follow the way what SPARK-8389 suggested is that: Python Kafka API has one more step to decode the message compared to Scala/Java, Which makes Python API return a transformed RDD/DStream, not directly wrapped so-called JavaKafkaRDD, so it is hard to backtrack to the original RDD to get the offsetRange.

Author: jerryshao <saisai.shao@intel.com>

Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:

4c6d320 [jerryshao] Another way to fix subclass deserialization issue
e6a8011 [jerryshao] Address the comments
fd13937 [jerryshao] Fix serialization bug
7debf1c [jerryshao] bug fix
cff3893 [jerryshao] refactor the code according to the comments
2aabf9e [jerryshao] Style fix
848c708 [jerryshao] Add HasOffsetRanges for Python
2015-07-09 13:54:44 -07:00
zsxwing 1f6b0b1234 [SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch page
This PR adds `metadata` to `InputInfo`. `InputDStream` can report its metadata for a batch and it will be shown in the batch page.

For example,

![screen shot](https://cloud.githubusercontent.com/assets/1000778/8403741/d6ffc7e2-1e79-11e5-9888-c78c1575123a.png)

FileInputDStream will display the new files for a batch, and DirectKafkaInputDStream will display its offset ranges.

Author: zsxwing <zsxwing@gmail.com>

Closes #7081 from zsxwing/input-metadata and squashes the following commits:

f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
d906209 [zsxwing] Merge branch 'master' into input-metadata
74762da [zsxwing] Fix MiMa tests
7903e33 [zsxwing] Merge branch 'master' into input-metadata
450a46c [zsxwing] Address comments
1d94582 [zsxwing] Raname InputInfo to StreamInputInfo and change "metadata" to Map[String, Any]
d496ae9 [zsxwing] Add input metadata in the batch page
2015-07-09 13:48:29 -07:00
jerryshao 8a9d9cc156 [SPARK-7050] [BUILD] Fix Python Kafka test assembly jar not found issue under Maven build
To fix Spark Streaming unit test with maven build. Previously the name and path of maven generated jar is different from sbt, which will lead to following exception. This fix keep the same behavior with both Maven and sbt build.

```
Failed to find Spark Streaming Kafka assembly jar in /home/xyz/spark/external/kafka-assembly
You need to build Spark with  'build/sbt assembly/assembly streaming-kafka-assembly/assembly' or 'build/mvn package' before running this program
```

Author: jerryshao <saisai.shao@intel.com>

Closes #5632 from jerryshao/SPARK-7050 and squashes the following commits:

74b068d [jerryshao] Fix mvn build issue
2015-07-08 12:23:32 +01:00
zsxwing 75b9fe4c5f [SPARK-8378] [STREAMING] Add the Python API for Flume
Author: zsxwing <zsxwing@gmail.com>

Closes #6830 from zsxwing/flume-python and squashes the following commits:

78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
2015-07-01 11:59:24 -07:00
Hari Shreedharan 9b618fb0d2 [SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Si…
…nk. Also bump Flume version to 1.6.0

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6910 from harishreedharan/remove-commons-lang3 and squashes the following commits:

9875f7d [Hari Shreedharan] Revert back to Flume 1.4.0
ca35eb0 [Hari Shreedharan] [SPARK-8483][Streaming] Remove commons-lang3 dependency from Flume Sink. Also bump Flume version to 1.6.0
2015-06-22 23:34:17 -07:00
cody koeninger 1b6fe9b1a7 [SPARK-8127] [STREAMING] [KAFKA] KafkaRDD optimize count() take() isEmpty()
…ed KafkaRDD methods.  Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.

Author: cody koeninger <cody@koeninger.org>

Closes #6632 from koeninger/kafka-rdd-count and squashes the following commits:

321340d [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of ordering of take()
5a05d0f [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of isEmpty
f68bd32 [cody koeninger] [Streaming][Kafka][SPARK-8127] code cleanup
9555b73 [cody koeninger] Merge branch 'master' into kafka-rdd-count
253031d [cody koeninger] [Streaming][Kafka][SPARK-8127] mima exclusion for change to private method
8974b9e [cody koeninger] [Streaming][Kafka][SPARK-8127] check offset ranges before constructing KafkaRDD
c3768c5 [cody koeninger] [Streaming][Kafka] Take advantage of offset range info for size-related KafkaRDD methods.  Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.
2015-06-19 18:54:07 -07:00
cody koeninger b305e377fb [SPARK-8390] [STREAMING] [KAFKA] fix docs related to HasOffsetRanges
Author: cody koeninger <cody@koeninger.org>

Closes #6863 from koeninger/SPARK-8390 and squashes the following commits:

26a06bd [cody koeninger] Merge branch 'master' into SPARK-8390
3744492 [cody koeninger] [Streaming][Kafka][SPARK-8390] doc changes per TD, test to make sure approach shown in docs actually compiles + runs
b108c9d [cody koeninger] [Streaming][Kafka][SPARK-8390] further doc fixes, clean up spacing
bb4336b [cody koeninger] [Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup
3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
2015-06-19 17:18:31 -07:00
cody koeninger 47af7c1ebf [SPARK-8389] [STREAMING] [KAFKA] Example of getting offset ranges out o…
…f the existing java direct stream api

Author: cody koeninger <cody@koeninger.org>

Closes #6846 from koeninger/SPARK-8389 and squashes the following commits:

3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
2015-06-19 14:51:19 +02:00
zsxwing a06d9c8e76 [SPARK-8404] [STREAMING] [TESTS] Use thread-safe collections to make the tests more reliable
KafkaStreamSuite, DirectKafkaStreamSuite, JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite use non-thread-safe collections to collect data in one thread and check it in another thread. It may fail the tests.

This PR changes them to thread-safe collections.

Note: I cannot reproduce the test failures in my environment. But at least, this PR should make the tests more reliable.

Author: zsxwing <zsxwing@gmail.com>

Closes #6852 from zsxwing/fix-KafkaStreamSuite and squashes the following commits:

d464211 [zsxwing] Use thread-safe collections to make the tests more reliable
2015-06-17 15:00:03 -07:00
cody koeninger b127ff8a0c [SPARK-2808] [STREAMING] [KAFKA] cleanup tests from
see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests

Author: cody koeninger <cody@koeninger.org>

Closes #5921 from koeninger/kafka-0.8.2-test-cleanup and squashes the following commits:

1e89dc8 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
4662828 [cody koeninger] [Streaming][Kafka] filter mima issue for removal of method from private test class
af1e083 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
4298ac2 [cody koeninger] [Streaming][Kafka] update comment to trigger jenkins attempt
1274afb [cody koeninger] [Streaming][Kafka] see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests
2015-06-07 21:42:45 +01:00
Patrick Wendell 2c4d550eda [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
Author: Patrick Wendell <patrick@databricks.com>

Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:

2f42d02 [Patrick Wendell] A few more excludes
4bebcf0 [Patrick Wendell] Update to RC4
61aaf46 [Patrick Wendell] Using new release candidate
55f1610 [Patrick Wendell] Another exclude
04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
2015-06-03 10:11:27 -07:00
Marcelo Vanzin 0071bd8d31 [SPARK-8015] [FLUME] Remove Guava dependency from flume-sink.
The minimal change would be to disable shading of Guava in the module,
and rely on the transitive dependency from other libraries instead. But
since Guava's use is so localized, I think it's better to just not use
it instead, so I replaced that code and removed all traces of Guava from
the module's build.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6555 from vanzin/SPARK-8015 and squashes the following commits:

c0ceea8 [Marcelo Vanzin] Add comments about dependency management.
c38228d [Marcelo Vanzin] Add guava dep in test scope.
b7a0349 [Marcelo Vanzin] Add libthrift exclusion.
6e0942d [Marcelo Vanzin] Add comment in pom.
2d79260 [Marcelo Vanzin] [SPARK-8015] [flume] Remove Guava dependency from flume-sink.
2015-06-02 11:20:33 -07:00
Reynold Xin 564bc11e98 [SPARK-3850] Trim trailing spaces for examples/streaming/yarn.
Author: Reynold Xin <rxin@databricks.com>

Closes #6530 from rxin/trim-whitespace-1 and squashes the following commits:

7b7b3a0 [Reynold Xin] Reset again.
dc14597 [Reynold Xin] Reset scalastyle.
cd556c4 [Reynold Xin] YARN, Kinesis, Flume.
4223fe1 [Reynold Xin] [SPARK-3850] Trim trailing spaces for examples/streaming.
2015-05-31 00:47:56 -07:00
Andrew Or 609c4923f9 [SPARK-7558] Guard against direct uses of FunSuite / FunSuiteLike
This is a follow-up patch to #6441.

Author: Andrew Or <andrew@databricks.com>

Closes #6510 from andrewor14/extends-funsuite-check and squashes the following commits:

6618b46 [Andrew Or] Exempt SparkSinkSuite from the FunSuite check
99d02ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into extends-funsuite-check
48874dd [Andrew Or] Guard against direct uses of FunSuite / FunSuiteLike
2015-05-29 22:57:46 -07:00
Andrew Or a4f24123d8 [HOT FIX] [BUILD] Fix maven build failures
This patch fixes a build break in maven caused by #6441.

Note that this patch reverts the changes in flume-sink because
this module does not currently depend on Spark core, but the
tests require it. There is not an easy way to make this work
because mvn test dependencies are not transitive (MNG-1378).

For now, we will leave the one test suite in flume-sink out
until we figure out a better solution. This patch is mainly
intended to unbreak the maven build.

Author: Andrew Or <andrew@databricks.com>

Closes #6511 from andrewor14/fix-build-mvn and squashes the following commits:

3d53643 [Andrew Or] [HOT FIX #6441] Fix maven build failures
2015-05-29 17:19:46 -07:00
Andrew Or 9eb222c139 [SPARK-7558] Demarcate tests in unit-tests.log
Right now `unit-tests.log` are not of much value because we can't tell where the test boundaries are easily. This patch adds log statements before and after each test to outline the test boundaries, e.g.:

```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====

15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents

...

15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite:

***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****

...
```

Author: Andrew Or <andrew@databricks.com>

Closes #6441 from andrewor14/demarcate-tests and squashes the following commits:

879b060 [Andrew Or] Fix compile after rebase
d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
7790b6c [Andrew Or] Fix tests after logical merge conflict
c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
c43ffc4 [Andrew Or] Fix tests?
8882581 [Andrew Or] Fix tests
ee22cda [Andrew Or] Fix log message
fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
12d1e1b [Andrew Or] Various whitespace changes (minor)
69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
bbce12e [Andrew Or] Fix manual things that cannot be covered through automation
da0b12f [Andrew Or] Add core tests as dependencies in all modules
f7d29ce [Andrew Or] Introduce base abstract class for all test suites
2015-05-29 14:03:12 -07:00
Reynold Xin 97a60cf75d [SPARK-7929] Turn whitespace checker on for more token types.
This is the last batch of changes to complete SPARK-7929.

Previous related PRs:
https://github.com/apache/spark/pull/6480
https://github.com/apache/spark/pull/6478
https://github.com/apache/spark/pull/6477
https://github.com/apache/spark/pull/6476
https://github.com/apache/spark/pull/6475
https://github.com/apache/spark/pull/6474
https://github.com/apache/spark/pull/6473

Author: Reynold Xin <rxin@databricks.com>

Closes #6487 from rxin/whitespace-lint and squashes the following commits:

b33d43d [Reynold Xin] [SPARK-7929] Turn whitespace checker on for more token types.
2015-05-28 23:00:02 -07:00
jerluc 0a7a94eab5 [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners
PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report its errors to the `ReceiverTracker`, which in turn will add the events to the bus to fire off any registered `StreamingListener`s.

Author: jerluc <jeremyalucas@gmail.com>

Closes #6204 from jerluc/master and squashes the following commits:

82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners
2015-05-18 18:13:29 -07:00
Andrew Or b93c97d79b [SPARK-7501] [STREAMING] DAG visualization: show DStream operations
This is similar to #5999, but for streaming. Roughly 200 lines are tests.

One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way.

tdas zsxwing

------------------------
**Before**
<img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/>

--------------------------
**After**
<img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits:

932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
e685df9 [Andrew Or] Rename createRDDWith
84d0656 [Andrew Or] Review feedback
697c086 [Andrew Or] Fix tests
53b9936 [Andrew Or] Set scopes for foreachRDD properly
1881802 [Andrew Or] Refactor DStream scope names again
af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
fd07d22 [Andrew Or] Make MQTT lower case
f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases
fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within
1af0b0e [Andrew Or] Fix style
074c00b [Andrew Or] Review comments
d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
e4a93ac [Andrew Or] Fix tests?
25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
9113183 [Andrew Or] Add tests for DStream scopes
b3806ab [Andrew Or] Fix test
bb80bbb [Andrew Or] Fix MIMA?
5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
5703939 [Andrew Or] Rename operations that create InputDStreams
7c4513d [Andrew Or] Group RDDs by DStream operations and batches
bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
05c2676 [Andrew Or] Wrap many more methods in withScope
c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
65ef3e9 [Andrew Or] Fix NPE
a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations
2015-05-18 14:33:33 -07:00
Hari Shreedharan 61d1e87c0d [SPARK-7356] [STREAMING] Fix flakey tests in FlumePollingStreamSuite using SparkSink's batch CountDownLatch.
This is meant to make the FlumePollingStreamSuite deterministic. Now we basically count the number of batches that have been completed - and then verify the results rather than sleeping for random periods of time.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #5918 from harishreedharan/flume-test-fix and squashes the following commits:

93f24f3 [Hari Shreedharan] Add an eventually block to ensure that all received data is processed. Refactor the dstream creation and remove redundant code.
1108804 [Hari Shreedharan] [SPARK-7356][STREAMING] Fix flakey tests in FlumePollingStreamSuite using SparkSink's batch CountDownLatch.
2015-05-13 16:43:30 -07:00
jerryshao 8436f7e98e [SPARK-7113] [STREAMING] Support input information reporting for Direct Kafka stream
Author: jerryshao <saisai.shao@intel.com>

Closes #5879 from jerryshao/SPARK-7113 and squashes the following commits:

b0b506c [jerryshao] Address the comments
0babe66 [jerryshao] Support input information reporting for Direct Kafka stream
2015-05-05 02:01:06 -07:00
Tathagata Das 8776fe0b93 [HOTFIX] [TEST] Ignoring flaky tests
org.apache.spark.DriverSuite.driver should exit after finishing without cleanup (SPARK-530)
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2267/

org.apache.spark.deploy.SparkSubmitSuite.includes jars passed in through --jars
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/

org.apache.spark.streaming.flume.FlumePollingStreamSuite.flume polling test
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2269/

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5901 from tdas/ignore-flaky-tests and squashes the following commits:

9cd8667 [Tathagata Das] Ignoring tests.
2015-05-05 01:58:51 -07:00
cody koeninger 4786484076 [SPARK-2808][Streaming][Kafka] update kafka to 0.8.2
i don't think this should be merged until after 1.3.0 is final

Author: cody koeninger <cody@koeninger.org>
Author: Helena Edelson <helena.edelson@datastax.com>

Closes #4537 from koeninger/wip-2808-kafka-0.8.2-upgrade and squashes the following commits:

803aa2c [cody koeninger] [SPARK-2808][Streaming][Kafka] code cleanup per TD
e6dfaf6 [cody koeninger] [SPARK-2808][Streaming][Kafka] pointless whitespace change to trigger jenkins again
1770abc [cody koeninger] [SPARK-2808][Streaming][Kafka] make waitUntilLeaderOffset easier to call, call it from python tests as well
d4267e9 [cody koeninger] [SPARK-2808][Streaming][Kafka] fix stderr redirect in python test script
30d991d [cody koeninger] [SPARK-2808][Streaming][Kafka] remove stderr prints since it breaks python 3 syntax
1d896e2 [cody koeninger] [SPARK-2808][Streaming][Kafka] add even even more logging to python test
4c4557f [cody koeninger] [SPARK-2808][Streaming][Kafka] add even more logging to python test
115aeee [cody koeninger] Merge branch 'master' into wip-2808-kafka-0.8.2-upgrade
2712649 [cody koeninger] [SPARK-2808][Streaming][Kafka] add more logging to python test, see why its timing out in jenkins
2b92d3f [cody koeninger] [SPARK-2808][Streaming][Kafka] wait for leader offsets in the java test as well
3824ce3 [cody koeninger] [SPARK-2808][Streaming][Kafka] naming / comments per tdas
61b3464 [cody koeninger] [SPARK-2808][Streaming][Kafka] delay for second send in boundary condition test
af6f3ec [cody koeninger] [SPARK-2808][Streaming][Kafka] delay test until latest leader offset matches expected value
9edab4c [cody koeninger] [SPARK-2808][Streaming][Kafka] more shots in the dark on jenkins failing test
c70ee43 [cody koeninger] [SPARK-2808][Streaming][Kafka] add more asserts to test, try to figure out why it fails on jenkins but not locally
1d10751 [cody koeninger] Merge branch 'master' into wip-2808-kafka-0.8.2-upgrade
ed02d2c [cody koeninger] [SPARK-2808][Streaming][Kafka] move default argument for api version to overloaded method, for binary compat
407382e [cody koeninger] [SPARK-2808][Streaming][Kafka] update kafka to 0.8.2.1
77de6c2 [cody koeninger] Merge branch 'master' into wip-2808-kafka-0.8.2-upgrade
6953429 [cody koeninger] [SPARK-2808][Streaming][Kafka] update kafka to 0.8.2
2e67c66 [Helena Edelson] #SPARK-2808 Update to Kafka 0.8.2.0 GA from beta.
d9dc2bc [Helena Edelson] Merge remote-tracking branch 'upstream/master' into wip-2808-kafka-0.8.2-upgrade
e768164 [Helena Edelson] #2808 update kafka to version 0.8.2
2015-05-01 17:54:56 -07:00
Tathagata Das 1868bd40dc [SPARK-7056] [STREAMING] Make the Write Ahead Log pluggable
Users may want the WAL data to be written to non-HDFS data storage systems. To allow that, we have to make the WAL pluggable. The following design doc outlines the plan.

https://docs.google.com/a/databricks.com/document/d/1A2XaOLRFzvIZSi18i_luNw5Rmm9j2j4AigktXxIYxmY/edit?usp=sharing

Things to add.
* Unit tests for WriteAheadLogUtils

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5645 from tdas/wal-pluggable and squashes the following commits:

2c431fd [Tathagata Das] Minor fixes.
c2bc7384 [Tathagata Das] More changes based on PR comments.
569a416 [Tathagata Das] fixed long line
bde26b1 [Tathagata Das] Renamed segment to record handle everywhere
b65e155 [Tathagata Das] More changes based on PR comments.
d7cd15b [Tathagata Das] Fixed test
1a32a4b [Tathagata Das] Fixed test
e0d19fb [Tathagata Das] Fixed defaults
9310cbf [Tathagata Das] style fix.
86abcb1 [Tathagata Das] Refactored WriteAheadLogUtils, and consolidated all WAL related configuration into it.
84ce469 [Tathagata Das] Added unit test and fixed compilation error.
bce5e75 [Tathagata Das] Fixed long lines.
837c4f5 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into wal-pluggable
754fbf8 [Tathagata Das] Added license and docs.
09bc6fe [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into wal-pluggable
7dd2d4b [Tathagata Das] Added pluggable WriteAheadLog interface, and refactored all code along with it
2015-04-29 13:06:11 -07:00
jerryshao 9e4e82b7bc [SPARK-5946] [STREAMING] Add Python API for direct Kafka stream
Currently only added `createDirectStream` API, I'm not sure if `createRDD` is also needed, since some Java object needs to be wrapped in Python. Please help to review, thanks a lot.

Author: jerryshao <saisai.shao@intel.com>
Author: Saisai Shao <saisai.shao@intel.com>

Closes #4723 from jerryshao/direct-kafka-python-api and squashes the following commits:

a1fe97c [jerryshao] Fix rebase issue
eebf333 [jerryshao] Address the comments
da40f4e [jerryshao] Fix Python 2.6 Syntax error issue
5c0ee85 [jerryshao] Style fix
4aeac18 [jerryshao] Fix bug in example code
7146d86 [jerryshao] Add unit test
bf3bdd6 [jerryshao] Add more APIs and address the comments
f5b3801 [jerryshao] Small style fix
8641835 [Saisai Shao] Rebase and update the code
589c05b [Saisai Shao] Fix the style
d6fcb6a [Saisai Shao] Address the comments
dfda902 [Saisai Shao] Style fix
0f7d168 [Saisai Shao] Add the doc and fix some style issues
67e6880 [Saisai Shao] Fix test bug
917b0db [Saisai Shao] Add Python createRDD API for Kakfa direct stream
c3fc11d [jerryshao] Modify the docs
2c00936 [Saisai Shao] address the comments
3360f44 [jerryshao] Fix code style
e0e0f0d [jerryshao] Code clean and bug fix
338c41f [Saisai Shao] Add python API and example for direct kafka stream
2015-04-27 23:48:02 -07:00
Sean Owen ab5adb7a97 [SPARK-7145] [CORE] commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava

Author: Sean Owen <sowen@cloudera.com>

Closes #5703 from srowen/SPARK-7145 and squashes the following commits:

21fbe03 [Sean Owen] Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
2015-04-27 19:50:55 -04:00
zsxwing 33b85620f9 [SPARK-7052][Core] Add ThreadUtils and move thread methods from Utils to ThreadUtils
As per rxin 's suggestion in https://github.com/apache/spark/pull/5392/files#r28757176

What's more, there is a race condition in the global shared `daemonThreadFactoryBuilder`. `daemonThreadFactoryBuilder` may be modified by multiple threads. This PR removed the global `daemonThreadFactoryBuilder` and created a new `ThreadFactoryBuilder` every time.

Author: zsxwing <zsxwing@gmail.com>

Closes #5631 from zsxwing/thread-utils and squashes the following commits:

9fe5b0e [zsxwing] Add ThreadUtils and move thread methods from Utils to ThreadUtils
2015-04-22 11:08:59 -07:00
jerryshao 8370550593 [Streaming][minor] Remove additional quote and unneeded imports
Author: jerryshao <saisai.shao@intel.com>

Closes #5540 from jerryshao/minor-fix and squashes the following commits:

ebaa646 [jerryshao] Minor fix
2015-04-16 10:39:02 +01:00
cody koeninger 6ac8eea2fc [SPARK-6431][Streaming][Kafka] Error message for partition metadata requ...
...ests

The original reported problem was misdiagnosed; the topic just didn't exist yet.  Agreed upon solution was to improve error handling / message

Author: cody koeninger <cody@koeninger.org>

Closes #5454 from koeninger/spark-6431-master and squashes the following commits:

44300f8 [cody koeninger] [SPARK-6431][Streaming][Kafka] Error message for partition metadata requests
2015-04-12 17:37:30 +01:00
jerryshao 3290d2d13b [SPARK-6211][Streaming] Add Python Kafka API unit test
Refactor the Kafka unit test and add Python API support. CC tdas davies please help to review, thanks a lot.

Author: jerryshao <saisai.shao@intel.com>
Author: Saisai Shao <saisai.shao@intel.com>

Closes #4961 from jerryshao/SPARK-6211 and squashes the following commits:

ee4b919 [jerryshao] Fixed newly merged issue
82c756e [jerryshao] Address the comments
92912d1 [jerryshao] Address the commits
0708bb1 [jerryshao] Fix rebase issue
40b47a3 [Saisai Shao] Style fix
f889657 [Saisai Shao] Update the code according
8a2f3e2 [jerryshao] Address the issues
0f1b7ce [jerryshao] Still fix the bug
61a04f0 [jerryshao] Fix bugs and address the issues
64d9877 [jerryshao] Fix rebase bugs
8ad442f [jerryshao] Add kafka-assembly in run-tests
6020b00 [jerryshao] Add more debug info in Shell
8102d6e [jerryshao] Fix bug in Jenkins test
fde1213 [jerryshao] Code style changes
5536f95 [jerryshao] Refactor the Kafka unit test and add Python Kafka unittest support
2015-04-09 23:14:24 -07:00
WangTaoTheTonic 7d92db342e [SPARK-6758]block the right jetty package in log
https://issues.apache.org/jira/browse/SPARK-6758

I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?).

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits:

e09605b [WangTaoTheTonic] block the right jetty package
2015-04-09 17:44:08 -04:00
Reynold Xin 15e0d2bd13 [SPARK-6765] Fix test code style for streaming.
So we can turn style checker on for test code.

Author: Reynold Xin <rxin@databricks.com>

Closes #5409 from rxin/test-style-streaming and squashes the following commits:

7aea69b [Reynold Xin] [SPARK-6765] Fix test code style for streaming.
2015-04-08 00:24:59 -07:00
Sean Owen 9fe4125219 SPARK-6569 [STREAMING] Down-grade same-offset message in Kafka streaming to INFO
Reduce "is the same as ending offset" message to INFO level per JIRA discussion

Author: Sean Owen <sowen@cloudera.com>

Closes #5366 from srowen/SPARK-6569 and squashes the following commits:

8a5b992 [Sean Owen] Reduce "is the same as ending offset" message to INFO level per JIRA discussion
2015-04-06 10:18:56 +01:00
Reynold Xin 82701ee25f [SPARK-6428] Turn on explicit type checking for public methods.
This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.

Author: Reynold Xin <rxin@databricks.com>

Closes #5342 from rxin/SPARK-6428 and squashes the following commits:

7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
2015-04-03 01:25:02 -07:00
Kousuke Saruta 85cf063682 [SPARK-5559] [Streaming] [Test] Remove oppotunity we met flakiness when running FlumeStreamSuite
When we run FlumeStreamSuite on Jenkins, sometimes we get error like as follows.

    sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 52 times over 10.094849836 seconds. Last failure message: Error connecting to localhost/127.0.0.1:23456.
	    at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
	    at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
	    at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
	    at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
	   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
	   at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:116)
           at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
	   at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply$mcV$sp(FlumeStreamSuite.scala:66)
	    at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
	    at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
	    at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	    at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	    at org.scalatest.Transformer.apply(Transformer.scala:22)
	    at org.scalatest.Transformer.apply(Transformer.scala:20)
    	    at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	    at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
	    at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
	    at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
	   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	    at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	    at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	    at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)

This error is caused by check-then-act logic  when it find free-port .

      /** Find a free port */
      private def findFreePort(): Int = {
        Utils.startServiceOnPort(23456, (trialPort: Int) => {
          val socket = new ServerSocket(trialPort)
          socket.close()
          (null, trialPort)
        }, conf)._2
      }

Removing the check-then-act is not easy but we can reduce the chance of having the error by choosing random value for initial port instead of 23456.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #4337 from sarutak/SPARK-5559 and squashes the following commits:

16f109f [Kousuke Saruta] Added `require` to Utils#startServiceOnPort
c39d8b6 [Kousuke Saruta] Merge branch 'SPARK-5559' of github.com:sarutak/spark into SPARK-5559
1610ba2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
33357e3 [Kousuke Saruta] Changed "findFreePort" method in MQTTStreamSuite and FlumeStreamSuite so that it can choose valid random port
a9029fe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
9489ef9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
8212e42 [Kousuke Saruta] Modified default port used in FlumeStreamSuite from 23456 to random value
2015-03-24 16:20:52 +00:00
Marcelo Vanzin a74564591f [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:

63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
178ba71 [Marcelo Vanzin] Oops.
75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
a45a62c [Marcelo Vanzin] Work around MIMA warning.
1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
cef4603 [Marcelo Vanzin] Indentation.
296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
2015-03-20 18:43:57 +00:00
Sean Owen 6f80c3e888 SPARK-6338 [CORE] Use standard temp dir mechanisms in tests to avoid orphaned temp files
Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify

Author: Sean Owen <sowen@cloudera.com>

Closes #5029 from srowen/SPARK-6338 and squashes the following commits:

27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir
4a212fa [Sean Owen] Standardize a bit more temp dir management
9004081 [Sean Owen] Revert some added recursive-delete calls
57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
2015-03-20 14:16:21 +00:00
Sean Owen 6e94c4eadf SPARK-6225 [CORE] [SQL] [STREAMING] Resolve most build warnings, 1.3.0 edition
Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.

Author: Sean Owen <sowen@cloudera.com>

Closes #4950 from srowen/SPARK-6225 and squashes the following commits:

3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
2015-03-11 13:15:19 +00:00
zzcclp ec30c17822 [SPARK-6279][Streaming]In KafkaRDD.scala, Miss expressions flag "s" at logging string
In KafkaRDD.scala, Miss expressions flag "s" at logging string
In logging file, it print `Beginning offset $
{part.fromOffset}
is the same as ending offset ` but not `Beginning offset 111 is the same as ending offset `.

Author: zzcclp <xm_zzc@sina.com>

Closes #4979 from zzcclp/SPARK-6279 and squashes the following commits:

768f88e [zzcclp] Miss expressions flag "s"
2015-03-11 12:22:24 +00:00
Sean Owen c9cfba0ceb SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11
Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11

Author: Sean Owen <sowen@cloudera.com>

Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:

eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
2015-03-05 11:31:48 -08:00
Saisai Shao 5f7f3b938e [Streaming][Minor] Remove useless type signature of Java Kafka direct stream API
cc tdas .

Author: Saisai Shao <saisai.shao@intel.com>

Closes #4817 from jerryshao/signature-minor-fix and squashes the following commits:

eebfaac [Saisai Shao] Remove useless type parameter
2015-02-27 13:01:42 -08:00
Tathagata Das aa63f633d3 [SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message
The problem with SPARK-6027 in short is that JARs like the kafka-assembly.jar does not work in python as the added JAR is not visible in the classloader used by Py4J. Py4J uses Class.forName(), which does not uses the systemclassloader, but the JARs are only visible in the Thread's contextclassloader. So this back uses the context class loader to create the KafkaUtils dstream object. This works for both cases where the Kafka libraries are added with --jars spark-streaming-kafka-assembly.jar or with --packages spark-streaming-kafka

Also improves the error message.

davies

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4779 from tdas/kafka-python-fix and squashes the following commits:

fb16b04 [Tathagata Das] Removed import
c1fdf35 [Tathagata Das] Fixed long line and improved documentation
7b88be8 [Tathagata Das] Fixed --jar not working for KafkaUtils and improved error message
2015-02-26 13:47:07 -08:00
prabs d51ed263ee [SPARK-5666][streaming][MQTT streaming] some trivial fixes
modified to adhere to accepted coding standards as pointed by tdas in PR #3844

Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabsmails@gmail.com>

Closes #4178 from prabeesh/master and squashes the following commits:

bd2cb49 [Prabeesh K] adress the comment
ccc0765 [prabs] adress the comment
46f9619 [prabs] adress the comment
c035bdc [prabs] adress the comment
22dd7f7 [prabs] address the comments
0cc67bd [prabs] adress the comment
838c38e [prabs] adress the comment
cd57029 [prabs] address the comments
66919a3 [Prabeesh K] changed MqttDefaultFilePersistence to MemoryPersistence
5857989 [prabs] modified to adhere to accepted coding standards
2015-02-25 14:37:35 +00:00
Tathagata Das 922b43b3cc [SPARK-5993][Streaming][Build] Fix assembly jar location of kafka-assembly
Published Kafka-assembly JAR was empty in 1.3.0-RC1
This is because the maven build generated two Jars-
1. an empty JAR file (since kafka-assembly has no code of its own)
2. a assembly JAR file containing everything in a different location as 1
The maven publishing plugin uploaded 1 and not 2.
Instead if 2 is not configure to generate in a different location, there is only 1 jar containing everything, which gets published.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4753 from tdas/SPARK-5993 and squashes the following commits:

c390db8 [Tathagata Das] Fix assembly jar location of kafka-assembly
2015-02-24 19:10:37 -08:00
Sean Owen 34b7c35380 SPARK-4682 [CORE] Consolidate various 'Clock' classes
Another one from JoshRosen 's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, necessary for consolidating the streaming one. I put together implementations in the way that seemed simplest. Almost all the change is standardizing class and method names.

Author: Sean Owen <sowen@cloudera.com>

Closes #4514 from srowen/SPARK-4682 and squashes the following commits:

5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
2015-02-19 15:35:23 -08:00
Tathagata Das 3912d33246 [SPARK-5731][Streaming][Test] Fix incorrect test in DirectKafkaStreamSuite
The test was incorrect. Instead of counting the number of records, it counted the number of partitions of RDD generated by DStream. Which is not its intention. I will be testing this patch multiple times to understand its flakiness.

PS: This was caused by my refactoring in https://github.com/apache/spark/pull/4384/

koeninger check it out.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4597 from tdas/kafka-flaky-test and squashes the following commits:

d236235 [Tathagata Das] Unignored last test.
e9a1820 [Tathagata Das] fix test
2015-02-17 22:44:16 -08:00
Patrick Wendell a51d51ffac SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream
Author: Patrick Wendell <patrick@databricks.com>

Closes #4638 from pwendell/SPARK-5850 and squashes the following commits:

386126f [Patrick Wendell] SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream.
2015-02-16 20:33:33 -08:00
Reynold Xin 378c7eb0d6 [HOTFIX] Ignore DirectKafkaStreamSuite. 2015-02-13 12:43:53 -08:00
Sean Owen da89720bf4 SPARK-5728 [STREAMING] MQTTStreamSuite leaves behind ActiveMQ database files
Use temp dir for ActiveMQ database

Author: Sean Owen <sowen@cloudera.com>

Closes #4517 from srowen/SPARK-5728 and squashes the following commits:

1d3aeb8 [Sean Owen] Use temp dir for ActiveMQ database
2015-02-11 08:13:51 +00:00
cody koeninger 658687b254 [SPARK-4964] [Streaming] refactor createRDD to take leaders via map instead of array
Author: cody koeninger <cody@koeninger.org>

Closes #4511 from koeninger/kafkaRdd-leader-to-broker and squashes the following commits:

f7151d4 [cody koeninger] [SPARK-4964] test refactoring
6f8680b [cody koeninger] [SPARK-4964] add test of the scala api for KafkaUtils.createRDD
f81e016 [cody koeninger] [SPARK-4964] leave KafkaStreamSuite host and port as private
5173f3f [cody koeninger] [SPARK-4964] test the Java variations of createRDD
e9cece4 [cody koeninger] [SPARK-4964] pass leaders as a map to ensure 1 leader per TopicPartition
2015-02-11 00:13:27 -08:00
Tathagata Das c15134632e [SPARK-4964][Streaming][Kafka] More updates to Exactly-once Kafka stream
Changes
- Added example
- Added a critical unit test that verifies that offset ranges can be recovered through checkpoints

Might add more changes.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4384 from tdas/new-kafka-fixes and squashes the following commits:

7c931c3 [Tathagata Das] Small update
3ed9284 [Tathagata Das] updated scala doc
83d0402 [Tathagata Das] Added JavaDirectKafkaWordCount example.
26df23c [Tathagata Das] Updates based on PR comments from Cody
e4abf69 [Tathagata Das] Scala doc improvements and stuff.
bb65232 [Tathagata Das] Fixed test bug and refactored KafkaStreamSuite
50f2b56 [Tathagata Das] Added Java API and added more Scala and Java unit tests. Also updated docs.
e73589c [Tathagata Das] Minor changes.
4986784 [Tathagata Das] Added unit test to kafka offset recovery
6a91cab [Tathagata Das] Added example
2015-02-09 22:45:48 -08:00
Hari Shreedharan 0765af9b21 [SPARK-4905][STREAMING] FlumeStreamSuite fix.
Using String constructor instead of CharsetDecoder to see if it fixes the issue of empty strings in Flume test output.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #4371 from harishreedharan/Flume-stream-attempted-fix and squashes the following commits:

550d363 [Hari Shreedharan] Fix imports.
8695950 [Hari Shreedharan] Use Charsets.UTF_8 instead of "UTF-8" in String constructors.
af3ba14 [Hari Shreedharan] [SPARK-4905][STREAMING] FlumeStreamSuite fix.
2015-02-09 14:17:14 -08:00
Hari Shreedharan f0500f9fa3 [SPARK-4707][STREAMING] Reliable Kafka Receiver can lose data if the blo...
...ck generator fails to store data.

The Reliable Kafka Receiver commits offsets only when events are actually stored, which ensures that on restart we will actually start where we left off. But if the failure happens in the store() call, and the block generator reports an error the receiver does not do anything and will continue reading from the current offset and not the last commit. This means that messages between the last commit and the current offset will be lost.

This PR retries the store call four times and then stops the receiver with an error message and the last exception that was received from the store.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #3655 from harishreedharan/kafka-failure-fix and squashes the following commits:

5e2e7ad [Hari Shreedharan] [SPARK-4704][STREAMING] Reliable Kafka Receiver can lose data if the block generator fails to store data.
2015-02-04 14:20:44 -08:00
cody koeninger b0c0021953 [SPARK-4964] [Streaming] Exactly-once semantics for Kafka
Author: cody koeninger <cody@koeninger.org>

Closes #3798 from koeninger/kafkaRdd and squashes the following commits:

1dc2941 [cody koeninger] [SPARK-4964] silence ConsumerConfig warnings about broker connection props
59e29f6 [cody koeninger] [SPARK-4964] settle on "Direct" as a naming convention for the new stream
8c31855 [cody koeninger] [SPARK-4964] remove HasOffsetRanges interface from return types
0df3ebe [cody koeninger] [SPARK-4964] add comments per pwendell / dibbhatt
8991017 [cody koeninger] [SPARK-4964] formatting
825110f [cody koeninger] [SPARK-4964] rename stuff per TD
4354bce [cody koeninger] [SPARK-4964] per td, remove java interfaces, replace with final classes, corresponding changes to KafkaRDD constructor and checkpointing
9adaa0a [cody koeninger] [SPARK-4964] formatting
0090553 [cody koeninger] [SPARK-4964] javafication of interfaces
9a838c2 [cody koeninger] [SPARK-4964] code cleanup, add more tests
2b340d8 [cody koeninger] [SPARK-4964] refactor per TD feedback
80fd6ae [cody koeninger] [SPARK-4964] Rename createExactlyOnceStream so it isnt over-promising, change doc
99d2eba [cody koeninger] [SPARK-4964] Reduce level of nesting.  If beginning is past end, its actually an error (may happen if Kafka topic was deleted and recreated)
19406cc [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
2e67117 [cody koeninger] [SPARK-4964] one potential way of hiding most of the implementation, while still allowing access to offsets (but not subclassing)
bb80bbe [cody koeninger] [SPARK-4964] scalastyle line length
d4a7cf7 [cody koeninger] [SPARK-4964] allow for use cases that need to override compute for custom kafka dstreams
c1bd6d9 [cody koeninger] [SPARK-4964] use newly available attemptNumber for correct retry behavior
548d529 [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
0458e4e [cody koeninger] [SPARK-4964] recovery of generated rdds from checkpoint
e86317b [cody koeninger] [SPARK-4964] try seed brokers in random order to spread metadata requests
e93eb72 [cody koeninger] [SPARK-4964] refactor to add preferredLocations.  depends on SPARK-4014
356c7cc [cody koeninger] [SPARK-4964] code cleanup per helena
adf99a6 [cody koeninger] [SPARK-4964] fix serialization issues for checkpointing
1d50749 [cody koeninger] [SPARK-4964] code cleanup per tdas
8bfd6c0 [cody koeninger] [SPARK-4964] configure rate limiting via spark.streaming.receiver.maxRate
e09045b [cody koeninger] [SPARK-4964] add foreachPartitionWithIndex, to avoid doing equivalent map + empty foreach boilerplate
cac63ee [cody koeninger] additional testing, fix fencepost error
37d3053 [cody koeninger] make KafkaRDDPartition available to users so offsets can be committed per partition
bcca8a4 [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
6bf14f2 [cody koeninger] first attempt at a Kafka dstream that allows for exactly-once semantics
326ff3c [cody koeninger] add some tests
38bb727 [cody koeninger] give easy access to the parameters of a KafkaRDD
979da25 [cody koeninger] dont allow empty leader offsets to be returned
8d7de4a [cody koeninger] make sure leader offsets can be found even for leaders that arent in the seed brokers
4b078bf [cody koeninger] differentiate between leader and consumer offsets in error message
3c2a96a [cody koeninger] fix scalastyle errors
29c6b43 [cody koeninger] cleanup logging
783b477 [cody koeninger] update tests for kafka 8.1.1
7d050bc [cody koeninger] methods to set consumer offsets and get topic metadata, switch back to inclusive start / exclusive end to match typical kafka consumer behavior
ce91c59 [cody koeninger] method to get consumer offsets, explicit error handling
4dafd1b [cody koeninger] method to get leader offsets, switch rdd bound to being exclusive start, inclusive end to match offsets typically returned from cluster
0b94b33 [cody koeninger] use dropWhile rather than filter to trim beginning of fetch response
1d70625 [cody koeninger] WIP on kafka cluster
76913e2 [cody koeninger] Batch oriented kafka rdd, WIP. todo: cluster metadata / finding leader
2015-02-04 12:06:34 -08:00
Tathagata Das 681f9df47f [SPARK-5153][Streaming][Test] Increased timeout to deal with flaky KafkaStreamSuite
Timeout increased to allow overloaded Jenkins to cope with delay in topic creation.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4342 from tdas/SPARK-5153 and squashes the following commits:

dc42762 [Tathagata Das] Increased timeout to deal with delays in overloaded Jenkins.
2015-02-03 13:46:02 -08:00
Davies Liu 0561c45449 [SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python
This PR brings the Python API for Spark Streaming Kafka data source.

```
    class KafkaUtils(__builtin__.object)
     |  Static methods defined here:
     |
     |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False,
2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
     |      Create an input stream that pulls messages from a Kafka Broker.
     |
     |      :param ssc:  StreamingContext object
     |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
     |      :param groupId:  The group id for this consumer.
     |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
     |                      Each partition is consumed in its own thread.
     |      :param storageLevel:  RDD storage level.
     |      :param keyDecoder:  A function used to decode key
     |      :param valueDecoder:  A function used to decode value
     |      :return: A DStream object
```
run the example:

```
bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
```

Author: Davies Liu <davies@databricks.com>
Author: Tathagata Das <tdas@databricks.com>

Closes #3715 from davies/kafka and squashes the following commits:

d93bfe0 [Davies Liu] Update make-distribution.sh
4280d04 [Davies Liu] address comments
e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
f257071 [Davies Liu] add tests for null in RDD
23b039a [Davies Liu] address comments
9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
a74da87 [Davies Liu] address comments
dc1eed0 [Davies Liu] Update kafka_wordcount.py
31e2317 [Davies Liu] Update kafka_wordcount.py
370ba61 [Davies Liu] Update kafka.py
97386b3 [Davies Liu] address comment
2c567a5 [Davies Liu] update logging and comment
33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
aea8953 [Tathagata Das] Kafka-assembly for Python API
eea16a7 [Davies Liu] refactor
f6ce899 [Davies Liu] add example and fix bugs
98c8d17 [Davies Liu] fix python style
5697a01 [Davies Liu] bypass decoder in scala
048dbe6 [Davies Liu] fix python style
75d485e [Davies Liu] add mqtt
07923c4 [Davies Liu] support kafka in Python
2015-02-02 19:16:27 -08:00
Iulian Dragos e908322cd5 [SPARK-4631][streaming][FIX] Wait for a receiver to start before publishing test data.
This fixes two sources of non-deterministic failures in this test:

- wait for a receiver to be up before pushing data through MQTT
- gracefully handle the case where the MQTT client is overloaded. There’s
a hard-coded limit of 10 in-flight messages, and this test may hit it.
Instead of crashing, we retry sending the message.

Both of these are needed to make the test pass reliably on my machine.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #4270 from dragos/issue/fix-flaky-test-SPARK-4631 and squashes the following commits:

f66c482 [Iulian Dragos] [SPARK-4631][streaming] Wait for a receiver to start before publishing test data.
d408a8e [Iulian Dragos] Install callback before connecting to MQTT broker.
2015-02-02 14:00:33 -08:00
WangTaoTheTonic f7741a9a72 [SPARK-5006][Deploy]spark.port.maxRetries doesn't work
https://issues.apache.org/jira/browse/SPARK-5006

I think the issue is produced in https://github.com/apache/spark/pull/1777.

Not digging mesos's backend yet. Maybe should add same logic either.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #3841 from WangTaoTheTonic/SPARK-5006 and squashes the following commits:

8cdf96d [WangTao] indent thing
2d86d65 [WangTaoTheTonic] fix line length
7cdfd98 [WangTaoTheTonic] fit for new HttpServer constructor
61a370d [WangTaoTheTonic] some minor fixes
bc6e1ec [WangTaoTheTonic] rebase
67bcb46 [WangTaoTheTonic] put conf at 3rd position, modify suite class, add comments
f450cd1 [WangTaoTheTonic] startServiceOnPort will use a SparkConf arg
29b751b [WangTaoTheTonic] rebase as ExecutorRunnableUtil changed to ExecutorRunnable
396c226 [WangTaoTheTonic] make the grammar more like scala
191face [WangTaoTheTonic] invalid value name
62ec336 [WangTaoTheTonic] spark.port.maxRetries doesn't work
2015-01-13 09:29:25 -08:00
GuoQiang Li 8a29dc716e [Minor]Resolve sbt warnings during build (MQTTStreamSuite.scala).
cc andrewor14

Author: GuoQiang Li <witgo@qq.com>

Closes #3989 from witgo/MQTTStreamSuite and squashes the following commits:

a6e967e [GuoQiang Li] Resolve sbt warnings during build (MQTTStreamSuite.scala).
2015-01-10 15:38:43 -08:00
bilna 4e1f12d997 [Minor] Fix import order and other coding style
fixed import order and other coding style

Author: bilna <bilnap@am.amrita.edu>
Author: Bilna P <bilna.p@gmail.com>

Closes #3966 from Bilna/master and squashes the following commits:

5e76f04 [bilna] fix import order and other coding style
5718d66 [bilna] Merge remote-tracking branch 'upstream/master'
ae56514 [bilna] Merge remote-tracking branch 'upstream/master'
acea3a3 [bilna] Adding dependency with scope test
28681fa [bilna] Merge remote-tracking branch 'upstream/master'
fac3904 [bilna] Correction in Indentation and coding style
ed9db4c [bilna] Merge remote-tracking branch 'upstream/master'
4b34ee7 [Bilna P] Update MQTTStreamSuite.scala
04503cf [bilna] Added embedded broker service for mqtt test
89d804e [bilna] Merge remote-tracking branch 'upstream/master'
fc8eb28 [bilna] Merge remote-tracking branch 'upstream/master'
4b58094 [Bilna P] Update MQTTStreamSuite.scala
b1ac4ad [bilna] Added BeforeAndAfter
5f6bfd2 [bilna] Added BeforeAndAfter
e8b6623 [Bilna P] Update MQTTStreamSuite.scala
5ca6691 [Bilna P] Update MQTTStreamSuite.scala
8616495 [bilna] [SPARK-4631] unit test for MQTT
2015-01-09 14:45:28 -08:00
Marcelo Vanzin 48cecf673c [SPARK-4048] Enhance and extend hadoop-provided profile.
This change does a few things to make the hadoop-provided profile more useful:

- Create new profiles for other libraries / services that might be provided by the infrastructure
- Simplify and fix the poms so that the profiles are only activated while building assemblies.
- Fix tests so that they're able to run when the profiles are activated
- Add a new env variable to be used by distributions that use these profiles to provide the runtime
  classpath for Spark jobs and daemons.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:

82eb688 [Marcelo Vanzin] Add a comment.
eb228c0 [Marcelo Vanzin] Fix borked merge.
4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
371ebee [Marcelo Vanzin] Review feedback.
52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
322f882 [Marcelo Vanzin] Fix merge fail.
f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9640503 [Marcelo Vanzin] Cleanup child process log message.
115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
d1399ed [Marcelo Vanzin] Restore jetty dependency.
82a54b9 [Marcelo Vanzin] Remove unused profile.
5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.
2015-01-08 17:15:13 -08:00
Sean Owen 4cba6eb420 SPARK-4159 [CORE] Maven build doesn't run JUnit test suites
This PR:

- Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
- Tells `surefire` to test only Java tests
- Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.

For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.

Author: Sean Owen <sowen@cloudera.com>

Closes #3651 from srowen/SPARK-4159 and squashes the following commits:

2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
2015-01-06 12:02:08 -08:00
bilna e767d7ddac [SPARK-4631] unit test for MQTT
Please review the unit test for MQTT

Author: bilna <bilnap@am.amrita.edu>
Author: Bilna P <bilna.p@gmail.com>

Closes #3844 from Bilna/master and squashes the following commits:

acea3a3 [bilna] Adding dependency with scope test
28681fa [bilna] Merge remote-tracking branch 'upstream/master'
fac3904 [bilna] Correction in Indentation and coding style
ed9db4c [bilna] Merge remote-tracking branch 'upstream/master'
4b34ee7 [Bilna P] Update MQTTStreamSuite.scala
04503cf [bilna] Added embedded broker service for mqtt test
89d804e [bilna] Merge remote-tracking branch 'upstream/master'
fc8eb28 [bilna] Merge remote-tracking branch 'upstream/master'
4b58094 [Bilna P] Update MQTTStreamSuite.scala
b1ac4ad [bilna] Added BeforeAndAfter
5f6bfd2 [bilna] Added BeforeAndAfter
e8b6623 [Bilna P] Update MQTTStreamSuite.scala
5ca6691 [Bilna P] Update MQTTStreamSuite.scala
8616495 [bilna] [SPARK-4631] unit test for MQTT
2015-01-04 19:37:48 -08:00
Josh Rosen 352ed6bbe3 [SPARK-1010] Clean up uses of System.setProperty in unit tests
Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures.

This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself).

For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and to automatically restores them on test completion / failure.  See the block comment at the top of the ResetSystemProperties class for more details.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits:

0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools
3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext
4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties
4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering.
0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite.
7a3d224 [Josh Rosen] Fix trait ordering
3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite
bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite
655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite
3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite
cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite
8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait.
633a84a [Josh Rosen] Remove use of system properties in FileServerSuite
25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite
1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite
dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite
b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite
e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite
5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite
0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite
c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite
51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite
60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite
14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite
628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite
9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite.
4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class.
2014-12-30 18:12:20 -08:00
Peter Klipfel 2a2983f7c5 fixed spelling errors in documentation
changed "form" to "from" in 3 documentation entries for Kafka integration

Author: Peter Klipfel <peter@klipfel.me>

Closes #3691 from peterklipfel/master and squashes the following commits:

0fe7fc5 [Peter Klipfel] fixed spelling errors in documentation
2014-12-14 00:01:16 -08:00
zsxwing bcb5cdad61 [SPARK-3154][STREAMING] Replace ConcurrentHashMap with mutable.HashMap and remove @volatile from 'stopped'
Since `sequenceNumberToProcessor` and `stopped` are both protected by the lock `sequenceNumberToProcessor`, `ConcurrentHashMap` and `volatile` is unnecessary. So this PR updated them accordingly.

Author: zsxwing <zsxwing@gmail.com>

Closes #3634 from zsxwing/SPARK-3154 and squashes the following commits:

0d087ac [zsxwing] Replace ConcurrentHashMap with mutable.HashMap and remove @volatile from 'stopped'
2014-12-08 23:54:15 -08:00
Prabeesh K 5e7a6dcb8f [SPARK-4632] version update
Author: Prabeesh K <prabsmails@gmail.com>

Closes #3495 from prabeesh/master and squashes the following commits:

ab03d50 [Prabeesh K] Update pom.xml
8c6437e [Prabeesh K] Revert
e10b40a [Prabeesh K] version update
dbac9eb [Prabeesh K] Revert
ec0b1c3 [Prabeesh K] [SPARK-4632] version update
a835505 [Prabeesh K] [SPARK-4632] version update
831391b [Prabeesh K]  [SPARK-4632] version update
2014-11-30 20:51:53 -08:00
Prashant Sharma 1c938413ba SPARK-3962 Marked scope as provided for external projects.
Somehow maven shade plugin is set in infinite loop of creating effective pom.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #2959 from ScrapCodes/SPARK-3962/scope-provided and squashes the following commits:

994d1d3 [Prashant Sharma] Fixed failing flume tests
270b4fb [Prashant Sharma] Removed most of the unused code.
bb3bbfd [Prashant Sharma] SPARK-3962 Marked scope as provided for external.
2014-11-19 14:18:10 -08:00
Marcelo Vanzin 397d3aae5b Bumping version to 1.3.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
2014-11-18 21:24:18 -08:00
jerryshao 5930f64bf0 [SPARK-4062][Streaming]Add ReliableKafkaReceiver in Spark Streaming Kafka connector
Add ReliableKafkaReceiver in Kafka connector to prevent data loss if WAL in Spark Streaming is enabled. Details and design doc can be seen in [SPARK-4062](https://issues.apache.org/jira/browse/SPARK-4062).

Author: jerryshao <saisai.shao@intel.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Saisai Shao <saisai.shao@intel.com>

Closes #2991 from jerryshao/kafka-refactor and squashes the following commits:

5461f1c [Saisai Shao] Merge pull request #8 from tdas/kafka-refactor3
eae4ad6 [Tathagata Das] Refectored KafkaStreamSuiteBased to eliminate KafkaTestUtils and made Java more robust.
fab14c7 [Tathagata Das] minor update.
149948b [Tathagata Das] Fixed mistake
14630aa [Tathagata Das] Minor updates.
d9a452c [Tathagata Das] Minor updates.
ec2e95e [Tathagata Das] Removed the receiver's locks and essentially reverted to Saisai's original design.
2a20a01 [jerryshao] Address some comments
9f636b3 [Saisai Shao] Merge pull request #5 from tdas/kafka-refactor
b2b2f84 [Tathagata Das] Refactored Kafka receiver logic and Kafka testsuites
e501b3c [jerryshao] Add Mima excludes
b798535 [jerryshao] Fix the missed issue
e5e21c1 [jerryshao] Change to while loop
ea873e4 [jerryshao] Further address the comments
98f3d07 [jerryshao] Fix comment style
4854ee9 [jerryshao] Address all the comments
96c7a1d [jerryshao] Update the ReliableKafkaReceiver unit test
8135d31 [jerryshao] Fix flaky test
a949741 [jerryshao] Address the comments
16bfe78 [jerryshao] Change the ordering of imports
0894aef [jerryshao] Add some comments
77c3e50 [jerryshao] Code refactor and add some unit tests
dd9aeeb [jerryshao] Initial commit for reliable Kafka receiver
2014-11-14 14:33:37 -08:00
Prashant Sharma daaca14c16 Support cross building for Scala 2.11
Let's give this another go using a version of Hive that shades its JLine dependency.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:

e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
2014-11-11 21:36:48 -08:00
jerryshao c8850a3d6d [SPARK-2492][Streaming] kafkaReceiver minor changes to align with Kafka 0.8
Update the KafkaReceiver's behavior when auto.offset.reset is set.

In Kafka 0.8, `auto.offset.reset` is a hint for out-range offset to seek to the beginning or end of the partition. While in the previous code `auto.offset.reset` is a enforcement to seek to the beginning or end immediately, this is different from Kafka 0.8 defined behavior.

Also deleting extesting ZK metadata in Receiver when multiple consumers are launched will introduce issue as mentioned in [SPARK-2383](https://issues.apache.org/jira/browse/SPARK-2383).

So Here we change to offer user to API to explicitly reset offset before create Kafka stream, while in the meantime keep the same behavior as Kafka 0.8 for parameter `auto.offset.reset`.

@tdas, would you please review this PR? Thanks a lot.

Author: jerryshao <saisai.shao@intel.com>

Closes #1420 from jerryshao/kafka-fix and squashes the following commits:

d6ae94d [jerryshao] Address the comment to remove the resetOffset() function
de3a4c8 [jerryshao] Fix compile error
4a1c3f9 [jerryshao] Doc changes
b2c1430 [jerryshao] Move offset reset to a helper function to let user explicitly delete ZK metadata by calling this API
fac8fd6 [jerryshao] Changes to align with Kafka 0.8
2014-11-11 02:22:23 -08:00
maji2014 f8811a5695 [SPARK-4295][External]Fix exception in SparkSinkSuite
Handle exception in SparkSinkSuite, please refer to [SPARK-4295]

Author: maji2014 <maji3@asiainfo.com>

Closes #3177 from maji2014/spark-4295 and squashes the following commits:

312620a [maji2014] change a new statement for spark-4295
24c3d21 [maji2014] add log4j.properties for SparkSinkSuite and spark-4295
c807bf6 [maji2014] Fix exception in SparkSinkSuite
2014-11-11 02:18:27 -08:00
Aaron Davidson 2ebd1df3f1 [SPARK-4183] Close transport-related resources between SparkContexts
A leak of event loops may be causing test failures.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3053 from aarondav/leak and squashes the following commits:

e676d18 [Aaron Davidson] Typo!
8f96475 [Aaron Davidson] Keep original ssc semantics
7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
2014-11-02 16:26:24 -08:00
Josh Rosen 6c98c29ae0 [SPARK-4080] Only throw IOException from [write|read][Object|External]
If classes implementing Serializable or Externalizable interfaces throw
exceptions other than IOException or ClassNotFoundException from their
(de)serialization methods, then this results in an unhelpful
"IOException: unexpected exception type" rather than the actual exception that
produced the (de)serialization error.

This patch fixes this by adding a utility method that re-wraps any uncaught
exceptions in IOException (unless they are already instances of IOException).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2932 from JoshRosen/SPARK-4080 and squashes the following commits:

cd3a9be [Josh Rosen] [SPARK-4080] Only throw IOException from [write|read][Object|External].
2014-10-24 15:06:15 -07:00
Tathagata Das 4d26aca770 [SPARK-3912][Streaming] Fixed flakyFlumeStreamSuite
@harishreedharan @pwendell
See JIRA for diagnosis of the problem
https://issues.apache.org/jira/browse/SPARK-3912

The solution was to reimplement it.
1. Find a free port (by binding and releasing a server-scoket), and then use that port
2. Remove thread.sleep()s, instead repeatedly try to create a sender and send data and check whether data was sent. Use eventually() to minimize waiting time.
3. Check whether all the data was received, without caring about batches.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2773 from tdas/flume-test-fix and squashes the following commits:

93cd7f6 [Tathagata Das] Reimplimented FlumeStreamSuite to be more robust.
2014-10-13 22:46:49 -07:00
Reynold Xin 3888ee2f38 [SPARK-3748] Log thread name in unit test logs
Thread names are useful for correlating failures.

Author: Reynold Xin <rxin@apache.org>

Closes #2600 from rxin/log4j and squashes the following commits:

83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs
2014-10-01 01:03:49 -07:00
Sean Owen 8764fe368b SPARK-3744 [STREAMING] FlumeStreamSuite will fail during port contention
Since it looked quite easy, I took the liberty of making a quick PR that just uses `Utils.startServiceOnPort` to fix this. It works locally for me.

Author: Sean Owen <sowen@cloudera.com>

Closes #2601 from srowen/SPARK-3744 and squashes the following commits:

ddc9319 [Sean Owen] Avoid port contention in tests by retrying several ports for Flume stream
2014-09-30 15:18:51 -07:00
Hari Shreedharan b235e01363 [SPARK-3686][STREAMING] Wait for sink to commit the channel before check...
...ing for the channel size.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #2531 from harishreedharan/sparksinksuite-fix and squashes the following commits:

30393c1 [Hari Shreedharan] Use more deterministic method to figure out when batches come in.
6ce9d8b [Hari Shreedharan] [SPARK-3686][STREAMING] Wait for sink to commit the channel before checking for the channel size.
2014-09-25 22:56:43 -07:00
jerryshao 74fb2ecf7a [SPARK-3615][Streaming]Fix Kafka unit test hard coded Zookeeper port issue
Details can be seen in [SPARK-3615](https://issues.apache.org/jira/browse/SPARK-3615).

Author: jerryshao <saisai.shao@intel.com>

Closes #2483 from jerryshao/SPARK_3615 and squashes the following commits:

8555563 [jerryshao] Fix Kafka unit test hard coded Zookeeper port issue
2014-09-24 17:18:55 -07:00
GuoQiang Li 607ae39c22 [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
Author: GuoQiang Li <witgo@qq.com>

Closes #2268 from witgo/SPARK-3397 and squashes the following commits:

eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
2014-09-06 15:04:50 -07:00
GuoQiang Li 905861906e [Minor]Remove extra semicolon in FlumeStreamSuite.scala
Author: GuoQiang Li <witgo@qq.com>

Closes #2265 from witgo/FlumeStreamSuite and squashes the following commits:

6c99e6e [GuoQiang Li] Remove extra semicolon in FlumeStreamSuite.scala
2014-09-04 10:28:23 -07:00
Hari Shreedharan 6f671d04fa [SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner.
Currently lot of errors get thrown from Avro IPC layer when the dstream
or sink is shutdown. This PR cleans it up. Some refactoring is done in the
receiver code to put all of the RPC code into a single Try and just recover
from that. The sink code has also been cleaned up.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #2065 from harishreedharan/clean-flume-shutdown and squashes the following commits:

f93a07c [Hari Shreedharan] Formatting fixes.
d7427cc [Hari Shreedharan] More fixes!
a0a8852 [Hari Shreedharan] Fix race condition, hopefully! Minor other changes.
4c9ed02 [Hari Shreedharan] Remove unneeded list in Callback handler. Other misc changes.
8fee36f [Hari Shreedharan] Scala-library is required, else maven build fails. Also catch InterruptedException in TxnProcessor.
445e700 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown
87232e0 [Hari Shreedharan] Refactor Flume Input Stream. Clean up code, better error handling.
9001d26 [Hari Shreedharan] Change log level to debug in TransactionProcessor#shutdown method
e7b8d82 [Hari Shreedharan] Incorporate review feedback
598efa7 [Hari Shreedharan] Clean up some exception handling code
e1027c6 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown
ed608c8 [Hari Shreedharan] [SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner.
2014-08-27 02:39:02 -07:00
Sean Owen cd30db566a SPARK-2798 [BUILD] Correct several small errors in Flume module pom.xml files
(EDIT) Since the scalatest issue was since resolved, this is now about a few small problems in the Flume Sink `pom.xml`

- `scalatest` is not declared as a test-scope dependency
- Its Avro version doesn't match the rest of the build
- Its Flume version is not synced with the other Flume module
- The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version
- It depends on Scala Lang directly, which it shouldn't

Author: Sean Owen <sowen@cloudera.com>

Closes #1726 from srowen/SPARK-2798 and squashes the following commits:

a46e2c6 [Sean Owen] scalatest to test scope, harmonize Avro and Flume versions, remove direct Scala dependency, fix '2.10' in Flume dependency
2014-08-25 13:29:07 -07:00
Tathagata Das 3004074152 [SPARK-3169] Removed dependency on spark streaming test from spark flume sink
Due to maven bug https://jira.codehaus.org/browse/MNG-1378, maven could not resolve spark streaming classes required by the spark-streaming test-jar dependency of external/flume-sink. There is no particular reason that the external/flume-sink has to depend on Spark Streaming at all, so I am eliminating this dependency. Also I have removed the exclusions present in the Flume dependencies, as there is no reason to exclude them (they were excluded in the external/flume module to prevent dependency collisions with Spark).

Since Jenkins will test the sbt build and the unit test, I only tested maven compilation locally.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2101 from tdas/spark-sink-pom-fix and squashes the following commits:

8f42621 [Tathagata Das] Added Flume sink exclusions back, and added netty to test dependencies
93b559f [Tathagata Das] Removed dependency on spark streaming test from spark flume sink
2014-08-22 21:34:48 -07:00
Hari Shreedharan 8c5a222693 [SPARK-3054][STREAMING] Add unit tests for Spark Sink.
This patch adds unit tests for Spark Sink.

It also removes the private[flume] for Spark Sink,
since the sink is instantiated from Flume configuration (looks like this is ignored by reflection which is used by
Flume, but we should still remove it anyway).

Author: Hari Shreedharan <hshreedharan@apache.org>
Author: Hari Shreedharan <hshreedharan@cloudera.com>

Closes #1958 from harishreedharan/spark-sink-test and squashes the following commits:

e3110b9 [Hari Shreedharan] Add a sleep to allow sink to commit the transactions
120b81e [Hari Shreedharan] Fix complexity in threading model in test
4df5be6 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into spark-sink-test
c9190d1 [Hari Shreedharan] Indentation and spaces changes
7fedc5a [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into spark-sink-test
abc20cb [Hari Shreedharan] Minor test changes
7b9b649 [Hari Shreedharan] Merge branch 'master' into spark-sink-test
f2c56c9 [Hari Shreedharan] Update SparkSinkSuite.scala
a24aac8 [Hari Shreedharan] Remove unused var
c86d615 [Hari Shreedharan] [SPARK-3054][STREAMING] Add unit tests for Spark Sink.
2014-08-20 04:09:54 -07:00
Hari Shreedharan 95470a03ae [HOTFIX][STREAMING] Allow the JVM/Netty to decide which port to bind to in Flume Polling Tests.
Author: Hari Shreedharan <harishreedharan@gmail.com>

Closes #1820 from harishreedharan/use-free-ports and squashes the following commits:

b939067 [Hari Shreedharan] Remove unused import.
67856a8 [Hari Shreedharan] Remove findFreePort.
0ea51d1 [Hari Shreedharan] Make some changes to getPort to use map on the serverOpt.
1fb0283 [Hari Shreedharan] Merge branch 'master' of https://github.com/apache/spark into use-free-ports
b351651 [Hari Shreedharan] Allow Netty to choose port, and query it to decide the port to bind to. Leaving findFreePort as is, if other tests want to use it at some point.
e6c9620 [Hari Shreedharan] Making sure the second sink uses the correct port.
11c340d [Hari Shreedharan] Add info about race condition to scaladoc.
e89d135 [Hari Shreedharan] Adding Scaladoc.
6013bb0 [Hari Shreedharan] [STREAMING] Find free ports to use before attempting to create Flume Sink in Flume Polling Suite
2014-08-17 19:50:31 -07:00
Andrew Or c6889d2cb9 [HOTFIX][Streaming] Handle port collisions in flume polling test
This is failing my tests in #1777. @tdas

Author: Andrew Or <andrewor14@gmail.com>

Closes #1803 from andrewor14/fix-flaky-streaming-test and squashes the following commits:

ea11a03 [Andrew Or] Catch all exceptions caused by BindExceptions
54a0ca0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-flaky-streaming-test
664095c [Andrew Or] Tone down bind exception message
af3ddc9 [Andrew Or] Handle port collisions in flume polling test
2014-08-06 16:34:53 -07:00
Tathagata Das ee7f30856b [SPARK-1022][Streaming][HOTFIX] Fixed zookeeper dependency of Kafka
https://github.com/apache/spark/pull/1751 caused maven builds to fail.

```
~/Apache/spark(branch-1.1|✔) ➤ mvn -U -DskipTests clean install
.
.
.
[error] Apache/spark/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala:36: object NIOServerCnxnFactory is not a member of package org.apache.zookeeper.server
[error] import org.apache.zookeeper.server.NIOServerCnxnFactory
[error]        ^
[error] Apache/spark/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala:199: not found: type NIOServerCnxnFactory
[error]     val factory = new NIOServerCnxnFactory()
[error]                       ^
[error] two errors found
[error] Compile failed at Aug 5, 2014 1:42:36 PM [0.503s]
```

The problem is how SBT and Maven resolves multiple versions of the same library, which in this case, is Zookeeper. Observing and comparing the dependency trees from Maven and SBT showed this. Spark depends on ZK 3.4.5 whereas Apache Kafka transitively depends on upon ZK 3.3.4. SBT decides to evict 3.3.4 and use the higher version 3.4.5. But Maven decides to stick to the closest (in the tree) dependent version of 3.3.4. And 3.3.4 does not have NIOServerCnxnFactory.

The solution in this patch excludes zookeeper from the apache-kafka dependency in streaming-kafka module so that it just inherits zookeeper from Spark core.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #1797 from tdas/kafka-zk-fix and squashes the following commits:

94b3931 [Tathagata Das] Fixed zookeeper dependency of Kafka
2014-08-05 23:41:34 -07:00
jerryshao e87075df97 [SPARK-1022][Streaming] Add Kafka real unit test
This PR is a updated version of (https://github.com/apache/spark/pull/557) to actually test sending and receiving data through Kafka, and fix previous flaky issues.

@tdas, would you mind reviewing this PR? Thanks a lot.

Author: jerryshao <saisai.shao@intel.com>

Closes #1751 from jerryshao/kafka-unit-test and squashes the following commits:

b6a505f [jerryshao] code refactor according to comments
5222330 [jerryshao] Change JavaKafkaStreamSuite to better test it
5525f10 [jerryshao] Fix flaky issue of Kafka real unit test
4559310 [jerryshao] Minor changes for Kafka unit test
860f649 [jerryshao] Minor style changes, and tests ignored due to flakiness
796d4ca [jerryshao] Add real Kafka streaming test
2014-08-05 10:40:28 -07:00
Patrick Wendell 44460ba594 HOTFIX: Fix concurrency issue in FlumePollingStreamSuite.
This has been failing on master. One possible cause is that the port
gets contended if multiple test runs happen concurrently and they
hit this test at the same time. Since this test takes a long time
(60 seconds) that's very plausible. This patch randomizes the port
used in this test to avoid contention.
2014-08-02 01:16:13 -07:00
Patrick Wendell 25cad6adf6 HOTFIX: Fixing test error in maven for flume-sink.
We needed to add an explicit dependency on scalatest since this
module will not get it from spark core like others do.
2014-08-02 00:58:33 -07:00
jerryshao a32f0fb73a [SPARK-2103][Streaming] Change to ClassTag for KafkaInputDStream and fix reflection issue
This PR updates previous Manifest for KafkaInputDStream's Decoder to ClassTag, also fix the problem addressed in [SPARK-2103](https://issues.apache.org/jira/browse/SPARK-2103).

Previous Java interface cannot actually get the type of Decoder, so when using this Manifest to reconstruct the decode object will meet reflection exception.

Also for other two Java interfaces, ClassTag[String] is useless because calling Scala API will get the right implicit ClassTag.

Current Kafka unit test cannot actually verify the interface. I've tested these interfaces in my local and distribute settings.

Author: jerryshao <saisai.shao@intel.com>

Closes #1508 from jerryshao/SPARK-2103 and squashes the following commits:

e90c37b [jerryshao] Add Mima excludes
7529810 [jerryshao] Change Manifest to ClassTag for KafkaInputDStream's Decoder and fix Decoder construct issue when using Java API
2014-08-01 04:32:46 -07:00
Sean Owen 6ab96a6fd0 SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
The Maven-based builds in the build matrix have been failing for a few days:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/

On inspection, it looks like the Spark SQL Java tests don't compile:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull

I confirmed it by repeating the command vs master:

`mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`

The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on `com.novocode:junit-interface`

Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).

Author: Sean Owen <srowen@gmail.com>

Closes #1660 from srowen/SPARK-2749 and squashes the following commits:

858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
2014-07-30 15:04:33 -07:00
Hari Shreedharan 800ecff4b1 [STREAMING] SPARK-1729. Make Flume pull data from source, rather than the current pu...
...sh model

Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
receiver fails, it currently has to be restarted on the same node to be able to receive data.

This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
multiple threads for better performance.

Author: Hari Shreedharan <harishreedharan@gmail.com>
Author: Hari Shreedharan <hshreedharan@apache.org>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: harishreedharan <hshreedharan@cloudera.com>

Closes #807 from harishreedharan/master and squashes the following commits:

e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
1f47364 [Hari Shreedharan] Minor fixes.
73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
65b76b4 [Hari Shreedharan] Fixing the unit test.
e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
799509f [Hari Shreedharan] Fix a compile issue.
3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
d248d22 [harishreedharan] Merge pull request #1 from tdas/flume-polling
10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
8c00289 [Hari Shreedharan] More debug messages
393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
205034d [Hari Shreedharan] Merging master in
4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
2014-07-29 11:11:29 -07:00
Cheng Lian a7a9d14479 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-28 12:07:30 -07:00
Patrick Wendell e5bbce9a60 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit f6ff2a61d0.
2014-07-27 18:46:58 -07:00
Cheng Lian f6ff2a61d0 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1600 from liancheng/jdbc and squashes the following commits:

ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-27 13:03:38 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail.  We should investigate and put this back after its passing tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Tathagata Das a45d5480f6 [SPARK-2464][Streaming] Fixed Twitter stream stopping bug
Stopping the Twitter Receiver would call twitter4j's TwitterStream.shutdown, which in turn causes an Exception to be thrown to the listener. This exception caused the Receiver to be restarted. This patch check whether the receiver was stopped or not, and accordingly restarts on exception.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #1577 from tdas/twitter-stop and squashes the following commits:

011b525 [Tathagata Das] Fixed Twitter stream stopping bug.
2014-07-24 15:59:09 -07:00
Sean Owen 1fcd5dcdd8 SPARK-1478.2 Fix incorrect NioServerSocketChannelFactory constructor call
The line break inadvertently means this was interpreted as a call to the no-arg constructor. This doesn't exist in older Netty even. (Also fixed a val name typo.)

Author: Sean Owen <srowen@gmail.com>

Closes #1466 from srowen/SPARK-1478.2 and squashes the following commits:

59c3501 [Sean Owen] Line break caused Scala to interpret NioServerSocketChannelFactory constructor as the no-arg version, which is not even present in some versions of Netty
2014-07-17 12:20:48 -07:00
tmalaska 40a8fef4e6 [SPARK-1478].3: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
This is a modified version of this PR https://github.com/apache/spark/pull/1168 done by @tmalaska
Adds MIMA binary check exclusions.

Author: tmalaska <ted.malaska@cloudera.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #1347 from tdas/FLUME-1915 and squashes the following commits:

96065df [Tathagata Das] Added Mima exclusion for FlumeReceiver.
41d5338 [tmalaska] Address line 57 that was too long
12617e5 [tmalaska] SPARK-1478: Upgrade FlumeInputDStream's Flume...
2014-07-10 13:15:02 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
Patch introduces the new way of working also retaining the existing ways of doing things.

For example build instruction for yarn in maven is
`mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
in sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
Also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Sean Owen 476581e8c8 SPARK-2034. KafkaInputDStream doesn't close resources and may prevent JVM shutdown
Tobias noted today on the mailing list:

========

I am trying to use Spark Streaming with Kafka, which works like a
charm – except for shutdown. When I run my program with "sbt
run-main", sbt will never exit, because there are two non-daemon
threads left that don't die.
I created a minimal example at
<https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-kafkadoesntshutdown-scala>.
It starts a StreamingContext and does nothing more than connecting to
a Kafka server and printing what it receives. Using the `future
Unknown macro: { ... }
` construct, I shut down the StreamingContext after some seconds and
then print the difference between the threads at start time and at end
time. The output can be found at
<https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output1>.
There are a number of threads remaining that will prevent sbt from
exiting.
When I replace `KafkaUtils.createStream(...)` with a call that does
exactly the same, except that it calls `consumerConnector.shutdown()`
in `KafkaReceiver.onStop()` (which it should, IMO), the output is as
shown at <https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output2>.
Does anyone have any idea what is going on here and why the program
doesn't shut down properly? The behavior is the same with both kafka
0.8.0 and 0.8.1.1, by the way.

========

Something similar was noted last year:

http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3C1380220041.2428.YahooMailNeo@web160804.mail.bf1.yahoo.com%3E

KafkaInputDStream doesn't close `ConsumerConnector` in `onStop()`, and does not close the `Executor` it creates. The latter leaves non-daemon threads and can prevent the JVM from shutting down even if streaming is closed properly.

Author: Sean Owen <sowen@cloudera.com>

Closes #980 from srowen/SPARK-2034 and squashes the following commits:

9f31a8d [Sean Owen] Restore ClassTag to private class because MIMA flags it; is the shadowing intended?
2d579a8 [Sean Owen] Close ConsumerConnector in onStop; shutdown() the local Executor that is created so that its threads stop when done; close the Zookeeper client even on exception; fix a few typos; log exceptions that otherwise vanish
2014-06-22 01:12:15 -07:00
joyyoj 2966044307 [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not re...
flume event sent to Spark will fail if the body is too large and numHeaders is greater than zero

Author: joyyoj <sunshch@gmail.com>

Closes #951 from joyyoj/master and squashes the following commits:

f4660c5 [joyyoj] [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly
2014-06-10 17:26:17 -07:00
Takuya UESHIN 7c160293d6 [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits:

e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
2014-06-05 11:27:33 -07:00
David Lemieux 4312cf0bad Spark 1916
The changes could be ported back to 0.9 as well.
Changing in.read to in.readFully to read the whole input stream rather than the first 1020 bytes.
This should ok considering that Flume caps the body size to 32K by default.

Author: David Lemieux <david.lemieux@radialpoint.com>

Closes #865 from lemieud/SPARK-1916 and squashes the following commits:

a265673 [David Lemieux] Updated SparkFlumeEvent to read the whole stream rather than the first X bytes.
(cherry picked from commit 0b769b73fb)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-05-28 15:51:32 -07:00
Prashant Sharma 46324279da Package docs
This is a few changes based on the original patch by @scrapcodes.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #785 from pwendell/package-docs and squashes the following commits:

c32b731 [Patrick Wendell] Changes based on Prashant's patch
c0463d3 [Prashant Sharma] added eof new line
ce8bf73 [Prashant Sharma] Added eof new line to all files.
4c35f2e [Prashant Sharma] SPARK-1563 Add package-info.java and package.scala files for all packages that appear in docs
2014-05-14 22:24:41 -07:00
Tathagata Das 68f28dabe9 Fixed streaming examples docs to use run-example instead of spark-submit
Pretty self-explanatory

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #722 from tdas/example-fix and squashes the following commits:

7839979 [Tathagata Das] Minor changes.
0673441 [Tathagata Das] Fixed java docs of java streaming example
e687123 [Tathagata Das] Fixed scala style errors.
9b8d112 [Tathagata Das] Fixed streaming examples docs to use run-example instead of spark-submit.
2014-05-14 04:17:32 -07:00
Sean Owen 7120a2979d SPARK-1798. Tests should clean up temp files
Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.

Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of former.

The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.

Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.

_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._

Author: Sean Owen <sowen@cloudera.com>

Closes #732 from srowen/SPARK-1798 and squashes the following commits:

5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
2014-05-12 14:16:19 -07:00
Sean Owen 2b7bd29eb6 SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure
TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure.

I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?)

velvia notes:
"I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."

There are at least 3 versions of Netty in play in the build:

- the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
- the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
- but, Spark Core directly uses io.netty:netty-all:4.0.17.Final

The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.

The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.

But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.

If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.

So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:

- Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
- Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
- Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
- Update SBT build accordingly

A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.

Author: Sean Owen <sowen@cloudera.com>

Closes #723 from srowen/SPARK-1789 and squashes the following commits:

43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
2014-05-10 20:50:40 -07:00
witgo 030f2c2126 Improved build configuration
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces)

Author: witgo <witgo@qq.com>

Closes #480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
2014-04-28 22:51:46 -07:00
Mridul Muralidharan 968c0187a1 SPARK-1586 Windows build fixes
Unfortunately, this is not exhaustive - particularly hive tests still fail due to path issues.

Author: Mridul Muralidharan <mridulm80@apache.org>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #505 from mridulm/windows_fixes and squashes the following commits:

ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy appparently
cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
3267f4b [Mridul Muralidharan] Fix build failures
35b277a [Mridul Muralidharan] Fix Scalastyle failures
bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
1337abd [Mridul Muralidharan] fix classpath while running in windows
2014-04-24 20:48:33 -07:00
tmalaska d5c6ae6cc3 SPARK-1584: Upgrade Flume dependency to 1.4.0
Updated the Flume dependency in the maven pom file and the scala build file.

Author: tmalaska <ted.malaska@cloudera.com>

Closes #507 from tmalaska/master and squashes the following commits:

79492c8 [tmalaska] excluded all thrift
159c3f1 [tmalaska] fixed the flume pom file issues
5bf56a7 [tmalaska] Upgrade flume version
2014-04-24 20:31:17 -07:00
Tathagata Das 04c37b6f74 [SPARK-1332] Improve Spark Streaming's Network Receiver and InputDStream API [WIP]
The current Network Receiver API makes it slightly complicated to right a new receiver as one needs to create an instance of BlockGenerator as shown in SocketReceiver
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51

Exposing the BlockGenerator interface has made it harder to improve the receiving process. The API of NetworkReceiver (which was not a very stable API anyways) needs to be change if we are to ensure future stability.

Additionally, the functions like streamingContext.socketStream that create input streams, return DStream objects. That makes it hard to expose functionality (say, rate limits) unique to input dstreams. They should return InputDStream or NetworkInputDStream. This is still not yet implemented.

This PR is blocked on the graceful shutdown PR #247

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #300 from tdas/network-receiver-api and squashes the following commits:

ea27b38 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3a4777c [Tathagata Das] Renamed NetworkInputDStream to ReceiverInputDStream, and ActorReceiver related stuff.
838dd39 [Tathagata Das] Added more events to the StreamingListener to report errors and stopped receivers.
a75c7a6 [Tathagata Das] Address some PR comments and fixed other issues.
91bfa72 [Tathagata Das] Fixed bugs.
8533094 [Tathagata Das] Scala style fixes.
028bde6 [Tathagata Das] Further refactored receiver to allow restarting of a receiver.
43f5290 [Tathagata Das] Made functions that create input streams return InputDStream and NetworkInputDStream, for both Scala and Java.
2c94579 [Tathagata Das] Fixed graceful shutdown by removing interrupts on receiving thread.
9e37a0b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3223e95 [Tathagata Das] Refactored the code that runs the NetworkReceiver into further classes and traits to make them more testable.
a36cc48 [Tathagata Das] Refactored the NetworkReceiver API for future stability.
2014-04-21 19:04:49 -07:00
Sandeep 930b70f052 Remove Unnecessary Whitespace's
stack these together in a commit else they show up chunk by chunk in different commits.

Author: Sandeep <sandeep@techaddict.me>

Closes #380 from techaddict/white_space and squashes the following commits:

b58f294 [Sandeep] Remove Unnecessary Whitespace's
2014-04-10 15:04:13 -07:00
Prashant Sharma d666053679 SPARK-1352 - Comment style single space before ending */ check.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #261 from ScrapCodes/comment-style-check2 and squashes the following commits:

6cde61e [Prashant Sharma] comment style space before ending */ check.
2014-03-30 10:06:56 -07:00
Prashant Sharma 60abc25254 SPARK-1096, a space after comment start style checker.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits:

214135a [Prashant Sharma] Review feedback.
5eba88c [Prashant Sharma] Fixed style checks for ///+ comments.
e54b2f8 [Prashant Sharma] improved message, work around.
83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since scalastyle sbt plugin already depends on the right version. Incase we update the plugin we will have to adjust our spark-style project to depend on right scalastyle version.
810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker.
ba33193 [Prashant Sharma] scala style as a project
2014-03-28 00:21:49 -07:00
Sean Owen 97e4459e1e SPARK-1254. Consolidate, order, and harmonize repository declarations in Maven/SBT builds
This suggestion addresses a few minor suboptimalities with how repositories are handled.

1) Use HTTPS consistently to access repos, instead of HTTP

2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.)

3) Update SBT build to match Maven build in this regard

4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one.

Author: Sean Owen <sowen@cloudera.com>

Closes #145 from srowen/SPARK-1254 and squashes the following commits:

42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
2014-03-15 16:44:34 -07:00
Sandy Ryza a99fb3747a SPARK-1193. Fix indentation in pom.xmls
Author: Sandy Ryza <sandy@cloudera.com>

Closes #91 from sryza/sandy-spark-1193 and squashes the following commits:

a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls
2014-03-07 23:10:35 -08:00
Prashant Sharma 181ec50307 [java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits:

95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch.
85a954e [Prashant Sharma] Nit. import orderings.
673f7ac [Prashant Sharma] Added support for -java-home as well
80a13e8 [Prashant Sharma] Used fake class tag syntax
26eb3f6 [Prashant Sharma] Patrick's comments on PR.
35d8d79 [Prashant Sharma] Specified java 8 building in the docs
31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag.
4ab87d3 [Prashant Sharma] Review feedback on the pr
c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.
2014-03-03 22:31:30 -08:00
Patrick Wendell c3f5e07533 SPARK-1121: Include avro for yarn-alpha builds
This lets us explicitly include Avro based on a profile for 0.23.X
builds. It makes me sad how convoluted it is to express this logic
in Maven. @tgraves and @sryza curious if this works for you.

I'm also considering just reverting to how it was before. The only
real problem was that Spark advertised a dependency on Avro
even though it only really depends transitively on Avro through
other deps.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #49 from pwendell/avro-build-fix and squashes the following commits:

8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile
2014-03-02 15:18:19 -08:00
Sean Owen fd31adbf27 SPARK-1084.2 (resubmitted)
(Ported from https://github.com/apache/incubator-spark/pull/650 )

This adds one more change though, to fix the scala version warning introduced by json4s recently.

Author: Sean Owen <sowen@cloudera.com>

Closes #32 from srowen/SPARK-1084.2 and squashes the following commits:

9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency
1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway
2014-03-02 14:27:53 -08:00
Patrick Wendell 1fd2bfd3dd Remove remaining references to incubation
This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #51 from pwendell/tlp and squashes the following commits:

d553b1b [Patrick Wendell] Remove remaining references to incubation
2014-03-02 01:00:16 -08:00