60 commits

Author SHA1 Message Date
hyukjinkwon f7435bec6a [SPARK-24044][PYTHON] Explicitly print out skipped tests from unittest module
## What changes were proposed in this pull request?

This PR proposes to remove duplicated dependency-checking logic and also to print out skipped tests from the unittest module.

For example, as below:

```
Skipped tests in pyspark.sql.tests with pypy:
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
...

Skipped tests in pyspark.sql.tests with python3:
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
...
```

Currently, skipped tests are not printed out in the console. I think we should print them out there.
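
As a rough illustration of the idea, skipped tests and their reasons can be read from the result object of the standard `unittest` module. This is a minimal sketch, not the code from this PR: `ExampleTests`, `have_pandas`, and `run_and_collect_skipped` are hypothetical names.

```python
import io
import unittest

# Hypothetical flag standing in for a real dependency check.
have_pandas = False


class ExampleTests(unittest.TestCase):
    @unittest.skipIf(not have_pandas, "Pandas must be installed; however, it was not found.")
    def test_needs_pandas(self):
        self.assertTrue(True)

    def test_always_runs(self):
        self.assertEqual(1 + 1, 2)


def run_and_collect_skipped(test_case):
    """Run a TestCase class and return (result, one summary line per skipped test)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    # Send the runner's own report to a buffer so only our summary is printed.
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=2)
    result = runner.run(suite)
    # result.skipped is a list of (test, reason) pairs.
    lines = ["%s ... skipped %r" % (test, reason) for test, reason in result.skipped]
    return result, lines


result, skipped_lines = run_and_collect_skipped(ExampleTests)
print("Skipped tests in %s:" % ExampleTests.__name__)
for line in skipped_lines:
    print("    " + line)
```

The summary is built from `result.skipped` rather than by parsing the runner's text output, so it does not depend on the verbosity level.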

## How was this patch tested?

Manually tested. Also, fortunately, Jenkins has a good environment to test the skipped output.

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21107 from HyukjinKwon/skipped-tests-print.
2018-04-26 15:11:42 -07:00
Liang-Chi Hsieh 8bb0df2c65 [SPARK-24014][PYSPARK] Add onStreamingStarted method to StreamingListener
## What changes were proposed in this pull request?

The `StreamingListener` on the PySpark side seems to lack an `onStreamingStarted` method. This patch adds it and a test for it.

This patch also includes a trivial doc improvement for `createDirectStream`.

Original PR is #21057.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #21098 from viirya/SPARK-24014.
2018-04-19 10:00:57 +08:00
hyukjinkwon 56e8f48a43 [SPARK-23695][PYTHON] Fix the error message for Kinesis streaming tests
## What changes were proposed in this pull request?

This PR proposes to fix the error message for Kinesis in PySpark when its jar is missing but explicitly enabled.

```bash
ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests
```

Before:

```
Skipped test_flume_stream (enable by setting environment variable ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in <module>
    % kinesis_asl_assembly_dir) +
NameError: name 'kinesis_asl_assembly_dir' is not defined
```

After:

```
Skipped test_flume_stream (enable by setting environment variable ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/.../spark/python/pyspark/streaming/tests.py", line 1576, in <module>
    "You need to build Spark with 'build/sbt -Pkinesis-asl "
Exception: Failed to find Spark Streaming Kinesis assembly jar in /.../spark/external/kinesis-asl-assembly. You need to build Spark with 'build/sbt -Pkinesis-asl assembly/package streaming-kinesis-asl-assembly/assembly'or 'build/mvn -Pkinesis-asl package' before running this test.
```

## How was this patch tested?

Manually tested.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20834 from HyukjinKwon/minor-variable.
2018-03-15 10:55:33 -07:00
Bruce Robbins 23ac3aaba4 [SPARK-23417][PYTHON] Fix the build instructions supplied by exception messages in python streaming tests
## What changes were proposed in this pull request?

Fix the build instructions supplied by exception messages in python streaming tests.

I also added -DskipTests to the Maven instructions to avoid the 170 minutes of Scala tests that run each time one wants to add a jar to the assembly directory.

## How was this patch tested?

- clone branch
- run build/sbt package
- run python/run-tests --modules "pyspark-streaming", expect error message
- follow instructions in error message, i.e., run build/sbt assembly/package streaming-kafka-0-8-assembly/assembly
- rerun python tests, expect error message
- follow instructions in error message, i.e., run build/sbt -Pflume assembly/package streaming-flume-assembly/assembly
- rerun python tests, see success.
- repeated all of the above for mvn version of the process.

Author: Bruce Robbins <bersprockets@gmail.com>

Closes #20638 from bersprockets/SPARK-23417_propa.
2018-02-28 09:25:02 +09:00
Sean Owen 0c03297bf0 [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2
## What changes were proposed in this pull request?

Move flume behind a profile, take 2. See https://github.com/apache/spark/pull/19365 for most of the back-story.

This change should fix the problem by removing the examples module dependency and moving Flume examples to the module itself. It also adds deprecation messages, per a discussion on dev about deprecating for 2.3.0.

## How was this patch tested?

Existing tests, which still enable flume integration.

Author: Sean Owen <sowen@cloudera.com>

Closes #19412 from srowen/SPARK-22142.2.
2017-10-06 15:08:28 +01:00
gatorsmile 472864014c Revert "[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile"
This reverts commit a2516f41ae.
2017-09-29 11:45:58 -07:00
Sean Owen a2516f41ae [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile
## What changes were proposed in this pull request?

Add 'flume' profile to enable Flume-related integration modules

## How was this patch tested?

Existing tests; no functional change

Author: Sean Owen <sowen@cloudera.com>

Closes #19365 from srowen/SPARK-22142.
2017-09-29 08:26:53 +01:00
Sean Owen 4fbf748bf8 [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile
## What changes were proposed in this pull request?

Put Kafka 0.8 support behind a kafka-0-8 profile.

## How was this patch tested?

Existing tests, but, until PR builder and Jenkins configs are updated the effect here is to not build or test Kafka 0.8 support at all.

Author: Sean Owen <sowen@cloudera.com>

Closes #19134 from srowen/SPARK-21893.
2017-09-13 10:10:40 +01:00
Shixiong Zhu f9a50ba2d1 [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds
## What changes were proposed in this pull request?

Saw the following failure locally:

```
Traceback (most recent call last):
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
    self._test_func(input, func, expected, sort=True, input2=input2)
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
    self.assertEqual(expected, result)
AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []

First list contains 3 additional elements.
First extra element 0:
[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]

+ []
- [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
-  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
-  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
```

It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120

It's because when the machine is overloaded, the timeout is not enough. This PR just increases the timeout to 30 seconds.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #17597 from zsxwing/SPARK-20285.
2017-04-10 14:06:49 -07:00
Shixiong Zhu 376d782164 [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more stable
## What changes were proposed in this pull request?

Sometimes, CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. Locally, I observed the scheduling delay keep increasing for dozens of seconds.

This PR increases the batch interval from 0.5 seconds to 2 seconds to generate fewer Spark jobs. It should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it will also fail the test.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #17323 from zsxwing/SPARK-19986.
2017-03-17 11:12:23 -07:00
Takeshi YAMAMURO 256a3a8013 [SPARK-18020][STREAMING][KINESIS] Checkpoint SHARD_END to finish reading closed shards
## What changes were proposed in this pull request?
This PR fixes an issue that occurs when resharding Kinesis streams: resharding makes the KCL throw an exception because Spark does not checkpoint `SHARD_END` when it finishes reading closed shards in `KinesisRecordProcessor#shutdown`. This bug eventually stops Spark from subscribing to the new split (or merged) shards.

## How was this patch tested?
Added a test in `KinesisStreamSuite` to check if it works well when splitting/merging shards.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #16213 from maropu/SPARK-18020.
2017-01-25 17:38:48 -08:00
Mariusz Strzelecki 29081b587f [SPARK-16950] [PYSPARK] fromOffsets parameter support in KafkaUtils.createDirectStream for python3
## What changes were proposed in this pull request?

Adds the ability to use KafkaUtils.createDirectStream with starting offsets in Python 3 by using java.lang.Number instead of Long during param mapping in the Scala helper. This allows py4j to pass Integer or Long to the map and resolves ClassCastException problems.

## How was this patch tested?

unit tests

jerryshao  - could you please look at this PR?

Author: Mariusz Strzelecki <mariusz.strzelecki@allegrogroup.com>

Closes #14540 from szczeles/kafka_pyspark.
2016-08-09 09:44:43 -07:00
cody koeninger 89e67d6667 [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions

## How was this patch tested?
Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #12946 from koeninger/SPARK-15085.
2016-05-11 12:15:41 -07:00
Xin Ren 86475520f8 [SPARK-14936][BUILD][TESTS] FlumePollingStreamSuite is slow
https://issues.apache.org/jira/browse/SPARK-14936

## What changes were proposed in this pull request?

FlumePollingStreamSuite contains two tests which run for a minute each. This seems excessively slow and we should speed it up if possible.

In this PR, instead of creating each `StreamingContext` directly from `conf`, an underlying `SparkContext` is created up front and used to create each `StreamingContext`.

Running time is reduced by avoiding the creation and destruction of multiple `SparkContext`s.

## How was this patch tested?

Tested on my local machine running `testOnly *.FlumePollingStreamSuite`

Author: Xin Ren <iamshrek@126.com>

Closes #12845 from keypointt/SPARK-14936.
2016-05-10 15:12:47 -07:00
Marcelo Vanzin 24d7d2e453 [SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packaged in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11796 from vanzin/SPARK-13579.
2016-04-04 16:52:22 -07:00
Shixiong Zhu 24587ce433 [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark
## What changes were proposed in this pull request?

This PR moves flume back to Spark as per the discussion on the dev mailing list.

## How was this patch tested?

Existing Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11895 from zsxwing/move-flume-back.
2016-03-25 17:37:16 -07:00
Shixiong Zhu 06dec37455 [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
## What changes were proposed in this pull request?

Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages

- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter

They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster.

I have already copied these projects to https://github.com/spark-packages

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11672 from zsxwing/remove-external-pkg.
2016-03-14 16:56:04 -07:00
Josh Rosen 07cb323e7a [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
2016-03-14 12:22:02 -07:00
Sean Owen 256704c771 [SPARK-13595][BUILD] Move docker, extras modules into external
## What changes were proposed in this pull request?

Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`

## How was this patch tested?

This is tested with Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11523 from srowen/SPARK-13595.
2016-03-09 18:27:44 +00:00
Shixiong Zhu 335f10edad [SPARK-7997][CORE] Add rpcEnv.awaitTermination() back to SparkEnv
`rpcEnv.awaitTermination()` was not added in #10854 because some Streaming Python tests hung forever.

This patch fixes the hang and adds `rpcEnv.awaitTermination()` back to SparkEnv.

Previously, the Streaming Kafka Python tests shut down the ZooKeeper server before stopping the StreamingContext. When the StreamingContext was then stopped, KafkaReceiver could hang due to https://issues.apache.org/jira/browse/KAFKA-601; as a result, some threads of RpcEnv's Dispatcher could not exit and rpcEnv.awaitTermination hung. The patch just changes the shutdown order to fix it.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11031 from zsxwing/awaitTermination.
2016-02-02 21:13:54 -08:00
Gábor Lipták 9bb35c5b59 [SPARK-11295][PYSPARK] Add packages to JUnit output for Python tests
This is #9263 from gliptak (improving grouping/display of test case results) with a small fix to the bisecting k-means unit test.

Author: Gábor Lipták <gliptak@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10850 from mengxr/SPARK-11295.
2016-01-20 11:11:10 -08:00
Xiangrui Meng beda901422 Revert "[SPARK-11295] Add packages to JUnit output for Python tests"
This reverts commit c6f971b4ae.
2016-01-19 16:51:17 -08:00
Gábor Lipták c6f971b4ae [SPARK-11295] Add packages to JUnit output for Python tests
SPARK-11295 Add packages to JUnit output for Python tests

This improves grouping/display of test case results.

Author: Gábor Lipták <gliptak@gmail.com>

Closes #9263 from gliptak/SPARK-11295.
2016-01-19 14:06:53 -08:00
jerryshao 8d49400921 [SPARK-12353][STREAMING][PYSPARK] Fix countByValue inconsistent output in Python API
The semantics of the Python `countByValue` differ from the Scala API: it behaves more like a countDistinctValue. This change makes it consistent with the Scala/Java API.
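
The difference between the two semantics can be shown on a plain Python list. This is only an illustration of the two behaviours named above, using `collections.Counter` instead of DStreams; the sample `batch` is made up.

```python
from collections import Counter

# Made-up contents of a single batch.
batch = ["a", "b", "a", "c", "a"]

# countByValue semantics (Scala/Java): one (value, occurrences) pair per distinct value.
count_by_value = sorted(Counter(batch).items())

# "countDistinctValue" semantics: just the number of distinct values.
count_distinct = len(set(batch))

print(count_by_value)  # [('a', 3), ('b', 1), ('c', 1)]
print(count_distinct)  # 3
```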

Author: jerryshao <sshao@hortonworks.com>

Closes #10350 from jerryshao/SPARK-12353.
2015-12-28 10:43:23 +00:00
Bryan Cutler 6a6c1fc5c8 [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySpark
Adding the ability to define an initial state RDD for use with updateStateByKey in PySpark. Added a unit test and changed the stateful_network_wordcount example to use an initial RDD.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
2015-12-10 14:21:15 -08:00
Burak Yavuz 302d68de87 [SPARK-12058][STREAMING][KINESIS][TESTS] fix Kinesis python tests
Python tests require access to the `KinesisTestUtils` file. When this file exists under src/test, python can't access it, since it is not available in the assembly jar.

However, if we move KinesisTestUtils to src/main, we need to add the KinesisProducerLibrary as a dependency. In order to avoid this, I moved KinesisTestUtils to src/main, and extended it with ExtendedKinesisTestUtils which is under src/test that adds support for the KPL.

cc zsxwing tdas

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #10050 from brkyvz/kinesis-py.
2015-12-04 12:08:42 -08:00
jerryshao f292018f8e [SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint recovery issue
Fixed a minor race condition in #10017

Closes #10017

Author: jerryshao <sshao@hortonworks.com>
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10074 from zsxwing/review-pr10017.
2015-12-01 15:26:10 -08:00
Shixiong Zhu edb26e7f4e [SPARK-12058][HOTFIX] Disable KinesisStreamTests
KinesisStreamTests in test.py is broken because of #9403. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46896/testReport/(root)/KinesisStreamTests/test_kinesis_stream/

Because the Streaming Python tests didn't work when https://github.com/apache/spark/pull/9403 was merged, the PR build didn't actually report the Python test failure.

This PR just disabled the test to unblock #10039

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10047 from zsxwing/disable-python-kinesis-test.
2015-11-30 16:31:59 -08:00
Shixiong Zhu d29e2ef4cf [SPARK-11935][PYSPARK] Send the Python exceptions in TransformFunction and TransformFunctionSerializer to Java
The Python exception traceback in TransformFunction and TransformFunctionSerializer is not sent back to Java. Py4j just throws a very general exception, which is hard to debug.

This PR adds a `getFailure` method to get the failure message on the Java side.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9922 from zsxwing/SPARK-11935.
2015-11-25 11:47:21 -08:00
Shixiong Zhu be7a2cfd97 [SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer
TransformFunction and TransformFunctionSerializer don't rethrow exceptions, so when any exception happens, they just return None. This causes confusing NPEs.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9847 from zsxwing/pyspark-streaming-exception.
2015-11-20 14:23:01 -08:00
David Tolpin 599a8c6e2b [SPARK-11812][PYSPARK] invFunc=None works properly with python's reduceByKeyAndWindow
invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None, in this context). A local function is never None, so the case of invFunc=None (a common one when inverse reduction is not defined) was treated incorrectly, resulting in loss of data.

In addition, the docstring used the wrong parameter names; this is also fixed.
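
The pitfall can be reproduced in a few lines of plain Python. This is a simplified sketch of the pattern described above, not the actual `reduceByKeyAndWindow` code; `has_inverse_buggy` and `has_inverse_fixed` are hypothetical helpers.

```python
def has_inverse_buggy(invFunc=None):
    def invReduceFunc(a, b):
        return invFunc(a, b)

    # Bug: invReduceFunc is a local function object, which is always truthy,
    # so this reports an inverse function even when invFunc is None.
    return bool(invReduceFunc)


def has_inverse_fixed(invFunc=None):
    def invReduceFunc(a, b):
        return invFunc(a, b)

    # Fix: test the parameter itself, not the wrapper around it.
    return invFunc is not None


print(has_inverse_buggy(None))  # True (wrong)
print(has_inverse_fixed(None))  # False (correct)
```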

Author: David Tolpin <david.tolpin@gmail.com>

Closes #9775 from dtolpin/master.
2015-11-19 13:57:23 -08:00
jerryshao 75a2922910 [SPARK-9065][STREAMING][PYSPARK] Add MessageHandler for Kafka Python API
Fixed the merge conflicts in #7410

Closes #7410

Author: Shixiong Zhu <shixiong@databricks.com>
Author: jerryshao <saisai.shao@intel.com>
Author: jerryshao <sshao@hortonworks.com>

Closes #9742 from zsxwing/pr7410.
2015-11-17 16:57:52 -08:00
Shixiong Zhu 928d631625 [SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch
A checkpoint is written both when a batch is generated and when a batch completes. When the processing time of a batch is greater than the batch interval, the checkpoint for completing an old batch may run after the checkpoint for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file.
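
The naming rule can be sketched as follows. This is an illustrative Python model, not the Scala code from the PR; the `CheckpointNamer` class and the `checkpoint-<time>` file name format are assumptions made for the example.

```python
class CheckpointNamer(object):
    """Hypothetical helper that names each checkpoint file after the latest time seen so far."""

    def __init__(self):
        self.latest_time_ms = None

    def file_name_for(self, checkpoint_time_ms):
        # Only move the remembered time forward, never backward.
        if self.latest_time_ms is None or self.latest_time_ms < checkpoint_time_ms:
            self.latest_time_ms = checkpoint_time_ms
        return "checkpoint-%d" % self.latest_time_ms


namer = CheckpointNamer()
# The checkpoint for generating the batch at 2000 ms is written first ...
first = namer.file_name_for(2000)
# ... and the late checkpoint for completing the batch at 1000 ms is written second.
second = namer.file_name_for(1000)
print(first, second)  # checkpoint-2000 checkpoint-2000
```

Because the late write reuses the newest time, the file that recovery picks as the latest is also the one that was written last.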

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9707 from zsxwing/fix-checkpoint.
2015-11-17 14:48:29 -08:00
Daniel Jalova ace0db4714 [SPARK-6328][PYTHON] Python API for StreamingListener
Author: Daniel Jalova <djalova@us.ibm.com>

Closes #9186 from djalova/SPARK-6328.
2015-11-16 11:29:27 -08:00
Shixiong Zhu ec80c0c2fc [SPARK-11706][STREAMING] Fix the bug that Streaming Python tests cannot report failures
This PR just checks the test results and returns 1 if any test fails, so that `run-tests.py` can mark the run as failed.
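
A minimal sketch of that pattern with the standard `unittest` module is shown below. The helper `exit_code_for` and the two check functions are hypothetical, and the real test script may be structured differently.

```python
import io
import unittest


def passing_check():
    pass


def failing_check():
    raise AssertionError("deliberate failure for illustration")


def exit_code_for(checks):
    """Run the given callables as tests and map the outcome to a process exit code."""
    suite = unittest.TestSuite(unittest.FunctionTestCase(c) for c in checks)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return 0 if result.wasSuccessful() else 1


ok_code = exit_code_for([passing_check])
bad_code = exit_code_for([passing_check, failing_check])
print(ok_code, bad_code)  # 0 1
```

A test script would finish with `sys.exit(code)` so that the caller sees a non-zero status on failure.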

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9669 from zsxwing/streaming-python-tests.
2015-11-13 00:30:27 -08:00
Nick Evans 859dff56eb [SPARK-11378][STREAMING] make StreamingContext.awaitTerminationOrTimeout return properly
This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`.
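
The shape of the bug can be sketched without Spark. `FakeJavaContext`, `BuggyContext`, and `FixedContext` below are hypothetical stand-ins, not the PySpark classes.

```python
class FakeJavaContext(object):
    """Hypothetical stand-in for the JVM-side streaming context."""

    def awaitTerminationOrTimeout(self, timeout_ms):
        return True


class BuggyContext(object):
    def __init__(self, jssc):
        self._jssc = jssc

    def awaitTerminationOrTimeout(self, timeout):
        # Bug: the result is computed but dropped, so callers always get None.
        self._jssc.awaitTerminationOrTimeout(int(timeout * 1000))


class FixedContext(BuggyContext):
    def awaitTerminationOrTimeout(self, timeout):
        # Fix: pass the underlying result back to the caller.
        return self._jssc.awaitTerminationOrTimeout(int(timeout * 1000))


print(BuggyContext(FakeJavaContext()).awaitTerminationOrTimeout(0.1))  # None
print(FixedContext(FakeJavaContext()).awaitTerminationOrTimeout(0.1))  # True
```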

tdas zsxwing

Author: Nick Evans <me@nicolasevans.org>

Closes #9336 from manygrams/fix_await_termination_or_timeout.
2015-11-05 09:18:20 +00:00
Nick Evans 8f888eea1a [SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition from the Kafka Streaming API
jerryshao tdas

I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise.

Instead of doing something like:
```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```

You can do something like:
```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```

Before:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```

After:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```
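
For readers unfamiliar with how such equality is defined in Python, here is a minimal sketch. `TopicAndPartitionSketch` is a hypothetical class, not the one in `pyspark.streaming.kafka`, and the `__hash__` method is an addition for the example.

```python
class TopicAndPartitionSketch(object):
    """Hypothetical value class comparing by topic and partition."""

    def __init__(self, topic, partition):
        self._topic = topic
        self._partition = partition

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self._topic == other._topic and self._partition == other._partition
        return False

    def __ne__(self, other):
        # Needed on Python 2, where != does not fall back to __eq__.
        return not self.__eq__(other)

    def __hash__(self):
        # Keep hashing consistent with equality so instances work in sets and dicts.
        return hash((self._topic, self._partition))


print(TopicAndPartitionSketch("foo", 0) == TopicAndPartitionSketch("foo", 0))  # True
print(TopicAndPartitionSketch("foo", 0) == TopicAndPartitionSketch("foo", 1))  # False
```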

I couldn't find any tests - am I missing something?

Author: Nick Evans <me@nicolasevans.org>

Closes #9236 from manygrams/topic_and_partition_equality.
2015-10-27 01:29:06 -07:00
Gábor Lipták 163d53e829 [SPARK-7021] Add JUnit output for Python unit tests
WIP

Author: Gábor Lipták <gliptak@gmail.com>

Closes #8323 from gliptak/SPARK-7021.
2015-10-22 15:27:11 -07:00
Holden Karau e18b571c33 [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9
Upgrade to Py4j0.9

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
2015-10-20 10:52:49 -07:00
Yanbo Liang 35e8ab9390 [SPARK-10615] [PYSPARK] change assertEquals to assertEqual
Since ```assertEquals``` is deprecated, we need to change ```assertEquals``` to ```assertEqual``` in the existing Python unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8814 from yanboliang/spark-10615.
2015-09-18 09:53:52 -07:00
zsxwing 4e0395ddb7 [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars
This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build.

I ran `mvn -Pkinesis-asl -DskipTests clean install` locally and verified that the jars in my local repository were correct. I also ran the Python tests against the Maven build, and all of them passed.

Author: zsxwing <zsxwing@gmail.com>

Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:

e0b5818 [zsxwing] Fix the sbt build
c697627 [zsxwing] Add the jar pathes to the exception message
be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
2015-08-24 12:38:01 -07:00
Tathagata Das 053d94fcf3 [SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts
The current code only checks for checkpoint files in the local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:
1. Use the same code path as Java to check whether a valid checkpoint exists
2. Create a new Python SparkContext only if there is no active one.

There is no test for this path, as it's hard to test with distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8366 from tdas/SPARK-10142 and squashes the following commits:

3afa666 [Tathagata Das] Added tests
2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists
9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
2015-08-23 19:24:32 -07:00
jerryshao d89cc38b33 [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function
Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122).

tdas , please help to review.

Author: jerryshao <sshao@hortonworks.com>

Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits:

4039b16 [jerryshao] Fix getOffsetRanges in transform() bug
2015-08-21 13:15:35 -07:00
Tathagata Das 5b8bb1b213 [SPARK-9572] [STREAMING] [PYSPARK] Added StreamingContext.getActiveOrCreate() in Python
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8080 from tdas/SPARK-9572 and squashes the following commits:

64a231d [Tathagata Das] Fix based on comments
741a0d0 [Tathagata Das] Fixed style
f4f094c [Tathagata Das] Tweaked test
9afcdbe [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9572
e21488d [Tathagata Das] Minor update
1a371d9 [Tathagata Das] Addressed comments.
60479da [Tathagata Das] Fixed indent
9c2da9c [Tathagata Das] Fixed bugs
b5bd32c [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9572
b55b348 [Tathagata Das] Removed prints
5781728 [Tathagata Das] Fix style issues
b711214 [Tathagata Das] Reverted run-tests.py
643b59d [Tathagata Das] Revert unnecessary change
150e58c [Tathagata Das] Added StreamingContext.getActiveOrCreate() in Python
2015-08-11 12:02:28 -07:00
Tathagata Das 0f90d6055e [SPARK-9640] [STREAMING] [TEST] Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #7961 from tdas/SPARK-9640 and squashes the following commits:

974ce19 [Tathagata Das] Undo changes related to SPARK-9727
004ae26 [Tathagata Das] style fixes
9bbb97d [Tathagata Das] Minor style fies
e6a677e [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9640
ca90719 [Tathagata Das] Removed extra line
ba9cfc7 [Tathagata Das] Improved kinesis test selection logic
88d59bd [Tathagata Das] updated test modules
871fcc8 [Tathagata Das] Fixed SparkBuild
94be631 [Tathagata Das] Fixed style
b858196 [Tathagata Das] Fixed conditions and few other things based on PR comments.
e292e64 [Tathagata Das] Added filters for Kinesis python tests
2015-08-10 23:41:53 -07:00
Prabeesh K 853809e948 [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python
This PR is based on #4229, thanks prabeesh.

Closes #4229

Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>

Closes #7833 from zsxwing/pr4229 and squashes the following commits:

9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
2015-08-10 16:33:23 -07:00
zsxwing 3afc1de89c [SPARK-8564] [STREAMING] Add the Python API for Kinesis
This PR adds the Python API for Kinesis, including a Python example and a simple unit test.

Author: zsxwing <zsxwing@gmail.com>

Closes #6955 from zsxwing/kinesis-python and squashes the following commits:

e42e471 [zsxwing] Merge branch 'master' into kinesis-python
455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
5082d28 [zsxwing] Fix the syntax error for Python 2.6
fca416b [zsxwing] Fix wrong comparison
96670ff [zsxwing] Fix the compilation error after merging master
756a128 [zsxwing] Merge branch 'master' into kinesis-python
6c37395 [zsxwing] Print stack trace for debug
7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
cc9d071 [zsxwing] Fix the python test errors
466b425 [zsxwing] Add python tests for Kinesis
e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
3da2601 [zsxwing] Fix the kinesis folder
687446b [zsxwing] Fix the error message and the maven output path
add2beb [zsxwing] Merge branch 'master' into kinesis-python
4957c0b [zsxwing] Add the Python API for Kinesis
2015-07-31 12:09:48 -07:00
jerryshao 3ccebf36c5 [SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Python
This PR proposes a simple way to expose OffsetRange in Python code. The usage of offsetRanges is similar to the Scala/Java way; in Python we can get the OffsetRange like this:

```
dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
```

The reason I didn't follow the approach suggested in SPARK-8389 is that the Python Kafka API has one more step than Scala/Java: it decodes the message. As a result, the Python API returns a transformed RDD/DStream rather than a directly wrapped JavaKafkaRDD, so it is hard to backtrack to the original RDD to get the offsetRange.
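
To show what an offset range carries, here is a minimal sketch that does not use Spark at all. The `OffsetRange` namedtuple is a stand-in with the four fields of a Kafka offset range (`topic`, `partition`, `fromOffset`, `untilOffset`), and `count_messages` is a hypothetical helper added for illustration. It assumes `fromOffset` is inclusive and `untilOffset` is exclusive.

```python
from collections import namedtuple

# Stand-in for the offset range exposed to Python; not the PySpark class itself.
OffsetRange = namedtuple(
    "OffsetRange", ["topic", "partition", "fromOffset", "untilOffset"])


def count_messages(offset_ranges):
    """Return the number of messages covered by each (topic, partition).

    Hypothetical helper: assumes fromOffset is inclusive and untilOffset
    is exclusive, so each range covers untilOffset - fromOffset messages.
    """
    counts = {}
    for r in offset_ranges:
        key = (r.topic, r.partition)
        counts[key] = counts.get(key, 0) + (r.untilOffset - r.fromOffset)
    return counts


# Illustrative values only; a real batch would supply its own ranges.
ranges = [
    OffsetRange("events", 0, 100, 150),
    OffsetRange("events", 1, 40, 45),
]
message_counts = count_messages(ranges)
```

In a streaming job, a function like this would be called on the ranges obtained inside `foreachRDD`, for example to log how much of each partition a batch consumed.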

Author: jerryshao <saisai.shao@intel.com>

Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:

4c6d320 [jerryshao] Another way to fix subclass deserialization issue
e6a8011 [jerryshao] Address the comments
fd13937 [jerryshao] Fix serialization bug
7debf1c [jerryshao] bug fix
cff3893 [jerryshao] refactor the code according to the comments
2aabf9e [jerryshao] Style fix
848c708 [jerryshao] Add HasOffsetRanges for Python
2015-07-09 13:54:44 -07:00
zsxwing 75b9fe4c5f [SPARK-8378] [STREAMING] Add the Python API for Flume
Author: zsxwing <zsxwing@gmail.com>

Closes #6830 from zsxwing/flume-python and squashes the following commits:

78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
2015-07-01 11:59:24 -07:00
Josh Rosen 40648c56cd [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with dev/run-tests module system
This patch refactors the `python/run-tests` script:

- It's now written in Python instead of Bash.
- The descriptions of the tests to run are now stored in `dev/run-tests`'s modules. This allows the pull request builder to skip Python test suites that were not affected by the pull request's changes. For example, we can now skip the PySpark Streaming test cases when only SQL files are changed.
- `python/run-tests` now supports command-line flags to make it easier to run individual test suites (this addresses SPARK-5482):

  ```
  Usage: run-tests [options]

  Options:
    -h, --help            show this help message and exit
    --python-executables=PYTHON_EXECUTABLES
                          A comma-separated list of Python executables to test
                          against (default: python2.6,python3.4,pypy)
    --modules=MODULES     A comma-separated list of Python modules to test
                          (default: pyspark-core,pyspark-ml,pyspark-mllib
                          ,pyspark-sql,pyspark-streaming)
  ```
- `dev/run-tests` has been split into multiple files: the module definitions and test utility functions are now stored inside of a `dev/sparktestsupport` Python module, allowing them to be re-used from the Python test runner script.
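
The command-line flags above could be parsed roughly as follows. This is a minimal sketch using the standard-library `optparse` module, not the actual `python/run-tests` code: the function names `build_parser` and `parse_args` and the validation against a fixed module list are assumptions made for illustration, and the defaults are copied from the help text.

```python
from optparse import OptionParser

# Defaults copied from the help text in the commit message.
DEFAULT_EXECUTABLES = "python2.6,python3.4,pypy"
DEFAULT_MODULES = ("pyspark-core,pyspark-ml,pyspark-mllib,"
                   "pyspark-sql,pyspark-streaming")


def build_parser():
    """Build a parser exposing the two flags shown in the usage text."""
    parser = OptionParser(prog="run-tests")
    parser.add_option(
        "--python-executables", type="string", default=DEFAULT_EXECUTABLES,
        help="A comma-separated list of Python executables to test "
             "against (default: %default)")
    parser.add_option(
        "--modules", type="string", default=DEFAULT_MODULES,
        help="A comma-separated list of Python modules to test "
             "(default: %default)")
    return parser


def parse_args(argv):
    """Return (executables, modules) as lists of strings.

    Hypothetical validation: module names outside the default list and
    stray positional arguments are rejected with ValueError.
    """
    opts, args = build_parser().parse_args(argv)
    if args:
        raise ValueError("Unsupported arguments: %s" % " ".join(args))
    executables = [e.strip() for e in opts.python_executables.split(",") if e.strip()]
    modules = [m.strip() for m in opts.modules.split(",") if m.strip()]
    known = set(DEFAULT_MODULES.split(","))
    unknown = [m for m in modules if m not in known]
    if unknown:
        raise ValueError("Unknown modules: %s" % ", ".join(unknown))
    return executables, modules


# Example: run only the SQL and Streaming suites with the default interpreters.
executables, modules = parse_args(["--modules=pyspark-sql,pyspark-streaming"])
```

Taking comma-separated strings keeps the flags easy to pass through from another script such as `dev/run-tests`, which can then choose the module list based on the files a pull request changed.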

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6967 from JoshRosen/run-tests-python-modules and squashes the following commits:

f578d6d [Josh Rosen] Fix print for Python 2.x
8233d61 [Josh Rosen] Add python/run-tests.py to Python lint checks
34c98d2 [Josh Rosen] Fix universal_newlines for Python 3
8f65ed0 [Josh Rosen] Fix handling of  module in python/run-tests
37aff00 [Josh Rosen] Python 3 fix
27a389f [Josh Rosen] Skip MLLib tests for PyPy
c364ccf [Josh Rosen] Use which() to convert PYSPARK_PYTHON to an absolute path before shelling out to run tests
568a3fd [Josh Rosen] Fix hashbang
3b852ae [Josh Rosen] Fall back to PYSPARK_PYTHON when sys.executable is None (fixes a test)
f53db55 [Josh Rosen] Remove python2 flag, since the test runner script also works fine under Python 3
9c80469 [Josh Rosen] Fix passing of PYSPARK_PYTHON
d33e525 [Josh Rosen] Merge remote-tracking branch 'origin/master' into run-tests-python-modules
4f8902c [Josh Rosen] Python lint fixes.
8f3244c [Josh Rosen] Use universal_newlines to fix dev/run-tests doctest failures on Python 3.
f542ac5 [Josh Rosen] Fix lint check for Python 3
fff4d09 [Josh Rosen] Add dev/sparktestsupport to pep8 checks
2efd594 [Josh Rosen] Update dev/run-tests to use new Python test runner flags
b2ab027 [Josh Rosen] Add command-line options for running individual suites in python/run-tests
caeb040 [Josh Rosen] Fixes to PySpark test module definitions
d6a77d3 [Josh Rosen] Fix the tests of dev/run-tests
def2d8a [Josh Rosen] Two minor fixes
aec0b8f [Josh Rosen] Actually get the Kafka stuff to run properly
04015b9 [Josh Rosen] First attempt at getting PySpark Kafka test to work in new runner script
4c97136 [Josh Rosen] PYTHONPATH fixes
dcc9c09 [Josh Rosen] Fix time division
32660fc [Josh Rosen] Initial cut at Python test runner refactoring
311c6a9 [Josh Rosen] Move shell utility functions to own module.
1bdeb87 [Josh Rosen] Move module definitions to separate file.
2015-06-27 20:24:34 -07:00