Commit graph

4919 commits

Author SHA1 Message Date
Hossein Falaki ed06500d30 Added Java API for countApproxDistinctByKey 2013-12-30 19:30:42 -08:00
Hossein Falaki b75d7c98bc Added stream 2.5.1 jar depenency 2013-12-30 19:29:17 -08:00
Hossein Falaki a7de8e9b1c Renamed countDistinct and countDistinctByKey methods to include Approx 2013-12-30 19:28:03 -08:00
Hossein Falaki d50ccc5ca9 Using origin version 2013-12-30 15:08:34 -08:00
Reynold Xin d63856c361 Merge pull request #286 from rxin/build
Show full stack trace and time taken in unit tests.
2013-12-23 22:07:26 -08:00
Reynold Xin fc80b2e693 Show full stack trace and time taken in unit tests. 2013-12-23 21:20:20 -08:00
Matei Zaharia 23a9ae6be3 Merge pull request #277 from tdas/scheduler-update
Refactored the streaming scheduler and added StreamingListener interface

- Refactored the streaming scheduler for cleaner code. Specifically, the JobManager was renamed to JobScheduler, as it does the actual scheduling of Spark jobs to the SparkContext. The earlier Scheduler was renamed to JobGenerator, as it actually generates the jobs from the DStreams. The JobScheduler starts the JobGenerator. Also, moved all the scheduler related code from spark.streaming to spark.streaming.scheduler package.
- Implemented the StreamingListener interface, similar to SparkListener. The streaming version of StatusReportListener prints the batch processing time statistics (for now). Added StreamingListernerSuite to test it.
- Refactored streaming TestSuiteBase for deduping code in the other streaming testsuites.
2013-12-24 00:08:48 -05:00
Tathagata Das 6eaa050549 Minor change for PR 277. 2013-12-23 15:55:45 -08:00
Tathagata Das f9771690a6 Minor formatting fixes. 2013-12-23 11:32:26 -08:00
Tathagata Das dc3ee6b612 Added comments to BatchInfo and JobSet, based on Patrick's comment on PR 277. 2013-12-23 11:30:42 -08:00
Reynold Xin 11107c9de5 Merge pull request #244 from leftnoteasy/master
Added SPARK-968 implementation for review

Added SPARK-968 implementation for review
2013-12-23 10:38:20 -08:00
wangda.tan 2f689ba97b SPARK-968, added executor address showing in aggregated metrics by executors table 2013-12-23 15:03:45 +08:00
wangda.tan c979eecdf6 added changes according to comments from rxin 2013-12-22 21:43:15 +08:00
Tathagata Das 3ddbdbfbc7 Minor updated based on comments on PR 277. 2013-12-20 19:51:37 -08:00
Patrick Wendell 0bc57c5767 Merge pull request #280 from aarondav/minor
Minor cleanup for standalone scheduler

See commit messages
2013-12-20 11:56:54 -08:00
Patrick Wendell eca68d4425 Merge pull request #272 from tmyklebu/master
Track and report task result serialisation time.

 - DirectTaskResult now has a ByteBuffer valueBytes instead of a T value.
 - DirectTaskResult now has a member function T value() that deserialises valueBytes.
 - Executor serialises value into a ByteBuffer and passes it to DTR's ctor.
 - Executor tracks the time taken to do so and puts it in a new field in TaskMetrics.
 - StagePage now reports serialisation time from TaskMetrics along with the other things it reported.
2013-12-19 18:12:22 -08:00
Aaron Davidson 6613ab663d Fix compiler warning in SparkZooKeeperSession 2013-12-19 17:56:13 -08:00
Aaron Davidson 4d74b899b7 Remove firstApp from the standalone scheduler Master
As a lonely child with no one to care for it... we had to put it down.
2013-12-19 17:53:41 -08:00
Aaron Davidson 1ab031eaff Extraordinarily minor code/comment cleanup 2013-12-19 17:51:29 -08:00
Reynold Xin 7990c56375 Merge pull request #276 from shivaram/collectPartition
Add collectPartition to JavaRDD interface.

This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py.

Thanks @concretevitamin for the original change and tests.
2013-12-19 13:35:09 -08:00
Shivaram Venkataraman 9cc3a6d3c0 Add comment explaining collectPartitions's use 2013-12-19 11:49:17 -08:00
Shivaram Venkataraman d3234f9726 Make collectPartitions take an array of partitions
Change the implementation to use runJob instead of PartitionPruningRDD.
Also update the unit tests and the python take implementation
to use the new interface.
2013-12-19 11:40:34 -08:00
Matei Zaharia 440e531a5e Merge pull request #278 from MLnick/java-python-tostring
Add toString to Java RDD, and __repr__ to Python RDD

Addresses [SPARK-992](https://spark-project.atlassian.net/browse/SPARK-992)
2013-12-19 10:38:56 -08:00
Nick Pentreath a76f53416c Add toString to Java RDD, and __repr__ to Python RDD 2013-12-19 14:38:20 +02:00
Reynold Xin d8d3f3e60d Merge pull request #183 from aarondav/spark-959
[SPARK-959] Explicitly depend on org.eclipse.jetty.orbit jar

Without this, in some cases, Ivy attempts to download the wrong file and fails, stopping the whole build. See [bug](https://spark-project.atlassian.net/browse/SPARK-959) for more details.

Note that this may not be the best solution, as I do not understand the root cause of why this only happens for some people. However, it is reported to work.
2013-12-19 00:06:43 -08:00
Tathagata Das ec71b445ad Minor changes. 2013-12-18 23:39:28 -08:00
Aaron Davidson eaf6a269b1 [SPARK-959] Explicitly depend on org.eclipse.jetty.orbit jar
Without this, in some cases, Ivy attempts to download the wrong file
and fails, stopping the whole build. See bug for more details.

(This is probably also the beginning of the slow death of our
recently prettified dependencies. Form follow function.)
2013-12-18 23:37:31 -08:00
Reynold Xin bfba5323be Merge pull request #247 from aarondav/minor
Increase spark.akka.askTimeout default to 30 seconds

In experimental clusters we've observed that a 10 second timeout was insufficient, despite having a low number of nodes and relatively small workload (16 nodes, <1.5 TB data). This would cause an entire job to fail at the beginning of the reduce phase.
There is no particular reason for this value to be small as a timeout should only occur in an exceptional situation.

Also centralized the reading of spark.akka.askTimeout to AkkaUtils (surely this can later be cleaned up to use Typesafe).

Finally, deleted some lurking implicits. If anyone can think of a reason they should still be there, please let me know.
2013-12-18 22:22:21 -08:00
Aaron Davidson 293a0af5a1 In experimental clusters we've observed that a 10 second timeout was insufficient,
despite having a low number of nodes and relatively small workload (16 nodes, <1.5 TB data).
This would cause an entire job to fail at the beginning of the reduce phase.
There is no particular reason for this value to be small as a timeout should only occur
in an exceptional situation.

Also centralized the reading of spark.akka.askTimeout to AkkaUtils (surely this can later
be cleaned up to use Typesafe).

Finally, deleted some lurking implicits. If anyone can think of a reason they should still
be there, please let me know.
2013-12-18 21:42:29 -08:00
Tathagata Das e93b391d75 Merge branch 'apache-master' into scheduler-update
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala
2013-12-18 17:51:14 -08:00
Reynold Xin c64a53a48b Merge pull request #267 from JoshRosen/cygwin
Fix Cygwin support in several scripts.

This allows the spark-shell, spark-class, run-example, make-distribution.sh,
and ./bin/start-* scripts to work under Cygwin.  Note that this doesn't
support PySpark under Cygwin, since that requires many additional `cygpath`
calls from within Python and will be non-trivial to implement.

This PR was inspired by, and subsumes, #253 (so close #253 after this is merged).
2013-12-18 16:56:26 -08:00
Tathagata Das b80ec05635 Added StatsReportListener to generate processing time statistics across multiple batches. 2013-12-18 15:35:24 -08:00
Reynold Xin 5ea187277c Merge pull request #274 from azuryy/master
Fixed the example link in the Scala programing guid.

The old link cannot access, I changed to the new one.
2013-12-18 15:27:24 -08:00
Shivaram Venkataraman af0cd6bd27 Add collectPartition to JavaRDD interface.
Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
2013-12-18 11:40:07 -08:00
Tor Myklebust d3b1af4b6c Add a serialisation time column to the StagePage. 2013-12-18 14:25:56 -05:00
fengdong ad8ce0148a changed the example links in the scala-programming-guid 2013-12-18 19:03:32 +08:00
Reynold Xin f4effb375e Merge pull request #273 from rxin/top
Fixed a performance problem in RDD.top and BoundedPriorityQueue

BoundedPriority was actually traversing the entire queue to calculate the size, resulting in bad performance in insertion.

This should also cherry pick cleanly into branch-0.8.
2013-12-17 22:26:21 -08:00
Tor Myklebust 717c7fddb2 objectSer -> valueSer in a test. 2013-12-17 23:02:21 -05:00
fengdong ddebaf8280 Fixed the example link. 2013-12-18 11:00:36 +08:00
Reynold Xin 9a6864d016 Fixed a performance problem in RDD.top and BoundedPriorityQueue (size in BoundedPriority was actually traversing the entire queue to calculate the size, resulting in bad performance in insertion). 2013-12-17 18:44:39 -08:00
wangda.tan 59e53fa21c spark-968, changes for avoid a NPE 2013-12-17 17:57:27 +08:00
wangda.tan 36060f4f50 spark-898, changes according to review comments 2013-12-17 17:55:38 +08:00
Patrick Wendell 7a8169be9a Merge pull request #268 from pwendell/shaded-protobuf
Add support for 2.2. to master (via shaded jars)

This patch does a few related things. NOTE: This may not compile correctly for ~24 hours until artifacts fully propagate to Maven Central.

1. Uses shaded versions of akka/protobuf. For more information on how these versions were prepared, see [1].

2. Brings the `new-yarn` project up-to-date with the changes for Akka 2.2.3.

3. Some clean-up of the build now that we don't have to switch akka groups for different YARN versions.

[1]
933a309ef8/shaded-protobuf
2013-12-16 22:42:21 -08:00
Patrick Wendell 10c0ffa1eb One other fix 2013-12-16 22:10:55 -08:00
Patrick Wendell c1c0f8099f Clean-up 2013-12-16 22:01:27 -08:00
Patrick Wendell c1fec89895 Cleanup 2013-12-16 21:56:21 -08:00
Patrick Wendell 24f8220dc8 Removing extra code in new yarn 2013-12-16 21:53:51 -08:00
Patrick Wendell ceb013f8b9 Remove trailing slashes from repository specifications.
The correct format is to not have a trailing slash.

For me this caused non-deterministic failures due to issues fetching
certain artifacts. The issue was that some of the maven caches would
fail to fetch the artifact (due to the way that the artifact
path was concatenated with the repository) and this short-circuited
the download process in a silent way. Here is what the log output
looked like:

    Downloading: http://repo.maven.apache.org/maven2/org/spark-project/akka/akka-remote_2.10/2.2.3-shaded-protobuf/akka-remote_2.10-2.2.3-shaded-protobuf.pom
    [WARNING] The POM for org.spark-project.akka:akka-remote_2.10🫙2.2.3-shaded-protobuf is missing, no dependency information available

This was pretty brutal to debug since there was no error message
anywhere and the path *looks* correct as reported by the Maven log.
2013-12-16 21:53:51 -08:00
Patrick Wendell c6f95e603e Attempt with extra repositories 2013-12-16 21:53:51 -08:00
Tor Myklebust b2f0329511 Missed a spot; had an objectSer here too. 2013-12-17 00:18:46 -05:00