Commit graph

2684 commits

Author SHA1 Message Date
Prashant Sharma 436f3d2856 Ignoring tests for now; contrary to what I assumed, these tests make sense given what they are testing. 2014-01-02 16:08:35 +05:30
Patrick Wendell f8d245bdfc Merge remote-tracking branch 'apache-github/master' into log4j-fix-2
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2014-01-01 16:10:51 -08:00
liguoqiang b5d0b3b0f7 restore core/pom.xml file modification 2014-01-01 11:30:08 +08:00
Reynold Xin 8b8e70ebde Merge pull request #73 from falaki/ApproximateDistinctCount
Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count distinct number of elements and distinct number of values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
2013-12-31 17:48:24 -08:00
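
The accuracy/memory trade-off mentioned in the countApproxDistinct pull request above can be illustrated with a minimal HyperLogLog sketch. This is not stream-lib's implementation or Spark's API: the class name, the `p` parameter (number of index bits) and the SHA-1 based hash are assumptions chosen for illustration. A larger `p` means more registers, which costs more memory and gives lower expected error.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog counter (illustrative only, not stream-lib's code)."""

    def __init__(self, p: int = 12):
        # p index bits -> m = 2**p registers; more registers = more memory,
        # lower expected relative error (roughly 1.04 / sqrt(m)).
        if not 4 <= p <= 16:
            raise ValueError("p must be between 4 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        if self.m == 16:
            self.alpha = 0.673
        elif self.m == 32:
            self.alpha = 0.697
        elif self.m == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item) -> None:
        # Hypothetical hash choice: first 64 bits of SHA-1 over repr(item).
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def relative_error(n: int, p: int) -> float:
    """Relative error of the estimate after adding n distinct integers."""
    hll = HyperLogLog(p)
    for i in range(n):
        hll.add(i)
    return abs(hll.count() - n) / n
```

Calling `relative_error(10000, 12)` and `relative_error(10000, 6)` gives a feel for how the register count affects accuracy; duplicates never change the estimate because each register only keeps a maximum.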
Patrick Wendell 37c43c9dd1 Adding outer checkout when initializing logging 2013-12-31 17:36:56 -08:00
Hossein Falaki bee445c927 Made the code more compact and readable 2013-12-31 16:58:18 -08:00
Hossein Falaki acb0323053 minor improvements 2013-12-31 15:34:26 -08:00
Patrick Wendell 63b411dd86 Merge pull request #238 from ngbinh/upgradeNetty
upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final

the changes are listed at https://github.com/netty/netty/wiki/New-and-noteworthy
2013-12-31 14:31:28 -08:00
Patrick Wendell 55b7e2fdff Merge pull request #289 from tdas/filestream-fix
Bug fixes for file input stream and checkpointing

- Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (for example, listing files failed while a background thread was deleting them).
- Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration.
- Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten.
- Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.
2013-12-31 10:12:51 -08:00
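
The setCheckpointDir() behaviour described in pull request #289 above (always creating a unique subdirectory inside the user-specified directory) can be sketched as follows. This is a local-filesystem illustration only, not Spark's code: the function name `create_checkpoint_dir` and the use of a random UUID as the subdirectory name are assumptions.

```python
import os
import tempfile
import uuid


def create_checkpoint_dir(base_dir: str) -> str:
    """Create and return a fresh, uniquely named subdirectory of base_dir.

    Hypothetical helper: because every call gets its own subdirectory,
    checkpoint files from an earlier run under the same base directory
    are never overwritten.
    """
    # Resolve relative paths up front so later working-directory changes
    # do not affect where checkpoints are written.
    base = os.path.abspath(base_dir)
    path = os.path.join(base, uuid.uuid4().hex)
    # os.makedirs raises FileExistsError if the directory already exists,
    # so an accidental name clash fails loudly instead of reusing old data.
    os.makedirs(path)
    return path


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    first = create_checkpoint_dir(root)
    second = create_checkpoint_dir(root)
    print(first != second)
```

With this design there is no need for a flag such as 'useExisting': the caller only names the parent directory, and reuse of an old checkpoint location cannot happen by accident.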
Tathagata Das fcd17a1e8e Fixed comments and long lines based on comments on PR 289. 2013-12-31 02:01:45 -08:00
Patrick Wendell 4abb0c57ab Tiny typo fix 2013-12-31 00:05:03 -08:00
Patrick Wendell 4d009dcac6 Removing use in test 2013-12-31 00:01:44 -08:00
Patrick Wendell 3c254f2eec Minor fixes 2013-12-30 23:55:33 -08:00
Patrick Wendell 18181e6c41 Removing initLogging entirely 2013-12-30 23:39:47 -08:00
Hossein Falaki d6cded7155 Added Java unit tests for countApproxDistinct and countApproxDistinctByKey 2013-12-30 19:32:05 -08:00
Hossein Falaki c3073b6cf2 Added Java API for countApproxDistinct 2013-12-30 19:31:06 -08:00
Hossein Falaki ed06500d30 Added Java API for countApproxDistinctByKey 2013-12-30 19:30:42 -08:00
Hossein Falaki a7de8e9b1c Renamed countDistinct and countDistinctByKey methods to include Approx 2013-12-30 19:28:03 -08:00
Hossein Falaki d50ccc5ca9 Using origin version 2013-12-30 15:08:34 -08:00
Patrick Wendell 1cbef081e3 Response to Shivaram's review 2013-12-30 12:46:09 -08:00
Patrick Wendell 50e3b8ec4c Merge pull request #308 from kayousterhout/stage_naming
Changed naming of StageCompleted event to be consistent

The rest of the SparkListener events are named with "SparkListener"
as the prefix of the name; this commit renames the StageCompleted
event to SparkListenerStageCompleted for consistency.
2013-12-30 07:44:26 -08:00
Patrick Wendell cffe1c1d5c SPARK-1008: Logging improvements
1. Adds a default log4j file that gets loaded if users haven't specified a log4j file.
2. Isolates use of the tools assembly jar. I found this produced SLF4J warnings
   after building with SBT (and I've seen similar warnings on the mailing list).
2013-12-29 23:14:33 -08:00
Kay Ousterhout c2c1af39f5 Updated code style according to Patrick's comments 2013-12-29 21:10:08 -08:00
Patrick Wendell 7375047d51 Merge pull request #304 from kayousterhout/remove_unused
Removed unused failed and causeOfFailure variables (in TaskSetManager)
2013-12-28 13:25:06 -08:00
Matei Zaharia ad3dfd1531 Merge pull request #307 from kayousterhout/other_failure
Removed unused OtherFailure TaskEndReason.

The OtherFailure TaskEndReason was added by @mateiz 3 years ago in this commit: 24a1e7f838

Unless I am missing something, it doesn't seem to have been used then, and is not used now, so seems safe for deletion.
2013-12-27 22:10:14 -05:00
Kay Ousterhout b4619e509b Changed naming of StageCompleted event to be consistent
The rest of the SparkListener events are named with "SparkListener"
as the prefix of the name; this commit renames the StageCompleted
event to SparkListenerStageCompleted for consistency.
2013-12-27 17:45:20 -08:00
Kay Ousterhout e17d7518ab Removed unused OtherFailure TaskEndReason. 2013-12-27 15:51:27 -08:00
Kay Ousterhout 8419148e5f Remove unused hasPendingTasks methods 2013-12-27 15:19:42 -08:00
Kay Ousterhout 0c71ffe924 Style fixes as per Reynold's review 2013-12-27 12:19:38 -08:00
Kay Ousterhout 8c81068e16 Fixed >100char lines in DAGScheduler.scala 2013-12-27 11:36:54 -08:00
Binh Nguyen 2c5bade4ee Fix failed unit tests
Also clean up a bit.
2013-12-27 11:24:30 -08:00
Kay Ousterhout baaabcedc9 Removed unused failed and causeOfFailure variables 2013-12-27 11:12:36 -08:00
Reynold Xin 7be1e57786 Merge pull request #298 from aarondav/minor
Minor: Decrease margin of left side of Log page

Before
![before](https://f.cloud.github.com/assets/1400247/1812647/1a4be53e-6e87-11e3-9d5b-f851274be0e9.png)

After
![after](https://f.cloud.github.com/assets/1400247/1812648/1ca1ea2c-6e87-11e3-946c-31be9258f450.png)

It's a start anyway...
2013-12-26 23:41:40 -10:00
Aaron Davidson 4f2fb761b0 Decrease margin of left side of log page 2013-12-26 15:38:45 -08:00
Mark Hamstra c529dceaff Avoid a lump of coal (NPE) in JobProgressListener's stocking. 2013-12-25 23:10:02 -08:00
Patrick Wendell 85a344b4f0 Merge pull request #127 from kayousterhout/consolidate_schedulers
Deduplicate Local and Cluster schedulers.

The code in LocalScheduler/LocalTaskSetManager was nearly identical
to the code in ClusterScheduler/ClusterTaskSetManager. The redundancy
made updating the schedulers unnecessarily painful and error-prone.
This commit combines the two into a single
TaskScheduler/TaskSetManager.

Unfortunately the diff makes this change look much more invasive than it is -- TaskScheduler.scala is only superficially changed (names updated, overrides removed) from the old ClusterScheduler.scala, and the same with
TaskSetManager.scala.

Thanks @rxin for suggesting this change!
2013-12-24 16:35:06 -08:00
Binh Nguyen 786f393a98 Fix imports order 2013-12-24 14:59:30 -08:00
Binh Nguyen 9115a5de62 Remove import * and fix some formatting 2013-12-24 14:59:30 -08:00
Binh Nguyen 040dd3ecd5 upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final 2013-12-24 14:58:18 -08:00
Patrick Wendell c2dd6bcd6e Merge pull request #279 from aarondav/shuffle-cleanup0
Clean up shuffle files once their metadata is gone

Previously, we would only clean the in-memory metadata for consolidated shuffle files.

Additionally, fixes a bug where the Metadata Cleaner was ignoring type-specific TTLs.
2013-12-24 14:36:47 -08:00
Kay Ousterhout 1efe3adf56 Responded to Reynold's style comments 2013-12-24 14:18:39 -08:00
Tathagata Das d4dfab503a Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. 2013-12-24 14:01:13 -08:00
Tathagata Das 9f79fd89dc Merge branch 'apache-master' into filestream-fix 2013-12-24 11:38:17 -08:00
Matei Zaharia 23a9ae6be3 Merge pull request #277 from tdas/scheduler-update
Refactored the streaming scheduler and added StreamingListener interface

- Refactored the streaming scheduler for cleaner code. Specifically, the JobManager was renamed to JobScheduler, as it does the actual scheduling of Spark jobs to the SparkContext. The earlier Scheduler was renamed to JobGenerator, as it actually generates the jobs from the DStreams. The JobScheduler starts the JobGenerator. Also, moved all the scheduler related code from spark.streaming to spark.streaming.scheduler package.
- Implemented the StreamingListener interface, similar to SparkListener. The streaming version of StatusReportListener prints the batch processing time statistics (for now). Added StreamingListenerSuite to test it.
- Refactored streaming TestSuiteBase for deduping code in the other streaming testsuites.
2013-12-24 00:08:48 -05:00
Reynold Xin 11107c9de5 Merge pull request #244 from leftnoteasy/master
Added SPARK-968 implementation for review
2013-12-23 10:38:20 -08:00
wangda.tan 2f689ba97b SPARK-968: added executor address to the aggregated metrics by executor table 2013-12-23 15:03:45 +08:00
Kay Ousterhout b7bfae1afe Correctly merged in maxTaskFailures fix 2013-12-22 07:34:44 -08:00
wangda.tan c979eecdf6 added changes according to comments from rxin 2013-12-22 21:43:15 +08:00
Kay Ousterhout b8ae096a40 Fix build error in test 2013-12-21 23:28:48 -08:00
Kay Ousterhout 30186aa264 Renamed ClusterScheduler to TaskSchedulerImpl 2013-12-20 14:58:04 -08:00