Commit graph

4918 commits

Author SHA1 Message Date
Reynold Xin caba162861 Added join and aggregateUsingIndex to VertexPartition. 2013-11-26 21:02:39 -08:00
Reynold Xin 2d19d0381b Merge branch 'simplify' into clean 2013-11-26 13:55:26 -08:00
Reynold Xin d58bfa8573 Code cleaning to improve readability. 2013-11-26 13:54:46 -08:00
Reynold Xin d074e4c6ab Bring PrimitiveVector up to date. 2013-11-26 02:49:41 -08:00
Reynold Xin 088995f917 Merge pull request #77 from amplab/upgrade
Sync with Spark master
2013-11-25 00:57:51 -08:00
Reynold Xin 6bcac986b2 Merge branch 'master' of github.com:apache/incubator-spark 2013-11-25 15:47:47 +08:00
Reynold Xin 62889c419c Merge pull request #203 from witgo/master
Fix Maven build for metrics-graphite
2013-11-25 11:27:45 +08:00
LiGuoqiang 989203604e Fix Maven build for metrics-graphite 2013-11-25 11:23:11 +08:00
Ankur Dave 6af03edcf1 Merge pull request #76 from dcrankshaw/fix_partitioners
Actually use partitioner command line args in Analytics.
2013-11-24 16:42:37 -08:00
Dan Crankshaw 4b6b15dadd Actually use partitioner command line args in Analytics. 2013-11-24 16:38:38 -08:00
Matei Zaharia 859d62dc2a Merge pull request #151 from russellcardullo/add-graphite-sink
Add graphite sink for metrics

This adds a metrics sink for graphite.  The sink must
be configured with the host and port of a graphite node
and optionally may be configured with a prefix that will
be prepended to all metrics that are sent to graphite.
2013-11-24 16:19:51 -08:00
Matei Zaharia 65de73c7f8 Merge pull request #185 from mkolod/random-number-generator
XORShift RNG with unit tests and benchmark

This patch was introduced to address SPARK-950 - the discussion below the ticket explains not only the rationale, but also the design and testing decisions: https://spark-project.atlassian.net/browse/SPARK-950

To run the unit test, start the SBT console and type:
compile
test-only org.apache.spark.util.XORShiftRandomSuite
To run benchmark, type:
project core
console
Once the Scala console starts, type:
org.apache.spark.util.XORShiftRandom.benchmark(100000000)
XORShiftRandom is also an object with a main method taking the
number of iterations as an argument, so you can also run it
from the command line.
2013-11-24 15:52:33 -08:00
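The commit message for pull request #185 does not show the generator itself. As a rough illustration of the general xorshift technique, here is a minimal 64-bit sketch. The class name `XorShift64`, the shift triple (21, 35, 4), and the fallback seed are assumptions chosen for illustration, not a copy of Spark's `XORShiftRandom`.

```scala
// Illustrative 64-bit xorshift generator (hypothetical class, not Spark's XORShiftRandom).
final class XorShift64(initialSeed: Long) {
  // An all-zero state is a fixed point of xorshift, so replace it with an
  // arbitrary non-zero constant (assumption made for this sketch).
  private var state: Long =
    if (initialSeed == 0L) 0x2545F4914F6CDD1DL else initialSeed

  /** Advances the state with three shift-and-xor steps and returns it. */
  def nextLong(): Long = {
    var x = state
    x = x ^ (x << 21)
    x = x ^ (x >>> 35)
    x = x ^ (x << 4)
    state = x
    x
  }
}
```

Two generators built from the same seed produce the same sequence, which is what makes a unit test of such a generator repeatable.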
Reynold Xin 972171b9d9 Merge pull request #197 from aarondav/patrick-fix
Fix 'timeWriting' stat for shuffle files

Due to concurrent git branches, changes from shuffle file consolidation patch
caused the shuffle write timing patch to no longer actually measure the time,
since it requires time be measured after the stream has been closed.
2013-11-25 07:50:46 +08:00
Reynold Xin a1a7e3627c Merge pull request #75 from amplab/simplify
Simplify GraphImpl internals
2013-11-24 05:15:09 -08:00
Reynold Xin 718cc803f7 Merge pull request #200 from mateiz/hash-fix
AppendOnlyMap fixes

- Chose a more random reshuffling step for values returned by Object.hashCode to avoid some long chaining that was happening for consecutive integers (e.g. `sc.makeRDD(1 to 100000000, 100).map(t => (t, t)).reduceByKey(_ + _).count`)
- Some other small optimizations throughout (see commit comments)
2013-11-24 11:02:02 +08:00
Matei Zaharia 9837a60234 Some other optimizations to AppendOnlyMap:
- Don't check keys for equality when re-inserting due to growing the
  table; the keys will already be unique
- Remember the grow threshold instead of recomputing it on each insert
2013-11-23 17:38:29 -08:00
Matei Zaharia 7535d7fbcb Fixes to AppendOnlyMap:
- Use Murmur Hash 3 finalization step to scramble the bits of HashCode
  instead of the simpler version in java.util.HashMap; the latter one
  had trouble with ranges of consecutive integers. Murmur Hash 3 is used
  by fastutil.
- Use Object.equals() instead of Scala's == to compare keys, because the
  latter does extra casts for numeric types (see the equals method in
  https://github.com/scala/scala/blob/master/src/library/scala/runtime/BoxesRunTime.java)
2013-11-23 17:21:37 -08:00
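The "Murmur Hash 3 finalization step" named in commit 7535d7fbcb can be sketched as follows. The constants are those of the public MurmurHash3 32-bit finalizer; the object name `HashScramble` and the `slot` helper are hypothetical and only show how a scrambled hash might be mapped onto a power-of-two table. This is not the code from `AppendOnlyMap`.

```scala
// Sketch of the MurmurHash3 32-bit finalizer applied to hashCode values.
// HashScramble and slot are hypothetical names used only for illustration.
object HashScramble {
  /** MurmurHash3 fmix32: spreads every input bit across the output bits. */
  def fmix32(h0: Int): Int = {
    var h = h0
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  /** Table slot for a key; capacity is assumed to be a power of two. */
  def slot(key: Any, capacity: Int): Int =
    fmix32(key.hashCode) & (capacity - 1)
}
```

Every step of the finalizer is invertible, so distinct hash codes stay distinct after scrambling. Consecutive integers therefore no longer share long runs of identical low-order bits when masked down to a slot.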
Reynold Xin 51aa9d6e99 Merge pull request #198 from ankurdave/zipPartitions-preservesPartitioning
Support preservesPartitioning in RDD.zipPartitions

In `RDD.zipPartitions`, add support for a `preservesPartitioning` option (similar to `RDD.mapPartitions`) that reuses the first RDD's partitioner.
2013-11-23 19:46:46 +08:00
Ankur Dave c1507afc6c Support preservesPartitioning in RDD.zipPartitions 2013-11-23 03:03:31 -08:00
Ankur Dave fad6e70add Simplify GraphImpl internals 2013-11-23 02:59:56 -08:00
Ankur Dave ad56ae7bfd Support preservesPartitioning in RDD.zipPartitions 2013-11-23 02:32:37 -08:00
Reynold Xin 18ce7e940b Merge pull request #73 from jegonzal/TriangleCount
Triangle count
2013-11-22 17:02:40 -08:00
Aaron Davidson ccea38b759 Fix 'timeWriting' stat for shuffle files
Due to concurrent git branches, changes from shuffle file consolidation patch
caused the shuffle write timing patch to no longer actually measure the time,
since it requires time be measured after the stream has been closed.
2013-11-21 21:36:08 -08:00
Reynold Xin 086b097e33 Merge pull request #193 from aoiwelle/patch-1
Fix Kryo Serializer buffer documentation inconsistency

The documentation here is inconsistent with the coded default and other documentation.
2013-11-22 10:26:39 +08:00
Reynold Xin f20093c3af Merge pull request #196 from pwendell/master
TimeTrackingOutputStream should pass on calls to close() and flush().

Without this fix you get a huge number of open files when running shuffles.
2013-11-22 10:12:13 +08:00
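The fix described in pull request #196 follows from how `java.io.OutputStream` works: its default `flush()` and `close()` do nothing, so a wrapper that overrides only `write` never releases the stream it wraps. A minimal sketch of a timing wrapper that forwards both calls is shown below. `TimedOutputStream` and its members are hypothetical names, not Spark's `TimeTrackingOutputStream`.

```scala
import java.io.OutputStream

// Hypothetical timing wrapper; illustrates forwarding of flush() and close().
class TimedOutputStream(underlying: OutputStream) extends OutputStream {
  private var nanosWriting = 0L

  /** Total time spent in write calls on the underlying stream, in nanoseconds. */
  def timeWriting: Long = nanosWriting

  private def timed[A](body: => A): A = {
    val start = System.nanoTime()
    try body finally nanosWriting += System.nanoTime() - start
  }

  override def write(b: Int): Unit = timed(underlying.write(b))

  override def write(b: Array[Byte], off: Int, len: Int): Unit =
    timed(underlying.write(b, off, len))

  // The inherited OutputStream.flush() and close() are no-ops, so both must
  // be forwarded explicitly or the wrapped file handle stays open.
  override def flush(): Unit = underlying.flush()
  override def close(): Unit = underlying.close()
}
```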
Patrick Wendell 53b94ef2f5 TimeTrackingOutputStream should pass on calls to close() and flush().
Without this fix you get a huge number of open files after running
shuffles.
2013-11-21 17:20:15 -08:00
Neal Wiggins 21b5478ed6 Fix Kryo Serializer buffer inconsistency
The documentation here is inconsistent with the coded default and other documentation.
2013-11-20 16:19:25 -08:00
Reynold Xin 2fead510f7 Merge branch 'master' of github.com:tbfenet/incubator-spark
PartitionPruningRDD is using index from parent

I was getting an ArrayIndexOutOfBoundsException after doing a union on a pruned RDD. The index it was using on the partition was the index in the original RDD, not the index in the new pruned RDD.
2013-11-21 07:15:55 +08:00
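The indexing problem described above can be illustrated without Spark. In the sketch below, `ParentPartition`, `PrunedPartition`, and `prune` are hypothetical stand-ins. The point is that each surviving partition gets a fresh, dense index in the pruned collection while keeping a reference to its parent.

```scala
// Simplified model of partition pruning; all names here are hypothetical.
object PruneSketch {
  final case class ParentPartition(index: Int)
  final case class PrunedPartition(index: Int, parent: ParentPartition)

  /** Keeps the parents accepted by `keep` and renumbers them from zero. */
  def prune(parents: Seq[ParentPartition], keep: Int => Boolean): Seq[PrunedPartition] =
    parents
      .filter(p => keep(p.index))
      .zipWithIndex
      .map { case (p, newIndex) => PrunedPartition(newIndex, p) }
}
```

If the pruned partitions reused the parent indices instead, a consumer such as a union that sizes an array by the number of pruned partitions could index past its end.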
Matei Zaharia 4b895013cc Merge pull request #191 from hsaputra/removesemicolonscala
Cleanup to remove semicolons (;) from Scala code

- The main reason for this PR is to remove semicolons from single statements of Scala code.
- Remove unused imports as I see them.
- Fix the ASF comment header in some files (bad copy-paste, I suppose).
2013-11-20 10:36:10 -08:00
Marek Kolodziej 22724659db Make XORShiftRandom explicit in KMeans and roll it back for RDD 2013-11-20 07:03:36 -05:00
Reynold Xin 202f8e62f2 Merge pull request #74 from dcrankshaw/remove_sleep
Removed sleep from pagerank in Analytics
2013-11-20 03:26:08 -08:00
Joseph E. Gonzalez de3d6ee5a7 Fixing build after merging upstream changes. 2013-11-19 22:03:49 -08:00
Joseph E. Gonzalez 12cb19b1c1 Adding comments and addressing comments. 2013-11-19 21:37:29 -08:00
Joseph E. Gonzalez ae4ffc319a Setting the initial vertex set size to be small. 2013-11-19 21:36:15 -08:00
Joseph E. Gonzalez 18700b6e74 Switching mapReduceTriplets mapFunction to return iterator instead of array to allow optimizations of the returned object. 2013-11-19 21:36:15 -08:00
Joseph E. Gonzalez 983810ad69 Now with style. Addressing most of Reynold's comments. 2013-11-19 21:35:03 -08:00
Joseph E. Gonzalez 2093a17ff3 Adding triangle count code 2013-11-19 21:35:03 -08:00
Joseph E. Gonzalez 8719ba83c8 Modifying graph loaders to create initial vertex sets more efficiently and load undirected graphs. 2013-11-19 21:35:02 -08:00
Joseph E. Gonzalez 288ae310e7 adding test for collectNeighborIds 2013-11-19 21:03:00 -08:00
Joseph E. Gonzalez 2fc6f5bd47 Switching collectNeighborIds to use mapReduceTriplets directly 2013-11-19 21:03:00 -08:00
Joseph E. Gonzalez b12b2ccde8 Addressing bug in open hash set where getPos on a full open hash set could loop forever. 2013-11-19 21:03:00 -08:00
Dan Crankshaw 96fafdbd4b Removed sleep from pagerank in Analytics. 2013-11-19 20:39:34 -08:00
Marek Kolodziej bcc6ed30bf Formatting and scoping (private[spark]) updates 2013-11-19 20:50:38 -05:00
Henry Saputra 43dfac5132 Merge branch 'master' into removesemicolonscala 2013-11-19 16:57:57 -08:00
Henry Saputra 10be58f251 Another set of changes to remove unnecessary semicolon (;) from Scala code.
Passed the sbt/sbt compile and test
2013-11-19 16:56:23 -08:00
Ankur Dave 74ade9e035 Merge pull request #62 from dcrankshaw/partitioners
Allow user to choose a partitioner at runtime
2013-11-19 16:53:58 -08:00
Dan Crankshaw 34bcf1b32b Re-added slaves file for compatibility with Spark 2013-11-19 16:46:25 -08:00
Dan Crankshaw 37a524d91c Addressed code review comments. 2013-11-19 16:39:39 -08:00
Matei Zaharia f568912f85 Merge pull request #181 from BlackNiuza/fix_tasks_number
correct number of tasks in ExecutorsUI

Index `a` is not `execId` here
2013-11-19 16:11:31 -08:00
Matei Zaharia aa638ed9c1 Merge pull request #189 from tgravescs/sparkYarnErrorHandling
Improve Spark on Yarn Error handling

Improve CLI error handling and only allow a certain number of worker failures before failing the application. This helps prevent users from doing foolish things and having their jobs run forever, for instance using 32-bit Java but trying to allocate 8G containers. Without this change that loops forever; now it errors out after a certain number of retries. The number of tries is configurable. Also increase the frequency at which we ping the RM, to speed up getting containers if they die. The Yarn MR app defaults to pinging the RM every 1 second, so the default of 5 seconds here is fine, but that is configurable as well in case people want to change it.

I do want to make sure there aren't any cases where calling stopExecutors in CoarseGrainedSchedulerBackend would cause problems. I couldn't think of any, and I tested on a standalone cluster as well as on Yarn.
2013-11-19 16:05:44 -08:00