Commit graph

2129 commits

Author SHA1 Message Date
Evan Chan 27726079e4 Print out more friendly error if listFiles() fails
listFiles() could return null if the I/O fails, and this currently results in an ugly NPE which is hard to diagnose.
2013-09-09 12:58:12 -07:00
Y.CORP.YAHOO.COM\tgraves 2186d93285 Add metrics-ganglia to core pom file 2013-09-09 12:37:33 -05:00
Stephen Haberman 59003d387d Use a set since shuffle could change order. 2013-09-09 11:45:03 -05:00
Stephen Haberman 6471bfec73 Reword 'evenly distributed' to 'distributed with a hash partitioner. 2013-09-09 11:44:15 -05:00
Matei Zaharia bf984e2745 Merge pull request #890 from mridulm/master
Fix hash bug
2013-09-08 23:50:24 -07:00
Reynold Xin e9d4f44a7a Merge pull request #909 from mateiz/exec-id-fix
Fix an instance where full standalone mode executor IDs were passed to
2013-09-08 23:36:48 -07:00
Matei Zaharia 7d3204b056 Merge pull request #905 from mateiz/docs2
Job scheduling and cluster mode docs
2013-09-08 21:39:12 -07:00
Patrick Wendell f68848d95d Merge pull request #906 from pwendell/ganglia-sink
Clean-up of Metrics Code/Docs and Add Ganglia Sink
2013-09-08 18:32:16 -07:00
Matei Zaharia f9b7f58de2 Fix an instance where full standalone mode executor IDs were passed to
StandaloneSchedulerBackend instead of the smaller IDs used within Spark
(that lack the application name).

This was reported by ClearStory in
https://github.com/clearstorydata/spark/pull/9.

Also fixed some messages that said slave instead of executor.
2013-09-08 18:27:50 -07:00
Matei Zaharia 170b3869ee Fix unit test failure due to changed default 2013-09-08 17:51:27 -07:00
Patrick Wendell b4e382c210 Adding sc name in metrics source 2013-09-08 16:06:49 -07:00
Patrick Wendell c190b48bf5 Adding more docs and some code cleanup 2013-09-08 13:46:28 -07:00
Stephen Haberman df5fd35273 Add better docs for coalesce.
Include the useful tip that if shuffle=true, coalesce can actually
increase the number of partitions.

This makes coalesce more like a generic `RDD.repartition` operation.

(Ideally this `RDD.repartition` could automatically choose either a coalesce or
a shuffle if numPartitions was either less than or greater than, respectively,
the current number of partitions.)
2013-09-08 15:39:04 -05:00
Matei Zaharia 04cfb3aa9d Merge pull request #898 from ilikerps/660
SPARK-660: Add StorageLevel support in Python
2013-09-08 10:33:20 -07:00
Patrick Wendell 8de8ee5d3c Ganglia sink 2013-09-08 10:08:18 -07:00
Matei Zaharia 651a96adf7 More fair scheduler docs and property names.
Also changed uses of "job" terminology to "application" when they
referred to an entire Spark program, to avoid confusion.
2013-09-08 00:29:11 -07:00
Matei Zaharia 98fb69822c Work in progress:
- Add job scheduling docs
- Rename some fair scheduler properties
- Organize intro page better
- Link to Apache wiki for "contributing to Spark"
2013-09-08 00:29:11 -07:00
Aaron Davidson c1cc8c4da2 Export StorageLevel and refactor 2013-09-07 14:41:31 -07:00
Aaron Davidson 8001687af5 Remove reflection, hard-code StorageLevels
The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise
the shell would have to call a private method of SparkContext. Having
StorageLevel available in sc also doesn't seem like the end of the world.
There may be a better solution, though.

As for creating the StorageLevel object itself, this seems to be the best
way in Python 2 for creating singleton, enum-like objects:
http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
2013-09-07 09:34:07 -07:00
Reynold Xin 210eae26f4 Fixed the bug that ResultTask was not properly deserializing outputId. 2013-09-07 21:59:47 +08:00
Aaron Davidson b8a0b6ea5e Memoize StorageLevels read from JVM 2013-09-06 15:36:04 -07:00
Reynold Xin 1e15feb5a3 Hot fix to resolve the compilation error caused by SPARK-821. 2013-09-06 22:44:05 +08:00
Patrick Wendell ddcb9d310a Merge pull request #895 from ilikerps/821
SPARK-821: Don't cache results when action run locally on driver
2013-09-05 23:54:09 -07:00
Aaron Davidson a63d4c7dc2 SPARK-660: Add StorageLevel support in Python
It uses reflection... I am not proud of that fact, but it at least ensures
compatibility (sans refactoring of the StorageLevel stuff).
2013-09-05 23:36:27 -07:00
Aaron Davidson 3a04e76c89 Reynold's second round of comments 2013-09-05 21:43:26 -07:00
Matei Zaharia 699c331f2f Merge pull request #891 from xiajunluan/SPARK-864
[SPARK-864]DAGScheduler Exception if we delete Worker and StandaloneExecutorBackend then add Worker
2013-09-05 20:21:53 -07:00
Aaron Davidson 4f2236a1c5 Add unit test and address comments 2013-09-05 18:06:30 -07:00
Aaron Davidson 1418d18af4 SPARK-821: Don't cache results when action run locally on driver
Caching the results of local actions (e.g., rdd.first()) causes the driver to
store entire partitions in its own memory, which may be highly constrained.
This patch simply makes the CacheManager avoid caching the result of all locally-run computations.
2013-09-05 15:34:42 -07:00
Andrew xia 7c15e3c5de Fix bug SPARK-864 2013-09-05 15:56:11 +08:00
Patrick Wendell 5c7494d7c1 Merge pull request #893 from ilikerps/master
SPARK-884: Add unit test to validate Spark JSON output
2013-09-04 22:47:03 -07:00
Aaron Davidson 714e7f9e32 Fix line over 100 chars 2013-09-04 22:40:08 -07:00
Aaron Davidson 37db141aef Address Patrick's comments 2013-09-04 21:34:20 -07:00
Aaron Davidson 9e6f2b6822 SPARK-884: Add unit test to validate Spark JSON output
This unit test simply validates that the outputs of
the JsonProtocol methods are syntactically valid JSON.
2013-09-04 15:26:46 -07:00
Mridul Muralidharan 1e2474b814 Address review comments - rename toHash to nonNegativeHash 2013-09-04 07:46:46 +05:30
Mridul Muralidharan b3a82b7df3 Fix hash bug - caused failure after 35k stages, sigh 2013-09-04 07:02:25 +05:30
Mark Hamstra c9bc8af3d1 Removed repetative import; fixes hidden definition compiler warning. 2013-09-03 15:25:20 -07:00
Patrick Wendell c592a3c9b9 Minor spacing fix 2013-09-03 14:39:11 -07:00
Patrick Wendell 19f70273d2 Merge pull request #878 from tgravescs/yarnUILink
Link the Spark UI up to the Yarn UI
2013-09-03 14:29:10 -07:00
Matei Zaharia 68df2464d1 Merge pull request #889 from alig/master
Return the port the WebUI is bound to (useful if port 0 was used)
2013-09-03 13:01:17 -07:00
Y.CORP.YAHOO.COM\tgraves 41c1b5b9a0 Update based on review comments. Change function to prependBaseUri and fix formatting. 2013-09-03 14:46:51 -05:00
Y.CORP.YAHOO.COM\tgraves c8cc276110 Review comment changes and update to org.apache packaging 2013-09-03 10:50:21 -05:00
Y.CORP.YAHOO.COM\tgraves 547fc4a412 Merge remote-tracking branch 'mesos/master' into yarnUILink
Conflicts:
	core/src/main/scala/org/apache/spark/ui/UIUtils.scala
	core/src/main/scala/org/apache/spark/ui/jobs/PoolTable.scala
	core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala
	docs/running-on-yarn.md
2013-09-03 08:36:59 -05:00
Ali Ghodsi b25918d841 Merge branch 'master' of https://github.com/alig/spark
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
2013-09-03 00:56:12 -07:00
Ali Ghodsi bd0788505f Using configured akka timeouts 2013-09-03 00:50:35 -07:00
Ali Ghodsi cbfef9b3ff Sort order of imports to match project guidelines 2013-09-02 19:33:55 -07:00
Ali Ghodsi 36d8fca2cc Reynold's comment fixed 2013-09-02 19:31:09 -07:00
Ali Ghodsi e452bd6d77 Brushing the code up slightly 2013-09-02 19:04:08 -07:00
Ali Ghodsi cf7b115496 Enabling getting the actual WEBUI port 2013-09-02 18:21:21 -07:00
Matei Zaharia 12b2f1f9c9 Add missing license headers found with RAT 2013-09-02 12:23:03 -07:00
Matei Zaharia 246bf67f58 Fix test 2013-09-02 10:57:34 -07:00
Matei Zaharia 9329a7d4cd Fix spark.io.compression.codec and change default codec to LZF 2013-09-02 10:15:22 -07:00
Matei Zaharia 6550e5e60c Allow PySpark to launch worker.py directly on Windows 2013-09-01 18:06:15 -07:00
Matei Zaharia 3db404a43a Run script fixes for Windows after package & assembly change 2013-09-01 23:45:57 +00:00
Matei Zaharia 0a8cc30921 Move some classes to more appropriate packages:
* RDD, *RDDFunctions -> org.apache.spark.rdd
* Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util
* JavaSerializer, KryoSerializer -> org.apache.spark.serializer
2013-09-01 14:13:16 -07:00
Matei Zaharia 5701eb92c7 Fix some URLs 2013-09-01 14:13:16 -07:00
Matei Zaharia 12495ec63a Remove shutdown hook to stop jetty; this is unnecessary for releasing
ports and creates noisy log messages
2013-09-01 14:13:15 -07:00
Matei Zaharia 46eecd110a Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
Matei Zaharia a30fac16ca Merge pull request #883 from alig/master
Don't require the spark home environment variable to be set for standalone mode (change needed by SIMR)
2013-09-01 12:27:50 -07:00
Matei Zaharia e34bc3a8ee Small tweak 2013-08-31 17:47:15 -07:00
Matei Zaharia 2ee6a7e32a Print output from spark-daemon only when it fails to launch 2013-08-31 17:31:07 -07:00
Ali Ghodsi 250bddc255 Don't require spark home to be set for standalone mode 2013-08-31 17:29:05 -07:00
Matei Zaharia 25ac50668b Various web UI improvements:
- Use "fluid" layout that can expand to wide browser windows, instead of
  the old one's limit of 1200 px
- Remove unnecessary <hr> elements
- Switch back to Bootstrap's default theme and tweak progress bar colors
- Make headers more consistent between deploy and app UIs
- Replace some inline CSS with stylesheets
2013-08-31 16:55:40 -07:00
Y.CORP.YAHOO.COM\tgraves 96452eea56 fix up minor things 2013-08-30 16:04:31 -05:00
Y.CORP.YAHOO.COM\tgraves bac46266a9 Link the Spark UI to the Yarn UI 2013-08-30 15:55:32 -05:00
Mikhail Bautin 35090958b3 Also add getConf to NewHadoopRDD 2013-08-30 11:03:57 -07:00
Mikhail Bautin 5e30172f70 Make HadoopRDD's configuration accessible 2013-08-30 11:01:06 -07:00
Matei Zaharia ca71620950 Merge pull request #857 from mateiz/assembly
Change build and run instructions to use assemblies
2013-08-29 21:51:14 -07:00
Matei Zaharia 666d93c294 Update Maven build to create assemblies expected by new scripts
This includes the following changes:
- The "assembly" package now builds in Maven by default, and creates an
  assembly containing both hadoop-client and Spark, unlike the old
  BigTop distribution assembly that skipped hadoop-client
- There is now a bigtop-dist package to build the old BigTop assembly
- The repl-bin package is no longer built by default since the scripts
  don't reply on it; instead it can be enabled with -Prepl-bin
- Py4J is now included in the assembly/lib folder as a local Maven repo,
  so that the Maven package can link to it
- run-example now adds the original Spark classpath as well because the
  Maven examples assembly lists spark-core and such as provided
- The various Maven projects add a spark-yarn dependency correctly
2013-08-29 21:19:06 -07:00
Matei Zaharia aab345c463 Fix finding of assembly JAR, as well as some pointers to ./run 2013-08-29 21:19:06 -07:00
Matei Zaharia ab0e625d9e Fix PySpark for assembly run and include it in dist 2013-08-29 21:19:06 -07:00
Matei Zaharia 53cd50c069 Change build and run instructions to use assemblies
This commit makes Spark invocation saner by using an assembly JAR to
find all of Spark's dependencies instead of adding all the JARs in
lib_managed. It also packages the examples into an assembly and uses
that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script
with two better-named scripts: "run-examples" for examples, and
"spark-class" for Spark internal classes (e.g. REPL, master, etc). This
is also designed to minimize the confusion people have in trying to use
"run" to run their own classes; it's not meant to do that, but now at
least if they look at it, they can modify run-examples to do a decent
job for them.

As part of this, Bagel's examples are also now properly moved to the
examples package instead of bagel.
2013-08-29 21:19:04 -07:00
jerryshao f3dbe6b215 Fix removed block zero size log reporting 2013-08-30 09:39:01 +08:00
Patrick Wendell abdbacf252 Merge pull request #871 from pwendell/expose-local
Expose `isLocal` in SparkContext.
2013-08-28 21:11:31 -07:00
Patrick Wendell 30d2421112 Make local variable public 2013-08-28 19:53:31 -07:00
Matei Zaharia baa84e7e4c Merge pull request #865 from tgravescs/fixtmpdir
Spark on Yarn should use yarn approved directories for spark.local.dir and tmp
2013-08-28 12:44:46 -07:00
Y.CORP.YAHOO.COM\tgraves aac1214ee4 Change Executor to only look at the env variable SPARK_YARN_MODE 2013-08-28 13:26:26 -05:00
Y.CORP.YAHOO.COM\tgraves 3f206bf0b5 Updated based on review comments. 2013-08-27 14:34:27 -05:00
Y.CORP.YAHOO.COM\tgraves cf52a3cba6 Allow for Executors to have different directories then the Spark Master for Yarn 2013-08-27 11:00:21 -05:00
Reynold Xin a77e0abb96 Added worker state to the cluster master JSON ui. 2013-08-26 11:21:03 -07:00
Reynold Xin 9db1e50344 Revert "Merge pull request #841 from rxin/json"
This reverts commit 1fb1b09928, reversing
changes made to c69c48947d.
2013-08-26 11:05:14 -07:00
Matei Zaharia 8a36fd09dd Merge pull request #854 from markhamstra/pomUpdate
Synced sbt and maven builds to use the same dependencies, etc.
2013-08-22 10:13:35 -07:00
Matei Zaharia c2d00f12e2 Merge pull request #832 from alig/coalesce
Coalesced RDD with locality
2013-08-22 10:13:03 -07:00
Mark Hamstra ff6f1b0500 Synced sbt and maven builds 2013-08-21 13:50:24 -07:00
Mark Hamstra 5eea613ec0 Removed meaningless types 2013-08-20 16:49:18 -07:00
Ali Ghodsi f20ed14e87 Merged in from upstream to use TaskLocation instead of strings 2013-08-20 16:21:43 -07:00
Ali Ghodsi 5cd21c4195 added curly braces to make the code more consistent 2013-08-20 16:16:05 -07:00
Ali Ghodsi db4bc55bef indent 2013-08-20 16:16:05 -07:00
Ali Ghodsi c0942a710f Bug in test fixed 2013-08-20 16:16:05 -07:00
Ali Ghodsi 5db41919b5 Added a test to make sure no locality preferences are ignored 2013-08-20 16:16:05 -07:00
Ali Ghodsi 7b123b3126 Simpler code 2013-08-20 16:16:05 -07:00
Ali Ghodsi 9192c358e4 simpler code 2013-08-20 16:16:05 -07:00
Ali Ghodsi a75a64eade Fixed almost all of Matei's feedback 2013-08-20 16:16:05 -07:00
Ali Ghodsi f1c853d76d fixed Matei's comments 2013-08-20 16:16:04 -07:00
Ali Ghodsi 890ea6ba79 making CoalescedRDDPartition public 2013-08-20 16:16:04 -07:00
Ali Ghodsi d6b6c680be comment in the test to make it more understandable 2013-08-20 16:16:04 -07:00
Ali Ghodsi b69e7166ba Coalescer now uses current preferred locations for derived RDDs. Made run() in DAGScheduler thread safe and added a method to be able to ask it for preferred locations. Added a similar method that wraps the former inside SparkContext. 2013-08-20 16:16:04 -07:00
Ali Ghodsi 3b5bb8a4ae added one test that will test a future functionality 2013-08-20 16:13:37 -07:00
Ali Ghodsi 33a0f59354 Added error messages to the tests to make failed tests less cryptic 2013-08-20 16:13:37 -07:00
Ali Ghodsi abcefb3858 fixed matei's comments 2013-08-20 16:13:37 -07:00
Ali Ghodsi 35537e6341 Made a function object that returns the coalesced groups 2013-08-20 16:13:37 -07:00