ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Tathagata Das	04c37b6f74	[SPARK-1332] Improve Spark Streaming's Network Receiver and InputDStream API [WIP] The current Network Receiver API makes it slightly complicated to right a new receiver as one needs to create an instance of BlockGenerator as shown in SocketReceiver https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51 Exposing the BlockGenerator interface has made it harder to improve the receiving process. The API of NetworkReceiver (which was not a very stable API anyways) needs to be change if we are to ensure future stability. Additionally, the functions like streamingContext.socketStream that create input streams, return DStream objects. That makes it hard to expose functionality (say, rate limits) unique to input dstreams. They should return InputDStream or NetworkInputDStream. This is still not yet implemented. This PR is blocked on the graceful shutdown PR #247 Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #300 from tdas/network-receiver-api and squashes the following commits: ea27b38 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api 3a4777c [Tathagata Das] Renamed NetworkInputDStream to ReceiverInputDStream, and ActorReceiver related stuff. 838dd39 [Tathagata Das] Added more events to the StreamingListener to report errors and stopped receivers. a75c7a6 [Tathagata Das] Address some PR comments and fixed other issues. 91bfa72 [Tathagata Das] Fixed bugs. 8533094 [Tathagata Das] Scala style fixes. 028bde6 [Tathagata Das] Further refactored receiver to allow restarting of a receiver. 43f5290 [Tathagata Das] Made functions that create input streams return InputDStream and NetworkInputDStream, for both Scala and Java. 2c94579 [Tathagata Das] Fixed graceful shutdown by removing interrupts on receiving thread. 9e37a0b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api 3223e95 [Tathagata Das] Refactored the code that runs the NetworkReceiver into further classes and traits to make them more testable. a36cc48 [Tathagata Das] Refactored the NetworkReceiver API for future stability.	2014-04-21 19:04:49 -07:00
Patrick Wendell	5a5b3346c7	Dev script: include RC name in git tag	2014-04-21 14:21:17 -07:00
CodingCat	43e4a29dac	SPARK-1399: show stage failure reason in UI https://issues.apache.org/jira/browse/SPARK-1399 refactor StageTable a bit to support additional column for failed stage Author: CodingCat <zhunansjtu@gmail.com> Author: Nan Zhu <CodingCat@users.noreply.github.com> Closes #421 from CodingCat/SPARK-1399 and squashes the following commits: 2caba36 [CodingCat] remove dummy tag 77cf305 [CodingCat] create dummy element to wrap columns 3989ce2 [CodingCat] address Aaron's comments 18fc09f [Nan Zhu] fix compile error 00ea30a [Nan Zhu] address Kay's comments 16ac83d [CodingCat] set a default value of failureReason 35df3df [CodingCat] address andrew's comments 06d21a4 [CodingCat] address andrew's comments 25a6db6 [CodingCat] style fix dc8856d [CodingCat] show stage failure reason in UI	2014-04-21 14:10:23 -07:00
Xiangrui Meng	b7df31eb34	SPARK-1539: RDDPage.scala contains RddPage class SPARK-1386 changed RDDPage to RddPage but didn't change the filename. I tried sbt/sbt publish-local. Inside the spark-core jar, the unit name is RDDPage.class and hence I got the following error: ~~~ [error] (run-main) java.lang.NoClassDefFoundError: org/apache/spark/ui/storage/RddPage java.lang.NoClassDefFoundError: org/apache/spark/ui/storage/RddPage at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59) at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:52) at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42) at org.apache.spark.SparkContext.<init>(SparkContext.scala:215) at MovieLensALS$.main(MovieLensALS.scala:38) at MovieLensALS.main(MovieLensALS.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.storage.RddPage at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59) at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:52) at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42) at org.apache.spark.SparkContext.<init>(SparkContext.scala:215) at MovieLensALS$.main(MovieLensALS.scala:38) at MovieLensALS.main(MovieLensALS.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) ~~~ This can be fixed after renaming RddPage to RDDPage, or renaming RDDPage.scala to RddPage.scala. I chose the former since the name `RDD` is common in Spark code. Author: Xiangrui Meng <meng@databricks.com> Closes #454 from mengxr/rddpage-fix and squashes the following commits: f75e544 [Xiangrui Meng] rename RddPage to RDDPage	2014-04-21 12:48:02 -07:00
Andrew Or	af46f1fd02	[Hot Fix] Ignore org.apache.spark.ui.UISuite tests #446 faced a connection refused exception from these tests, causing them to timeout and fail after a long time. For now, let's disable these tests. (We recently disabled the corresponding test in streaming in `7863ecca35`. These tests are very similar). Author: Andrew Or <andrewor14@gmail.com> Closes #466 from andrewor14/ignore-ui-tests and squashes the following commits: 6f5a362 [Andrew Or] Ignore org.apache.spark.ui.UISuite tests	2014-04-21 12:37:43 -07:00
Patrick Wendell	fb98488fc8	Clean up and simplify Spark configuration Over time as we've added more deployment modes, this have gotten a bit unwieldy with user-facing configuration options in Spark. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch but it makes the following improvements: 1. Improved `spark-env.sh.template` which was missing a lot of things users now set in that file. 2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces config variables spark.executor.extraJavaOpts, spark.executor.extraLibraryPath, and spark.executor.extraClassPath. 3. Adds ability to set these same variables for the driver using `spark-submit`. 4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This will allow setting both SparkConf options and other system properties utilized by `spark-submit`. 5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node. Author: Patrick Wendell <pwendell@gmail.com> Closes #299 from pwendell/config-cleanup and squashes the following commits: 127f301 [Patrick Wendell] Improvements to testing a006464 [Patrick Wendell] Moving properties file template. b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf 0086939 [Patrick Wendell] Minor style fixes af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide af0adf7 [Patrick Wendell] Automatically add user jar a56b125 [Patrick Wendell] Responses to Tom's review d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup a762901 [Patrick Wendell] Fixing test failures ffa00fe [Patrick Wendell] Review feedback fda0301 [Patrick Wendell] Note 308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN e83cd8f [Patrick Wendell] Changes to allow re-use of test applications be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set c2a2909 [Patrick Wendell] Test compile fixes 4ee6f9d [Patrick Wendell] Making YARN doc changes consistent afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors. b08893b [Patrick Wendell] Additional improvements. ace4ead [Patrick Wendell] Responses to review feedback. b72d183 [Patrick Wendell] Review feedback for spark env file 46555c1 [Patrick Wendell] Review feedback and import clean-ups 437aed1 [Patrick Wendell] Small fix 761ebcd [Patrick Wendell] Library path and classpath for drivers 7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script 5b0ba8e [Patrick Wendell] Don't ship executor envs 84cc5e5 [Patrick Wendell] Small clean-up 1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings 4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH 6eaf7d0 [Patrick Wendell] executorJavaOpts 0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS	2014-04-21 10:26:33 -07:00
Michael Armbrust	3a390bfd80	REPL cleanup. Author: Michael Armbrust <michael@databricks.com> Closes #451 from marmbrus/replCleanup and squashes the following commits: 088526a [Michael Armbrust] REPL cleanup.	2014-04-19 17:33:37 -07:00
Tor Myklebust	25fc31884b	[SPARK-1535] ALS: Avoid the garbage-creating ctor of DoubleMatrix `new DoubleMatrix(double[])` creates a garbage `double[]` of the same length as its argument and immediately throws it away. This pull request avoids that constructor in the ALS code. Author: Tor Myklebust <tmyklebu@gmail.com> Closes #442 from tmyklebu/foo2 and squashes the following commits: 2784fc5 [Tor Myklebust] Mention that this is probably fixed as of jblas 1.2.4; repunctuate. a09904f [Tor Myklebust] Helper function for wrapping Array[Double]'s with DoubleMatrix's.	2014-04-19 15:10:18 -07:00
Michael Armbrust	10d04213ff	Add insertInto and saveAsTable to Python API. Author: Michael Armbrust <michael@databricks.com> Closes #447 from marmbrus/pythonInsert and squashes the following commits: c7ab692 [Michael Armbrust] Keep docstrings < 72 chars. ff62870 [Michael Armbrust] Add insertInto and saveAsTable to Python API.	2014-04-19 15:08:54 -07:00
Michael Armbrust	5d0f58b2eb	Use scala deprecation instead of java. This gets rid of a warning when compiling core (since we were depending on a deprecated interface with a non-deprecated function). I also tested with javac, and this does the right thing when compiling java code. Author: Michael Armbrust <michael@databricks.com> Closes #452 from marmbrus/scalaDeprecation and squashes the following commits: f628b4d [Michael Armbrust] Use scala deprecation instead of java.	2014-04-19 15:06:04 -07:00
Reynold Xin	28238c81d9	README update Author: Reynold Xin <rxin@apache.org> Closes #443 from rxin/readme and squashes the following commits: 16853de [Reynold Xin] Updated SBT and Scala instructions. 3ac3ceb [Reynold Xin] README update	2014-04-18 22:34:39 -07:00
zsxwing	2089e0e7e7	SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and save... ...AsNewAPIHadoopDataset `writer.close` should be put in the `finally` block to avoid potential resource leaks. JIRA: https://issues.apache.org/jira/browse/SPARK-1482 Author: zsxwing <zsxwing@gmail.com> Closes #400 from zsxwing/SPARK-1482 and squashes the following commits: 06b197a [zsxwing] SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset	2014-04-18 17:49:22 -07:00
Michael Armbrust	c399baa0fc	SPARK-1456 Remove view bounds on Ordered in favor of a context bound on Ordering. This doesn't require creating new Ordering objects per row. Additionally, [view bounds are going to be deprecated](https://issues.scala-lang.org/browse/SI-7629), so we should get rid of them while APIs are still flexible. Author: Michael Armbrust <michael@databricks.com> Closes #410 from marmbrus/viewBounds and squashes the following commits: c574221 [Michael Armbrust] fix example. 812008e [Michael Armbrust] Update Java API. 1b9b85c [Michael Armbrust] Update scala doc. 35798a8 [Michael Armbrust] Remove view bounds on Ordered in favor of a context bound on Ordering.	2014-04-18 12:04:13 -07:00
Reynold Xin	81a152c54b	Fixed broken pyspark shell. Author: Reynold Xin <rxin@apache.org> Closes #444 from rxin/pyspark and squashes the following commits: fc11356 [Reynold Xin] Made the PySpark shell version checking compatible with Python 2.6. 571830b [Reynold Xin] Fixed broken pyspark shell.	2014-04-18 10:10:13 -07:00
CodingCat	3c7a9bae96	SPARK-1523: improve the readability of code in AkkaUtil Actually it is separated from https://github.com/apache/spark/pull/85 as suggested by @rxin compare https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L122 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L117 the first one use get and then toLong, the second one getLong....better to make them consistent very very small fix........ Author: CodingCat <zhunansjtu@gmail.com> Closes #434 from CodingCat/SPARK-1523 and squashes the following commits: 0e86f3f [CodingCat] improve the readability of code in AkkaUtil	2014-04-18 10:05:00 -07:00
Sean Owen	8aa1f4c4f6	SPARK-1357 (addendum). More Experimental items in MLlib Per discussion, this is my suggestion to make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0. See what you think of this much. Author: Sean Owen <sowen@cloudera.com> Closes #372 from srowen/SPARK-1357Addendum and squashes the following commits: 17cf1ea [Sean Owen] Remove (another) blank line after ":: Experimental ::" 6800e4c [Sean Owen] Remove blank line after ":: Experimental ::" b3a88d2 [Sean Owen] Make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0	2014-04-18 10:04:02 -07:00
Xiangrui Meng	aa17f022c5	[SPARK-1520] remove fastutil from dependencies A quick fix for https://issues.apache.org/jira/browse/SPARK-1520 By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6. With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven. Author: Xiangrui Meng <meng@databricks.com> Closes #437 from mengxr/remove-fastutil and squashes the following commits: 00f9beb [Xiangrui Meng] remove fastutil from dependencies	2014-04-18 10:03:15 -07:00
Cheng Lian	89f47434e2	Reuses Row object in ExistingRdd.productToRowRdd() Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #432 from liancheng/reuseRow and squashes the following commits: 9e6d083 [Cheng Lian] Simplified code with BufferedIterator 52acec9 [Cheng Lian] Reuses Row object in ExistingRdd.productToRowRdd()	2014-04-18 10:02:27 -07:00
CodingCat	e31c8ffca6	SPARK-1483: Rename minSplits to minPartitions in public APIs https://issues.apache.org/jira/browse/SPARK-1483 From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz Author: CodingCat <zhunansjtu@gmail.com> Closes #430 from CodingCat/SPARK-1483 and squashes the following commits: 4b60541 [CodingCat] deprecate defaultMinSplits ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs	2014-04-18 10:01:16 -07:00
Patrick Wendell	7863ecca35	HOTFIX: Ignore streaming UI test This is currently causing many builds to hang. https://issues.apache.org/jira/browse/SPARK-1530 Author: Patrick Wendell <pwendell@gmail.com> Closes #440 from pwendell/uitest-fix and squashes the following commits: 9a143dc [Patrick Wendell] Ignore streaming UI test	2014-04-17 17:33:24 -07:00
Patrick Wendell	6c746ba3a9	FIX: Don't build Hive in assembly unless running Hive tests. This will make the tests more stable when not running SQL tests. Author: Patrick Wendell <pwendell@gmail.com> Closes #439 from pwendell/hive-tests and squashes the following commits: 88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests.	2014-04-17 17:24:00 -07:00
Thomas Graves	0058b5d2c7	SPARK-1408 Modify Spark on Yarn to point to the history server when app ... ...finishes Note this is dependent on https://github.com/apache/spark/pull/204 to have a working history server, but there are no code dependencies. This also fixes SPARK-1288 yarn stable finishApplicationMaster incomplete. Since I was in there I made the diagnostic message be passed properly. Author: Thomas Graves <tgraves@apache.org> Closes #362 from tgravescs/SPARK-1408 and squashes the following commits: ec89705 [Thomas Graves] Fix typo. 446122d [Thomas Graves] Make config yarn specific f5d5373 [Thomas Graves] SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes	2014-04-17 16:36:37 -05:00
Marcelo Vanzin	69047506bf	[SPARK-1395] Allow "local:" URIs to work on Yarn. This only works for the three paths defined in the environment (SPARK_JAR, SPARK_YARN_APP_JAR and SPARK_LOG4J_CONF). Tested by running SparkPi with local: and file: URIs against Yarn cluster (no "upload" shows up in logs in the local case). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #303 from vanzin/yarn-local and squashes the following commits: 82219c1 [Marcelo Vanzin] [SPARK-1395] Allow "local:" URIs to work on Yarn.	2014-04-17 10:29:38 -05:00
AbhishekKr	bb76eae1b5	[python alternative] pyspark require Python2, failing if system default is Py3 from shell.py Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py Author: AbhishekKr <abhikumar163@gmail.com> Closes #399 from abhishekkr/pyspark_shell and squashes the following commits: 134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py	2014-04-16 19:05:40 -07:00
Sandeep	6ad4c5498d	SPARK-1462: Examples of ML algorithms are using deprecated APIs This will also fix SPARK-1464: Update MLLib Examples to Use Breeze. Author: Sandeep <sandeep@techaddict.me> Closes #416 from techaddict/1462 and squashes the following commits: a43638e [Sandeep] Some Style Changes 3ce69c3 [Sandeep] Fix Ordering and Naming of Imports in Examples 6c7e543 [Sandeep] SPARK-1462: Examples of ML algorithms are using deprecated APIs	2014-04-16 18:23:07 -07:00
Michael Armbrust	d4916a8eeb	Include stack trace for exceptions thrown by user code. It is very confusing when your code throws an exception, but the only stack trace show is in the DAGScheduler. This is a simple patch to include the stack trace for the actual failure in the error message. Suggestions on formatting welcome. Before: ``` scala> sc.parallelize(1 :: Nil).map(_ => sys.error("Ahh!")).collect() org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times (most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037) ... ``` After: ``` org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh! scala.sys.package$.error(package.scala:27) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676) org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676) org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048) org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:110) org.apache.spark.scheduler.Task.run(Task.scala:50) org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211) org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:46) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1037) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:614) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:143) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ``` Author: Michael Armbrust <michael@databricks.com> Closes #409 from marmbrus/stacktraces and squashes the following commits: 3e4eb65 [Michael Armbrust] indent. include header for driver stack trace. 018b06b [Michael Armbrust] Include stack trace for exceptions in user code.	2014-04-16 18:12:56 -07:00
baishuo(白硕)	07b7ad3080	Update ReducedWindowedDStream.scala change _slideDuration to _windowDuration Author: baishuo(白硕) <vc_java@hotmail.com> Closes #425 from baishuo/master and squashes the following commits: 6f09ea1 [baishuo(白硕)] Update ReducedWindowedDStream.scala	2014-04-16 18:08:11 -07:00
Chen Chao	9c40b9ead0	misleading task number of groupByKey "By default, this uses only 8 parallel tasks to do the grouping." is a big misleading. Please refer to https://github.com/apache/spark/pull/389 detail is as following code : def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse for (r <- bySize if r.partitioner.isDefined) { return r.partitioner.get } if (rdd.context.conf.contains("spark.default.parallelism")) { new HashPartitioner(rdd.context.defaultParallelism) } else { new HashPartitioner(bySize.head.partitions.size) } } Author: Chen Chao <crazyjvm@gmail.com> Closes #403 from CrazyJvm/patch-4 and squashes the following commits: 42f6c9e [Chen Chao] fix format 829a995 [Chen Chao] fix format 1568336 [Chen Chao] misleading task number of groupByKey	2014-04-16 17:58:42 -07:00
Kan Zhang	38877ccf39	Fixing a race condition in event listener unit test Author: Kan Zhang <kzhang@apache.org> Closes #401 from kanzhang/fix-1475 and squashes the following commits: c6058bd [Kan Zhang] Fixing a race condition in event listener unit test	2014-04-16 17:39:11 -07:00
Chen Chao	016a87764a	remove unnecessary brace and semicolon in 'putBlockInfo.synchronize' block delete semicolon Author: Chen Chao <crazyjvm@gmail.com> Closes #411 from CrazyJvm/patch-5 and squashes the following commits: 72333a3 [Chen Chao] remove unnecessary brace de5d9a7 [Chen Chao] style fix	2014-04-16 17:30:01 -07:00
Ankur Dave	17d323455a	SPARK-1329: Create pid2vid with correct number of partitions Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition. GraphX currently creates one entry per vertex partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case. Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug. Author: Ankur Dave <ankurdave@gmail.com> Closes #368 from ankurdave/fix-pid2vid-size and squashes the following commits: 5a5c52a [Ankur Dave] SPARK-1329: Create pid2vid with correct number of partitions	2014-04-16 17:16:55 -07:00
Ankur Dave	235a47ce14	Rebuild routing table after Graph.reverse GraphImpl.reverse used to reverse edges in each partition of the edge RDD but preserve the routing table and replicated vertex view, since reversing should not affect partitioning. However, the old routing table would then have incorrect information for srcAttrOnly and dstAttrOnly. These RDDs should be switched. A simple fix is for Graph.reverse to rebuild the routing table and replicated vertex view. Thanks to Bogdan Ghidireac for reporting this issue on the [mailing list](http://apache-spark-user-list.1001560.n3.nabble.com/graph-reverse-amp-Pregel-API-td4338.html). Author: Ankur Dave <ankurdave@gmail.com> Closes #431 from ankurdave/fix-reverse-bug and squashes the following commits: 75d63cb [Ankur Dave] Rebuild routing table after Graph.reverse	2014-04-16 17:15:50 -07:00
Patrick Wendell	987760ec0a	Add clean to build	2014-04-16 16:32:34 -07:00
Ye Xianjin	10b1c59dcc	[SPARK-1511] use Files.move instead of renameTo in TestUtils.scala JIRA issue:[SPARK-1511](https://issues.apache.org/jira/browse/SPARK-1511) TestUtils.createCompiledClass method use renameTo() to move files which fails when the src and dest files are in different disks or partitions. This pr uses Files.move() instead. The move method will try to use renameTo() and then fall back to copy() and delete(). I think this should handle this issue. I didn't found a test suite for this file, so I add file existence detection after file moving. Author: Ye Xianjin <advancedxy@gmail.com> Closes #427 from advancedxy/SPARK-1511 and squashes the following commits: a2b97c7 [Ye Xianjin] Based on @srowen's comment, assert file existence. 6f95550 [Ye Xianjin] use Files.move instead of renameTo to handle the src and dest files are in different disks or partitions.	2014-04-16 14:56:22 -07:00
xuan	725925cf21	SPARK-1465: Spark compilation is broken with the latest hadoop-2.4.0 release YARN-1824 changes the APIs (addToEnvironment, setEnvFromInputString) in Apps, which causes the spark build to break if built against a version 2.4.0. To fix this, create the spark own function to do that functionality which will not break compiling against 2.3 and other 2.x versions. Author: xuan <xuan@MacBook-Pro.local> Author: xuan <xuan@macbook-pro.home> Closes #396 from xgong/master and squashes the following commits: 42b5984 [xuan] Remove two extra imports bc0926f [xuan] Remove usage of org.apache.hadoop.util.Shell be89fa7 [xuan] fix Spark compilation is broken with the latest hadoop-2.4.0 release	2014-04-16 14:41:22 -05:00
Sandeep	e269c24db7	SPARK-1469: Scheduler mode should accept lower-case definitions and have... ... nicer error messages There are two improvements to Scheduler Mode: 1. Made the built in ones case insensitive (fair/FAIR, fifo/FIFO). 2. If an invalid mode is given we should print a better error message. Author: Sandeep <sandeep@techaddict.me> Closes #388 from techaddict/1469 and squashes the following commits: a31bbd5 [Sandeep] SPARK-1469: Scheduler mode should accept lower-case definitions and have nicer error messages There are two improvements to Scheduler Mode: 1. Made the built in ones case insensitive (fair/FAIR, fifo/FIFO). 2. If an invalid mode is given we should print a better error message.	2014-04-16 09:58:57 -07:00
Patrick Wendell	82349fbd2b	Minor addition to SPARK-1497	2014-04-16 09:43:17 -07:00
Sean Owen	77f8367996	SPARK-1497. Fix scalastyle warnings in YARN, Hive code (I wasn't sure how to automatically set `SPARK_YARN=true` and `SPARK_HIVE=true` when running scalastyle, but these are the errors that turn up.) Author: Sean Owen <sowen@cloudera.com> Closes #413 from srowen/SPARK-1497 and squashes the following commits: f0c9318 [Sean Owen] Fix more scalastyle warnings in yarn 80bf4c3 [Sean Owen] Add YARN alpha / YARN profile to scalastyle check 026319c [Sean Owen] Fix scalastyle warnings in YARN, Hive code	2014-04-16 09:34:59 -07:00
Holden Karau	c3527a333a	SPARK-1310: Start adding k-fold cross validation to MLLib [adds kFold to MLUtils & fixes bug in BernoulliSampler] Author: Holden Karau <holden@pigscanfly.ca> Closes #18 from holdenk/addkfoldcrossvalidation and squashes the following commits: 208db9b [Holden Karau] Fix a bad space e84f2fc [Holden Karau] Fix the test, we should be looking at the second element instead 6ddbf05 [Holden Karau] swap training and validation order 7157ae9 [Holden Karau] CR feedback 90896c7 [Holden Karau] New line 150889c [Holden Karau] Fix up error messages in the MLUtilsSuite 2cb90b3 [Holden Karau] Fix the names in kFold c702a96 [Holden Karau] Fix imports in MLUtils e187e35 [Holden Karau] Move { up to same line as whenExecuting(random) in RandomSamplerSuite.scala c5b723f [Holden Karau] clean up 7ebe4d5 [Holden Karau] CR feedback, remove unecessary learners (came back during merge mistake) and insert an empty line bb5fa56 [Holden Karau] extra line sadness 163c5b1 [Holden Karau] code review feedback 1.to -> 1 to and folds -> numFolds 5a33f1d [Holden Karau] Code review follow up. e8741a7 [Holden Karau] CR feedback b78804e [Holden Karau] Remove cross validation [TODO in another pull request] 91eae64 [Holden Karau] Consolidate things in mlutils 264502a [Holden Karau] Add a test for the bug that was found with BernoulliSampler not copying the complement param dd0b737 [Holden Karau] Wrap long lines (oops) c0b7fa4 [Holden Karau] Switch FoldedRDD to use BernoulliSampler and PartitionwiseSampledRDD 08f8e4d [Holden Karau] Fix BernoulliSampler to respect complement a751ec6 [Holden Karau] Add k-fold cross validation to MLLib	2014-04-16 09:33:27 -07:00
Chen Chao	9edd88782e	update spark.default.parallelism actually, the value 8 is only valid in mesos fine-grained mode : <code> override def defaultParallelism() = sc.conf.getInt("spark.default.parallelism", 8) </code> while in coarse-grained model including mesos coares-grained, the value of the property depending on core numbers! <code> override def defaultParallelism(): Int = { conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)) } </code> Author: Chen Chao <crazyjvm@gmail.com> Closes #389 from CrazyJvm/patch-2 and squashes the following commits: 84a7fe4 [Chen Chao] miss </li> at the end of every single line 04a9796 [Chen Chao] change format ee0fae0 [Chen Chao] update spark.default.parallelism	2014-04-16 09:14:18 -07:00
Cheng Lian	fec462c153	Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME When running Hive tests, the working directory is `$SPARK_HOME/sql/hive`, while when running `sbt hive/console`, it becomes `$SPARK_HOME`, and test tables are not loaded if `HIVE_DEV_HOME` is not defined. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #417 from liancheng/loadTestTables and squashes the following commits: 7cea8d6 [Cheng Lian] Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME	2014-04-16 08:54:34 -07:00
Marcelo Vanzin	c0273d806e	Make "spark logo" link refer to "/". This is not an issue with the driver UI, but when you fire up the history server, there's currently no way to go back to the app listing page without editing the browser's location field (since the logo's link points to the root of the application's own UI - i.e. the "stages" tab). The change just points the logo link to "/", which is the app listing for the history server, and the stages tab for the driver's UI. Tested with both history server and live driver. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #408 from vanzin/web-ui-root and squashes the following commits: 1b60cb6 [Marcelo Vanzin] Make "spark logo" link refer to "/".	2014-04-16 08:53:01 -07:00
Cheng Lian	6a10d80162	[SPARK-959] Updated SBT from 0.13.1 to 0.13.2 JIRA issue: [SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959) SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removed previous workaround. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #426 from liancheng/updateSbt and squashes the following commits: 95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959	2014-04-16 08:52:14 -07:00
Michael Armbrust	273c2fd08d	[SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs This makes it possible to create tables and insert into them using the DSL and SQL for the scala and java apis. Author: Michael Armbrust <michael@databricks.com> Closes #354 from marmbrus/insertIntoTable and squashes the following commits: 6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests. f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable 765c506 [Michael Armbrust] Add to JavaAPI. 77b512c [Michael Armbrust] typos. 5c3ef95 [Michael Armbrust] use names for boolean args. 882afdf [Michael Armbrust] Change createTableAs to saveAsTable. Clean up api annotations. d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables. fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well. Add createTableAs function.	2014-04-15 20:40:40 -07:00
Matei Zaharia	63ca581d9c	[WIP] SPARK-1430: Support sparse data in Python MLlib This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type. On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models. Some to-do items left: - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector. - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling. - [x] Explain how to use these in the Python MLlib docs. CC @mengxr, @joshrosen Author: Matei Zaharia <matei@databricks.com> Closes #341 from mateiz/py-ml-update and squashes the following commits: d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge b9f97a3 [Matei Zaharia] Fix test 1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python 88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs 37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script. a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights 74eefe7 [Matei Zaharia] Added LabeledPoint class in Python 889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict a5d6426 [Matei Zaharia] Add linalg.py to run-tests script 0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data 2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data 154f45d [Matei Zaharia] Update docs, name some magic values 881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact	2014-04-15 20:33:24 -07:00
Xiangrui Meng	8517911efb	[FIX] update sbt-idea to version 1.6.0 I saw `No "scala-library*.jar" in Scala compiler library` error in IDEA. It seems upgrading `sbt-idea` to 1.6.0 fixed the problem. Author: Xiangrui Meng <meng@databricks.com> Closes #419 from mengxr/idea-plugin and squashes the following commits: fb3c35f [Xiangrui Meng] update sbt-idea to version 1.6.0	2014-04-15 19:37:32 -07:00
Patrick Wendell	5aaf9836f1	SPARK-1455: Better isolation for unit tests. This is a simple first step towards avoiding running the Hive tests whenever possible. Author: Patrick Wendell <pwendell@gmail.com> Closes #420 from pwendell/test-isolation and squashes the following commits: 350c8af [Patrick Wendell] SPARK-1455: Better isolation for unit tests.	2014-04-15 19:34:39 -07:00
Manish Amde	07d72fe696	Decision Tree documentation for MLlib programming guide Added documentation for user to use the decision tree algorithms for classification and regression in Spark 1.0 release. Apart from a general review, I need specific input on the following: * I had to move a lot of the existing documentation under the linear methods umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release. * I have not looked closely at pyspark but I am wondering new mllib algorithms are automatically plugged in or do we need to some extra work to call mllib functions from pyspark. I will add to the pyspark examples based upon the advice I get. cc: @mengxr, @hirakendu, @etrain, @atalwalkar Author: Manish Amde <manish9ue@gmail.com> Closes #402 from manishamde/tree_doc and squashes the following commits: 022485a [Manish Amde] more documentation 865826e [Manish Amde] minor: grammar dbb0e5e [Manish Amde] minor improvements to text b9ef6c4 [Manish Amde] basic decision tree code examples 6e297d7 [Manish Amde] added subsections f427e84 [Manish Amde] renaming sections 9c0c4be [Manish Amde] split candidate 6925275 [Manish Amde] impurity and information gain 94fd2f9 [Manish Amde] more reorg b93125c [Manish Amde] more subsection reorg 3ecb2ad [Manish Amde] minor text addition 1537dd3 [Manish Amde] added placeholders and some doc d06511d [Manish Amde] basic skeleton	2014-04-15 11:14:28 -07:00
DB Tsai	6843d637e7	[SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation. This PR uses Breeze's L-BFGS implement, and Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr ! When use with regularized updater, we need compute the regVal and regGradient (the gradient of regularized part in the cost function), and in the currently updater design, we can compute those two values by the following way. Let's review how updater works when returning newWeights given the input parameters. w' = w - thisIterStepSize * (gradient + regGradient(w)) Note that regGradient is function of w! If we set gradient = 0, thisIterStepSize = 1, then regGradient(w) = w - w' As a result, for regVal, it can be computed by val regVal = updater.compute( weights, new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2 and for regGradient, it can be obtained by val regGradient = weights.sub( updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1) The PR includes the tests which compare the result with SGD with/without regularization. We did a comparison between LBFGS and SGD, and often we saw 10x less steps in LBFGS while the cost of per step is the same (just computing the gradient). The following is the paper by Prof. Ng at Stanford comparing different optimizers including LBFGS and SGD. They use them in the context of deep learning, but worth as reference. http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf Author: DB Tsai <dbtsai@alpinenow.com> Closes #353 from dbtsai/dbtsai-LBFGS and squashes the following commits: 984b18e [DB Tsai] L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.	2014-04-15 11:12:47 -07:00
William Benton	2580a3b1a0	SPARK-1501: Ensure assertions in Graph.apply are asserted. The Graph.apply test in GraphSuite had some assertions in a closure in a graph transformation. As a consequence, these assertions never actually executed. Furthermore, these closures had a reference to (non-serializable) test harness classes because they called assert(), which could be a problem if we proactively check closure serializability in the future. This commit simply changes the Graph.apply test to collect the graph triplets so it can assert about each triplet from a map method. Author: William Benton <willb@redhat.com> Closes #415 from willb/graphsuite-nop-fix and squashes the following commits: 0b63658 [William Benton] Ensure assertions in Graph.apply are asserted.	2014-04-15 10:38:42 -07:00

1 2 3 4 5 ...

6724 commits