ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Andrew Or	06e82d94b6	[Typo] In the maven docs: chd -> cdh Author: Andrew Or <andrewor14@gmail.com> Closes #548 from andrewor14/doc-typo and squashes the following commits: 3eaf4c4 [Andrew Or] chd -> cdh	2014-04-24 21:51:17 -07:00
Michael Armbrust	86ff8b1027	Generalize pattern for planning hash joins. This will be helpful for [SPARK-1495](https://issues.apache.org/jira/browse/SPARK-1495) and other cases where we want to have custom hash join implementations but don't want to repeat the logic for finding the join keys. Author: Michael Armbrust <michael@databricks.com> Closes #418 from marmbrus/hashFilter and squashes the following commits: d5cc79b [Michael Armbrust] Address @rxin 's comments. 366b6d9 [Michael Armbrust] style fixes 14560eb [Michael Armbrust] Generalize pattern for planning hash joins. f4809c1 [Michael Armbrust] Move common functions to PredicateHelper.	2014-04-24 21:42:33 -07:00
Tathagata Das	cd12dd9bde	[SPARK-1617] and [SPARK-1618] Improvements to streaming ui and bug fix to socket receiver 1617: These changes expose the receiver state (active or inactive) and last error in the UI 1618: If the socket receiver cannot connect in the first attempt, it should try to restart after a delay. That was broken, as the thread that restarts (hence, stops) the receiver waited on Thread.join on itself! Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #540 from tdas/streaming-ui-fix and squashes the following commits: e469434 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-ui-fix dbddf75 [Tathagata Das] Style fix. 66df1a5 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-ui-fix ad98bc9 [Tathagata Das] Refactored streaming listener to use ReceiverInfo. d7f849c [Tathagata Das] Revert "Moved BatchInfo from streaming.scheduler to streaming.ui" 5c80919 [Tathagata Das] Moved BatchInfo from streaming.scheduler to streaming.ui da244f6 [Tathagata Das] Fixed socket receiver as well as made receiver state and error visible in the streamign UI.	2014-04-24 21:34:37 -07:00
Mridul Muralidharan	968c0187a1	SPARK-1586 Windows build fixes Unfortunately, this is not exhaustive - particularly hive tests still fail due to path issues. Author: Mridul Muralidharan <mridulm80@apache.org> This patch had conflicts when merged, resolved by Committer: Matei Zaharia <matei@databricks.com> Closes #505 from mridulm/windows_fixes and squashes the following commits: ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy appparently cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch 3267f4b [Mridul Muralidharan] Fix build failures 35b277a [Mridul Muralidharan] Fix Scalastyle failures bc69d14 [Mridul Muralidharan] Change from hardcoded path separator 10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes 1337abd [Mridul Muralidharan] fix classpath while running in windows	2014-04-24 20:48:33 -07:00
tmalaska	d5c6ae6cc3	SPARK-1584: Upgrade Flume dependency to 1.4.0 Updated the Flume dependency in the maven pom file and the scala build file. Author: tmalaska <ted.malaska@cloudera.com> Closes #507 from tmalaska/master and squashes the following commits: 79492c8 [tmalaska] excluded all thrift 159c3f1 [tmalaska] fixed the flume pom file issues 5bf56a7 [tmalaska] Upgrade flume version	2014-04-24 20:31:17 -07:00
Ahir Reddy	e53eb4f015	[SPARK-986]: Job cancelation for PySpark * Additions to the PySpark API to cancel jobs * Monitor Thread in PythonRDD to kill Python workers if a task is interrupted Author: Ahir Reddy <ahirreddy@gmail.com> Closes #541 from ahirreddy/python-cancel and squashes the following commits: dfdf447 [Ahir Reddy] Changed success -> completed and made logging message clearer 6c860ab [Ahir Reddy] PR Comments 4b4100a [Ahir Reddy] Success flag adba6ed [Ahir Reddy] Destroy python workers 27a2f8f [Ahir Reddy] Start the writer thread... d422f7b [Ahir Reddy] Remove unnecesssary vals adda337 [Ahir Reddy] Busy wait on the ocntext.interrupted flag, and then kill the python worker d9e472f [Ahir Reddy] Revert "removed unnecessary vals" 5b9cae5 [Ahir Reddy] removed unnecessary vals 07b54d9 [Ahir Reddy] Fix canceling unit test 8ae9681 [Ahir Reddy] Don't interrupt worker 7722342 [Ahir Reddy] Monitor Thread for python workers db04e16 [Ahir Reddy] Added canceling api to PySpark	2014-04-24 20:21:10 -07:00
Andrew Or	ee6f7e22a4	[SPARK-1615] Synchronize accesses to the LiveListenerBus' event queue Original poster is @zsxwing, who reported this bug in #516. Much of SparkListenerSuite relies on LiveListenerBus's `waitUntilEmpty()` method. As the name suggests, this waits until the event queue is empty. However, the following race condition could happen: (1) We dequeue an event (2) The queue is empty, we return true (even though the event has not been processed) (3) The test asserts something assuming that all listeners have finished executing (and fails) (4) The listeners receive and process the event This PR makes (1) and (4) atomic by synchronizing around it. To do that, however, we must avoid using `eventQueue.take`, which is blocking and will cause a deadlock if we synchronize around it. As a workaround, we use the non-blocking `eventQueue.poll` + a semaphore to provide the same semantics. This has been a possible race condition for a long time, but for some reason we've never run into it. Author: Andrew Or <andrewor14@gmail.com> Closes #544 from andrewor14/stage-info-test-fix and squashes the following commits: 3cbe40c [Andrew Or] Merge github.com:apache/spark into stage-info-test-fix 56dbbcb [Andrew Or] Check if event is actually added before releasing semaphore eb486ae [Andrew Or] Synchronize accesses to the LiveListenerBus' event queue	2014-04-24 20:18:15 -07:00
jerryshao	80429f3e2a	[SPARK-1510] Spark Streaming metrics source for metrics system This pulls in changes made by @jerryshao in https://github.com/apache/spark/pull/424 and merges with the master. Author: jerryshao <saisai.shao@intel.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #545 from tdas/streaming-metrics and squashes the following commits: 034b443 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-metrics fb3b0a5 [jerryshao] Modify according master update 21939f5 [jerryshao] Style changes according to style check error 976116b [jerryshao] Add StreamSource in StreamingContext for better monitoring through metrics system	2014-04-24 18:56:57 -07:00
Thomas Graves	44da5ab2de	Spark 1489 Fix the HistoryServer view acls This allows the view acls set by the user to be enforced by the history server. It also fixes filters being applied properly. Author: Thomas Graves <tgraves@apache.org> Closes #509 from tgravescs/SPARK-1489 and squashes the following commits: 869c186 [Thomas Graves] change to either acls enabled or disabled 0d8333c [Thomas Graves] Add history ui policy to allow acls to either use application set, history server force acls on, or off 65148b5 [Thomas Graves] SPARK-1489 Fix the HistoryServer view acls	2014-04-24 18:38:10 -07:00
Michael Armbrust	4660991e67	[SQL] Add support for parsing indexing into arrays in SQL. Author: Michael Armbrust <michael@databricks.com> Closes #518 from marmbrus/parseArrayIndex and squashes the following commits: afd2d6b [Michael Armbrust] 100 chars c3d6026 [Michael Armbrust] Add support for parsing indexing into arrays in SQL.	2014-04-24 18:21:00 -07:00
Tathagata Das	526a518bf3	[SPARK-1592][streaming] Automatically remove streaming input blocks The raw input data is stored as blocks in BlockManagers. Earlier they were cleared by cleaner ttl. Now since streaming does not require cleaner TTL to be set, the block would not get cleared. This increases up the Spark's memory usage, which is not even accounted and shown in the Spark storage UI. It may cause the data blocks to spill over to disk, which eventually slows down the receiving of data (persisting to memory become bottlenecked by writing to disk). The solution in this PR is to automatically remove those blocks. The mechanism to keep track of which BlockRDDs (which has presents the raw data blocks as a RDD) can be safely cleared already exists. Just use it to explicitly remove blocks from BlockRDDs. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #512 from tdas/block-rdd-unpersist and squashes the following commits: d25e610 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist 5f46d69 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist 2c320cd [Tathagata Das] Updated configuration with spark.streaming.unpersist setting. 2d4b2fd [Tathagata Das] Automatically removed input blocks	2014-04-24 18:18:22 -07:00
Arun Ramakrishnan	35e3d199f0	SPARK-1438 RDD.sample() make seed param optional copying form previous pull request https://github.com/apache/spark/pull/462 Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None. In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention. Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params. sample(fraction, withReplacement=false, seed=math.random) Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it. If backward compatible is important, 3 new method can be introduced (without default params) like this sample(fraction) sample(fraction, withReplacement) sample(fraction, withReplacement, seed) Added some tests for the scala RDD takeSample method. Author: Arun Ramakrishnan <smartnut007@gmail.com> This patch had conflicts when merged, resolved by Committer: Matei Zaharia <matei@databricks.com> Closes #477 from smartnut007/master and squashes the following commits: 07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler 8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance. 69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue 0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample	2014-04-24 17:27:16 -07:00
CodingCat	f99af8529b	SPARK-1104: kill Process in workerThread of ExecutorRunner As reported in https://spark-project.atlassian.net/browse/SPARK-1104 By @pwendell: "Sometimes due to large shuffles executors will take a long time shutting down. In particular this can happen if large numbers of shuffle files are around (this will be alleviated by SPARK-1103, but nonetheless...). The symptom is you have DEAD workers sitting around in the UI and the existing workers keep trying to re-register but can't because they've been assumed dead." In this patch, I add lines in the handler of InterruptedException in workerThread of executorRunner, so that the process.destroy() and process.waitFor() can only block the workerThread instead of blocking the worker Actor... --------- analysis: process.destroy() is a blocking method, i.e. it only returns when all shutdownHook threads return...so calling it in Worker thread will make Worker block for a long while.... about what will happen on the shutdown hooks when the JVM process is killed: http://www.tutorialspoint.com/java/lang/runtime_addshutdownhook.htm Author: CodingCat <zhunansjtu@gmail.com> Closes #35 from CodingCat/SPARK-1104 and squashes the following commits: 85767da [CodingCat] add null checking and remove unnecessary killProce 3107aeb [CodingCat] address Aaron's comments eb615ba [CodingCat] kill the process when the error happens 0accf2f [CodingCat] set process to null after killed it 1d511c8 [CodingCat] kill Process in workerThread	2014-04-24 15:55:18 -07:00
Sandeep	a03ac222d8	Fix Scala Style Any comments are welcome Author: Sandeep <sandeep@techaddict.me> Closes #531 from techaddict/stylefix-1 and squashes the following commits: 7492730 [Sandeep] Pass 4 98b2428 [Sandeep] fix rxin suggestions b5e2e6f [Sandeep] Pass 3 05932d7 [Sandeep] fix if else styling 2 08690e5 [Sandeep] fix if else styling	2014-04-24 15:07:23 -07:00
Michael Armbrust	c5c1916dd1	SPARK-1494 Don't initialize classes loaded by MIMA excludes, attempt 2 [WIP] Looks like scala reflection was invoking the static initializer: ``` ... at org.apache.spark.sql.test.TestSQLContext$.<init>(TestSQLContext.scala:25) at org.apache.spark.sql.test.TestSQLContext$.<clinit>(TestSQLContext.scala) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:500) at scala.reflect.runtime.JavaMirrors$JavaMirror.tryJavaClass(JavaMirrors.scala:505) at scala.reflect.runtime.SymbolLoaders$PackageScope.lookupEntry(SymbolLoaders.scala:109) ... ``` Need to make sure that this doesn't change the exclusion semantics before merging. Author: Michael Armbrust <michael@databricks.com> Closes #526 from marmbrus/mima and squashes the following commits: 8168dea [Michael Armbrust] Spurious change afba262 [Michael Armbrust] Prevent Scala reflection from running static class initializer.	2014-04-24 14:54:23 -07:00
Thomas Graves	bd375094a1	Spark 1490 Add kerberos support to the HistoryServer Here I've added the ability for the History server to login from a kerberos keytab file so that the history server can be run as a super user and stay up for along period of time while reading the history files from HDFS. Author: Thomas Graves <tgraves@apache.org> Closes #513 from tgravescs/SPARK-1490 and squashes the following commits: e204a99 [Thomas Graves] remove extra logging 5418daa [Thomas Graves] fix typo in config 0076b99 [Thomas Graves] Update docs 4d76545 [Thomas Graves] SPARK-1490 Add kerberos support to the HistoryServer	2014-04-24 11:16:30 -07:00
zsxwing	78a49b2532	SPARK-1611: Fix incorrect initialization order in AppendOnlyMap JIRA: https://issues.apache.org/jira/browse/SPARK-1611 Author: zsxwing <zsxwing@gmail.com> Closes #534 from zsxwing/SPARK-1611 and squashes the following commits: 96af089 [zsxwing] SPARK-1611: Fix incorrect initialization order in AppendOnlyMap	2014-04-24 11:13:40 -07:00
Sean Owen	6338a93f10	SPARK-1488. Squash more language feature warnings in new commits by importing implicitConversion A recent commit reintroduced some of the same warnings that SPARK-1488 resolved. These are just a few more of the same changes to remove these warnings. Author: Sean Owen <sowen@cloudera.com> Closes #528 from srowen/SPARK-1488.2 and squashes the following commits: 62d592c [Sean Owen] More feature warnings in tests 4e2e94b [Sean Owen] Squash more language feature warnings in new commits by importing implicitConversion	2014-04-24 10:06:18 -07:00
Patrick Wendell	faeb761cbe	Small changes to release script	2014-04-24 10:00:34 -07:00
Takuya UESHIN	27b2821cf1	[SPARK-1610] [SQL] Fix Cast to use exact type value when cast from BooleanType to NumericTy... ...pe. `Cast` from `BooleanType` to `NumericType` are all using `Int` value. But it causes `ClassCastException` when the casted value is used by the following evaluation like the code below: ``` scala scala> import org.apache.spark.sql.catalyst._ import org.apache.spark.sql.catalyst._ scala> import types._ import types._ scala> import expressions._ import expressions._ scala> Add(Cast(Literal(true), ShortType), Literal(1.toShort)).eval() java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102) at scala.math.Numeric$ShortIsIntegral$.plus(Numeric.scala:72) at org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58) at org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58) at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114) at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58) at .<init>(<console>:17) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734) at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983) at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568) at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760) at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805) at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717) at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581) at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588) at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837) at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83) at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96) at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105) at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala) ``` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #533 from ueshin/issues/SPARK-1610 and squashes the following commits: 70f36e8 [Takuya UESHIN] Fix Cast to use exact type value when cast from BooleanType to NumericType.	2014-04-24 09:57:28 -07:00
Reynold Xin	1fdf659d2f	SPARK-1601 & SPARK-1602: two bug fixes related to cancellation This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached. 1. Do not put partially executed partitions into cache (in task killing). 2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs. Thanks @aarondav and @ahirreddy for reporting and helping debug. Author: Reynold Xin <rxin@apache.org> Closes #521 from rxin/kill and squashes the following commits: 401033f [Reynold Xin] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into kill 7a7bdd2 [Reynold Xin] Add a new line in the end of JobCancellationSuite.scala. 35cd9f7 [Reynold Xin] Fixed a bug that partially executed partitions can be put into cache (in task killing).	2014-04-24 00:27:45 -07:00
Mridul Muralidharan	dd681f502e	SPARK-1587 Fix thread leak mvn test fails (intermittently) due to thread leak - since scalatest runs all tests in same vm. Author: Mridul Muralidharan <mridulm80@apache.org> Closes #504 from mridulm/resource_leak_fixes and squashes the following commits: a5d10d0 [Mridul Muralidharan] Prevent thread leaks while running tests : cleanup all threads when SparkContext.stop is invoked. Causes tests to fail 7b5e19c [Mridul Muralidharan] Prevent NPE while running tests	2014-04-23 23:20:55 -07:00
Sandeep	bb68f47745	[Fix #79 ] Replace Breakable For Loops By While Loops Author: Sandeep <sandeep@techaddict.me> Closes #503 from techaddict/fix-79 and squashes the following commits: e3f6746 [Sandeep] Style changes 07a4f6b [Sandeep] for loop to While loop 0a6d8e9 [Sandeep] Breakable for loop to While loop	2014-04-23 22:47:59 -07:00
zsxwing	6ab7578067	SPARK-1589: Fix the incorrect compare JIRA: https://issues.apache.org/jira/browse/SPARK-1589 Author: zsxwing <zsxwing@gmail.com> Closes #508 from zsxwing/SPARK-1589 and squashes the following commits: 570c67a [zsxwing] SPARK-1589: Fix the incorrect compare	2014-04-23 22:36:02 -07:00
Ankur Dave	1d6abe3a4b	Mark all fields of EdgePartition, Graph, and GraphOps transient These classes are only serializable to work around closure capture, so their fields should all be marked `@transient` to avoid wasteful serialization. This PR supersedes apache/spark#519 and fixes the same bug. Author: Ankur Dave <ankurdave@gmail.com> Closes #520 from ankurdave/graphx-transient and squashes the following commits: 6431760 [Ankur Dave] Mark all fields of EdgePartition, Graph, and GraphOps `@transient`	2014-04-23 22:01:13 -07:00
Aaron Davidson	d485eecb72	Update Java api for setJobGroup with interruptOnCancel Also adds a unit test. Author: Aaron Davidson <aaron@databricks.com> Closes #522 from aarondav/cancel2 and squashes the following commits: 565c253 [Aaron Davidson] Update Java api for setJobGroup with interruptOnCancel 65b33d8 [Aaron Davidson] Add unit test for Thread interruption on cancellation	2014-04-23 22:00:22 -07:00
Andrew Or	4b2bab1d08	[Hot Fix #469 ] Fix flaky test in SparkListenerSuite The two modified tests may fail if the race condition does not bid in our favor... Author: Andrew Or <andrewor14@gmail.com> Closes #516 from andrewor14/stage-info-test-fix and squashes the following commits: b4b6100 [Andrew Or] Add/replace missing waitUntilEmpty() calls to listener bus	2014-04-23 21:59:33 -07:00
Matei Zaharia	640f9a0efe	[SPARK-1540] Add an optional Ordering parameter to PairRDDFunctions. In https://issues.apache.org/jira/browse/SPARK-1540 we'd like to look at Spark's API to see if we can take advantage of Comparable keys in more places, which will make external spilling more efficient. This PR is a first step towards that that shows how to pass an Ordering when available and still continue functioning otherwise. It does this using a new implicit parameter with a default value of null. The API is currently only in Scala -- in Java we'd have to add new versions of mapToPair and such that take a Comparator, or a new method to add a "type hint" to an RDD. We can address those later though. Unfortunately requiring all keys to be Comparable would not work without requiring RDDs in general to contain only Comparable types. The reason is that methods such as distinct() and intersection() do a shuffle, but should be usable on RDDs of any type. So ordering will have to remain an optimization for the types that can be ordered. I think this isn't a horrible outcome though because one of the nice things about Spark's API is that it works on objects of any type, without requiring you to specify a schema or implement Writable or stuff like that. Author: Matei Zaharia <matei@databricks.com> This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@apache.org> Closes #487 from mateiz/ordered-keys and squashes the following commits: bd565f6 [Matei Zaharia] Pass an Ordering to only one version of groupBy because the Scala language spec doesn't allow having an optional parameter on all of them (this was only compiling in Scala 2.10 due to a bug). 4629965 [Matei Zaharia] Add tests for other versions of groupBy 3beae85 [Matei Zaharia] Added a test for implicit orderings 80b7a3b [Matei Zaharia] Add an optional Ordering parameter to PairRDDFunctions.	2014-04-23 17:03:54 -07:00
Aaron Davidson	432201c7ee	SPARK-1582 Invoke Thread.interrupt() when cancelling jobs Sometimes executor threads are blocked waiting for IO or monitors, and the current implementation of job cancellation may never recover these threads. By simply invoking Thread.interrupt() during cancellation, we can often safely unblock the threads and use them for subsequent work. Note that this feature must remain optional for now because of a bug in HDFS where Thread.interrupt() may cause nodes to be marked as permanently dead (as the InterruptedException is reinterpreted as an IOException during communication with some node). Author: Aaron Davidson <aaron@databricks.com> Closes #498 from aarondav/cancel and squashes the following commits: e52b829 [Aaron Davidson] Don't use job.properties when null 82f78bb [Aaron Davidson] Update DAGSchedulerSuite b67f472 [Aaron Davidson] Add comment on why interruptOnCancel is in setJobGroup 4cb9fd6 [Aaron Davidson] SPARK-1582 Invoke Thread.interrupt() when cancelling jobs	2014-04-23 16:52:49 -07:00
Marcelo Vanzin	dd1b7a61d9	Honor default fs name when initializing event logger. This is related to SPARK-1459 / PR #375. Without this fix, FileLogger.createLogDir() may try to create the log dir on HDFS, while createWriter() will try to open the log file on the local file system, leading to interesting errors and confusion. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #450 from vanzin/event-file-2 and squashes the following commits: 592cdb3 [Marcelo Vanzin] Honor default fs name when initializing event logger.	2014-04-23 14:47:38 -07:00
Aaron Davidson	a967b005c8	SPARK-1572 Don't kill Executor if PythonRDD fails while computing parent Previously, the behavior was that if the parent RDD threw any exception other than IOException or FileNotFoundException (which is quite possible for Hadoop input sources), the entire Executor would crash, because the default thread a uncaught exception handler calls System.exit(). This patch avoids two related issues: 1. Always catch exceptions in this reader thread. 2. Don't mask readerException when Python throws an EOFError after worker.shutdownOutput() is called. Author: Aaron Davidson <aaron@databricks.com> Closes #486 from aarondav/pyspark and squashes the following commits: fbb11e9 [Aaron Davidson] Make sure FileNotFoundExceptions are handled same as before b9acb3e [Aaron Davidson] SPARK-1572 Don't kill Executor if PythonRDD fails while computing parent	2014-04-23 14:46:30 -07:00
zsxwing	a664606613	SPARK-1583: Fix a bug that using java.util.HashMap by mistake JIRA: https://issues.apache.org/jira/browse/SPARK-1583 Does anyone know why using `java.util.HashMap` rather than `mutable.HashMap`? Some methods of `java.util.HashMap` are not generics and compiler can not help us find similar problems. Author: zsxwing <zsxwing@gmail.com> Closes #500 from zsxwing/SPARK-1583 and squashes the following commits: 7bfd74d [zsxwing] SPARK-1583: Fix a bug that using java.util.HashMap by mistake	2014-04-23 14:12:20 -07:00
Patrick Wendell	cd4ed29326	SPARK-1119 and other build improvements 1. Makes assembly and examples jar naming consistent in maven/sbt. 2. Updates make-distribution.sh to use Maven and fixes some bugs. 3. Updates the create-release script to call make-distribution script. Author: Patrick Wendell <pwendell@gmail.com> Closes #502 from pwendell/make-distribution and squashes the following commits: 1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements	2014-04-23 10:19:32 -07:00
Michael Armbrust	39f85e0322	[SQL] SPARK-1571 Mistake in java example code Author: Michael Armbrust <michael@databricks.com> Closes #496 from marmbrus/javaBeanBug and squashes the following commits: 644fedd [Michael Armbrust] Bean methods must be public.	2014-04-22 22:19:32 -07:00
Michael Armbrust	8e95081333	SPARK-1494 Don't initialize classes loaded by MIMA excludes. [WIP] Just seeing how Jenkins likes this... Author: Michael Armbrust <michael@databricks.com> Closes #494 from marmbrus/mima and squashes the following commits: 6eec616 [Michael Armbrust] Force hive tests to run. acaf682 [Michael Armbrust] Don't initialize loaded classes.	2014-04-22 22:02:42 -07:00
Michael Armbrust	aa77f8a6a6	SPARK-1562 Fix visibility / annotation of Spark SQL APIs Author: Michael Armbrust <michael@databricks.com> Closes #489 from marmbrus/sqlDocFixes and squashes the following commits: acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs	2014-04-22 20:02:33 -07:00
Xiangrui Meng	662c860ebc	[FIX: SPARK-1376] use --arg instead of --args in SparkSubmit to avoid warning messages Even if users use `--arg`, `SparkSubmit` still uses `--args` for child args internally, which triggers a warning message that may confuse users: ~~~ --args is deprecated. Use --arg instead. ~~~ @sryza Does it look good to you? Author: Xiangrui Meng <meng@databricks.com> Closes #485 from mengxr/submit-arg and squashes the following commits: 5e1b9fe [Xiangrui Meng] update test cebbeb7 [Xiangrui Meng] use --arg instead of --args in SparkSubmit to avoid warning messages	2014-04-22 19:38:27 -07:00
Tathagata Das	f3d19a9f1a	[streaming][SPARK-1578] Removed requirement for TTL in StreamingContext. Since shuffles and RDDs that are out of context are automatically cleaned by Spark core (using ContextCleaner) there is no need for setting the cleaner TTL while creating a StreamingContext. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #491 from tdas/ttl-fix and squashes the following commits: cf01dc7 [Tathagata Das] Removed requirement for TTL in StreamingContext.	2014-04-22 19:35:13 -07:00
Andrew Or	2de573877f	[Spark-1538] Fix SparkUI incorrectly hiding persisted RDDs Bug: After the following command `sc.parallelize(1 to 1000).persist.map(_ + 1).count()` is run, the the persisted RDD is missing from the storage tab of the SparkUI. Cause: The command creates two RDDs in one stage, a `ParallelCollectionRDD` and a `MappedRDD`. However, the existing StageInfo only keeps the RDDInfo of the last RDD associated with the stage (`MappedRDD`), and so all RDD information regarding the first RDD (`ParallelCollectionRDD`) is discarded. In this case, we persist the first RDD, but the StorageTab doesn't know about this RDD because it is not encoded in the StageInfo. Fix: Record information of all RDDs in StageInfo, instead of just the last RDD (i.e. `stage.rdd`). Since stage boundaries are marked by shuffle dependencies, the solution is to traverse the last RDD's dependency tree, visiting only ancestor RDDs related through a sequence of narrow dependencies. --- This PR also moves RDDInfo to its own file, includes a few style fixes, and adds a unit test for constructing StageInfos. Author: Andrew Or <andrewor14@gmail.com> Closes #469 from andrewor14/storage-ui-fix and squashes the following commits: 07fc7f0 [Andrew Or] Add back comment that was accidentally removed (minor) 5d799fe [Andrew Or] Add comment to justify testing of getNarrowAncestors with cycles 9d0e2b8 [Andrew Or] Hide details of getNarrowAncestors from outsiders d2bac8a [Andrew Or] Deal with cycles in RDD dependency graph + add extensive tests 2acb177 [Andrew Or] Move getNarrowAncestors to RDD.scala bfe83f0 [Andrew Or] Backtrace RDD dependency tree to find all RDDs that belong to a Stage	2014-04-22 19:24:03 -07:00
Patrick Wendell	995fdc96bc	Assorted clean-up for Spark-on-YARN. In particular when the HADOOP_CONF_DIR is not not specified. Author: Patrick Wendell <pwendell@gmail.com> Closes #488 from pwendell/hadoop-cleanup and squashes the following commits: fe95f13 [Patrick Wendell] Changes based on Andrew's feeback 18d09c1 [Patrick Wendell] Review comments from Andrew 17929cc [Patrick Wendell] Assorted clean-up for Spark-on-YARN.	2014-04-22 19:22:06 -07:00
Kan Zhang	ea8cea82a0	[SPARK-1570] Fix classloading in JavaSQLContext.applySchema I think I hit a class loading issue when running JavaSparkSQL example using spark-submit in local mode. Author: Kan Zhang <kzhang@apache.org> Closes #484 from kanzhang/SPARK-1570 and squashes the following commits: feaaeba [Kan Zhang] [SPARK-1570] Fix classloading in JavaSQLContext.applySchema	2014-04-22 15:05:12 -07:00
Marcelo Vanzin	0ea0b1a2d6	Fix compilation on Hadoop 2.4.x. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #483 from vanzin/yarn-2.4 and squashes the following commits: 0fc57d8 [Marcelo Vanzin] Fix compilation on Hadoop 2.4.x.	2014-04-22 14:28:41 -07:00
Andrew Or	745e496c59	[Fix #204 ] Eliminate delay between binding and log checking Bug: In the existing history server, there is a `spark.history.updateInterval` seconds delay before application logs show up on the UI. Cause: This is because the following events happen in this order: (1) The background thread that checks for logs starts, but realizes the server has not yet bound and so waits for N seconds, (2) server binds, (3) N seconds later the background thread finds that the server has finally bound to a port, and so finally checks for application logs. Fix: This PR forces the log checking thread to start immediately after binding. It also documents two relevant environment variables that are currently missing. Author: Andrew Or <andrewor14@gmail.com> Closes #441 from andrewor14/history-server-fix and squashes the following commits: b2eb46e [Andrew Or] Document SPARK_PUBLIC_DNS and SPARK_HISTORY_OPTS for the history server e8d1fbc [Andrew Or] Eliminate delay between binding and checking for logs	2014-04-22 14:27:49 -07:00
Xiangrui Meng	26d35f3fd9	[SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0 Preview: http://54.82.240.23:4000/mllib-guide.html Table of contents: * Basics * Data types * Summary statistics * Classification and regression * linear support vector machine (SVM) * logistic regression * linear linear squares, Lasso, and ridge regression * decision tree * naive Bayes * Collaborative Filtering * alternating least squares (ALS) * Clustering * k-means * Dimensionality reduction * singular value decomposition (SVD) * principal component analysis (PCA) * Optimization * stochastic gradient descent * limited-memory BFGS (L-BFGS) Author: Xiangrui Meng <meng@databricks.com> Closes #422 from mengxr/mllib-doc and squashes the following commits: 944e3a9 [Xiangrui Meng] merge master f9fda28 [Xiangrui Meng] minor 9474065 [Xiangrui Meng] add alpha to ALS examples 928e630 [Xiangrui Meng] initialization_mode -> initializationMode 5bbff49 [Xiangrui Meng] add imports to labeled point examples c17440d [Xiangrui Meng] fix python nb example 28f40dc [Xiangrui Meng] remove localhost:4000 369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc 7dc95cc [Xiangrui Meng] update linear methods 053ad8a [Xiangrui Meng] add links to go back to the main page abbbf7e [Xiangrui Meng] update ALS argument names 648283e [Xiangrui Meng] level down statistics 14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide 8cd2441 [Xiangrui Meng] minor updates 186ab07 [Xiangrui Meng] update section names 6568d65 [Xiangrui Meng] update toc, level up lr and svm 162ee12 [Xiangrui Meng] rename section names 5c1e1b1 [Xiangrui Meng] minor 8aeaba1 [Xiangrui Meng] wrap long lines 6ce6a6f [Xiangrui Meng] add summary statistics to toc 5760045 [Xiangrui Meng] claim beta cc604bf [Xiangrui Meng] remove classification and regression 92747b3 [Xiangrui Meng] make section titles consistent e605dd6 [Xiangrui Meng] add LIBSVM loader f639674 [Xiangrui Meng] add python section to migration guide c82ffb4 [Xiangrui Meng] clean optimization 31660eb [Xiangrui Meng] update linear algebra and stat 0a40837 [Xiangrui Meng] first pass over linear methods 1fc8271 [Xiangrui Meng] update toc 906ed0a [Xiangrui Meng] add a python example to naive bayes 5f0a700 [Xiangrui Meng] update collaborative filtering 656d416 [Xiangrui Meng] update mllib-clustering 86e143a [Xiangrui Meng] remove data types section from main page 8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples d1b5cbf [Xiangrui Meng] merge master 72e4804 [Xiangrui Meng] one pass over tree guide 64f8995 [Xiangrui Meng] move decision tree guide to a separate file 9fca001 [Xiangrui Meng] add first version of linear algebra guide 53c9552 [Xiangrui Meng] update dependencies f316ec2 [Xiangrui Meng] add migration guide f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction 182460f [Xiangrui Meng] add guide for naive Bayes 137fd1d [Xiangrui Meng] re-organize toc a61e434 [Xiangrui Meng] update mllib's toc	2014-04-22 11:20:47 -07:00
Tor Myklebust	bf9d49b6d1	[SPARK-1281] Improve partitioning in ALS ALS was using HashPartitioner and explicit uses of `%` together. Further, the naked use of `%` meant that, if the number of partitions corresponded with the stride of arithmetic progressions appearing in user and product ids, users and products could be mapped into buckets in an unfair or unwise way. This pull request: 1) Makes the Partitioner an instance variable of ALS. 2) Replaces the direct uses of `%` with calls to a Partitioner. 3) Defines an anonymous Partitioner that scrambles the bits of the object's hashCode before reducing to the number of present buckets. This pull request does not make the partitioner user-configurable. I'm not all that happy about the way I did (1). It introduces an icky lifetime issue and dances around it by nulling something. However, I don't know a better way to make the partitioner visible everywhere it needs to be visible. Author: Tor Myklebust <tmyklebu@gmail.com> Closes #407 from tmyklebu/master and squashes the following commits: dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go. 23d6f91 [Tor Myklebust] Stop making the partitioner configurable. 495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark 674933a [Tor Myklebust] Fix style. 40edc23 [Tor Myklebust] Fix missing space. f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach. 5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'. 36a0f43 [Tor Myklebust] Make the partitioner private. d872b09 [Tor Myklebust] Add negative id ALS test. df27697 [Tor Myklebust] Support custom partitioners. Currently we use the same partitioner for users and products. c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing. c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.	2014-04-22 11:07:30 -07:00
Xusen Yin	c919798f09	fix bugs of dot in python If there are no `transpose()` in `self.theta`, a ValueError: matrices are not aligned is occurring. The former test case just ignore this situation. Author: Xusen Yin <yinxusen@gmail.com> Closes #463 from yinxusen/python-naive-bayes and squashes the following commits: fcbe3bc [Xusen Yin] fix bugs of dot in python	2014-04-22 11:06:18 -07:00
Ahir Reddy	0f87e6ad43	[SPARK-1560]: Updated Pyrolite Dependency to be Java 6 compatible Changed the Pyrolite dependency to a build which targets Java 6. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #479 from ahirreddy/java6-pyrolite and squashes the following commits: 8ea25d3 [Ahir Reddy] Updated maven build to use java 6 compatible pyrolite dabc703 [Ahir Reddy] Updated Pyrolite dependency to be Java 6 compatible	2014-04-22 09:44:41 -07:00
CodingCat	87de29084e	[HOTFIX] SPARK-1399: remove outdated comments as the original PR was merged before this mistake is found....fix here, Sorry about that @pwendell, @andrewor14, I will be more careful next time Author: CodingCat <zhunansjtu@gmail.com> Closes #474 from CodingCat/hotfix_1399 and squashes the following commits: f3a8ba9 [CodingCat] move outdated comments	2014-04-22 09:43:13 -07:00
Patrick Wendell	83084d3b7b	SPARK-1496: Have jarOfClass return Option[String] A simple change, mostly had to change a bunch of example code. Author: Patrick Wendell <pwendell@gmail.com> Closes #438 from pwendell/jar-of-class and squashes the following commits: aa010ff [Patrick Wendell] SPARK-1496: Have jarOfClass return Option[String]	2014-04-22 00:42:16 -07:00
Marcelo Vanzin	ac164b79d1	[SPARK-1459] Use local path (and not complete URL) when opening local lo... ...g file. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #375 from vanzin/event-file and squashes the following commits: f673029 [Marcelo Vanzin] [SPARK-1459] Use local path (and not complete URL) when opening local log file.	2014-04-21 23:10:53 -07:00

1 2 3 4 5 ...

6776 commits