Commit graph

6792 commits

Patrick Wendell aa9a7f5db7 SPARK-1606: Infer user application arguments instead of requiring --arg.
This modifies spark-submit to do something more like the Hadoop `jar`
command. Now we have the following syntax:

./bin/spark-submit [options] user.jar [user options]
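For example, a hypothetical invocation:

./bin/spark-submit --master local[4] --class org.example.MyApp my-app.jar input.txt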

Author: Patrick Wendell <pwendell@gmail.com>

Closes #563 from pwendell/spark-submit and squashes the following commits:

32241fc [Patrick Wendell] Review feedback
3adfb69 [Patrick Wendell] Small fix
bc48139 [Patrick Wendell] SPARK-1606: Infer user application arguments instead of requiring --arg.
2014-04-26 19:24:29 -07:00
Sandeep 762af4e9c2 SPARK-1467: Make StorageLevel.apply() factory methods Developer APIs
We may want to evolve these in the future to add things like SSDs, so let's mark them as experimental for now. Long-term the right solution might be some kind of builder. The stable API should be the existing StorageLevel constants.
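For reference, a minimal sketch of the two API levels (the apply() argument order here is an assumption from this era of the code base):

```scala
import org.apache.spark.storage.StorageLevel

// Stable API: the predefined constants.
val stable = StorageLevel.MEMORY_AND_DISK_SER

// Developer API: the apply() factory, now marked unstable.
// Assumed args: useDisk, useMemory, deserialized, replication.
val custom = StorageLevel(false, true, true, 1)  // equivalent to MEMORY_ONLY
```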

Author: Sandeep <sandeep@techaddict.me>

Closes #551 from techaddict/SPARK-1467 and squashes the following commits:

6bdda24 [Sandeep] SPARK-1467: Make StorageLevel.apply() factory methods as Developer Api's We may want to evolve these in the future to add things like SSDs, so let's mark them as experimental for now. Long-term the right solution might be some kind of builder. The stable API should be the existing StorageLevel constants.
2014-04-26 19:04:33 -07:00
Takuya UESHIN 8e37ed6eb8 [SPARK-1608] [SQL] Fix Cast.nullable when cast from StringType to NumericType/TimestampType.
`Cast.nullable` should be `true` when casting from `StringType` to `NumericType` or `TimestampType`, because if the `StringType` expression holds an illegal number string or an illegal timestamp string, the cast value becomes `null`.
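For illustration, a minimal sketch of the behavior being accounted for:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.catalyst.types.IntegerType

// An illegal number string cannot be converted, so the cast evaluates to
// null -- hence Cast.nullable must be true for String -> Numeric.
Cast(Literal("not a number"), IntegerType).eval()  // null
```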

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #532 from ueshin/issues/SPARK-1608 and squashes the following commits:

065d37c [Takuya UESHIN] Add tests to check nullabilities of cast expressions.
f278ed7 [Takuya UESHIN] Revert test to keep it readable and concise.
9fc9380 [Takuya UESHIN] Fix Cast.nullable when cast from StringType to NumericType/TimestampType.
2014-04-26 14:39:54 -07:00
wangfei e6e44e46e3 add note of how to support table with more than 22 fields
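The limitation comes from Scala 2.10, where case classes may have at most 22 fields; the documented workaround is a custom class implementing the `Product` interface. A sketch with the field count abbreviated:

```scala
// A record with more than 22 fields cannot be a Scala 2.10 case class,
// but a plain class implementing Product still works with Spark SQL.
class BigRecord(val f1: String, val f2: Int /* ... through f23 and beyond */)
    extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[BigRecord]
  def productArity: Int = 2  // 23+ in the real case
  def productElement(n: Int): Any = n match {
    case 0 => f1
    case 1 => f2
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}
```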
Author: wangfei <wangfei1@huawei.com>

Closes #564 from scwf/patch-6 and squashes the following commits:

a331876 [wangfei] Update sql-programming-guide.md
685135b [wangfei] Update sql-programming-guide.md
10b3dc0 [wangfei] Update sql-programming-guide.md
1c40480 [wangfei] add note of how to support table with 22 fields
2014-04-26 14:38:42 -07:00
zsxwing 058797c172 [Spark-1382] Fix NPE in DStream.slice (updated version of #365)
@zsxwing I cherry-picked your changes and merged the master. #365 had some conflicts once again!

Author: zsxwing <zsxwing@gmail.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #562 from tdas/SPARK-1382 and squashes the following commits:

e2962c1 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-1382
20968d9 [zsxwing] Replace Exception with SparkException in DStream
e476651 [zsxwing] Merge remote-tracking branch 'origin/master' into SPARK-1382
35ba56a [zsxwing] SPARK-1382: Fix NPE in DStream.slice
2014-04-25 19:04:34 -07:00
Sandy Ryza 87cf35c2d6 SPARK-1632. Remove unnecessary boxing in compares in ExternalAppendOnlyMap
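A hypothetical sketch of the kind of change this describes (not the actual ExternalAppendOnlyMap code): hoist the primitive hash code out of the loop so each comparison works on an unboxed Int.

```scala
// Compute the hash once as a primitive Int and compare Ints directly,
// rather than re-boxing/recomputing it per slot.
def findSlot(keys: Array[AnyRef], key: AnyRef): Int = {
  val keyHash = key.hashCode  // computed once
  var i = 0
  while (i < keys.length) {
    val k = keys(i)
    if (k != null && k.hashCode == keyHash && k == key) return i
    i += 1
  }
  -1
}
```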

Author: Sandy Ryza <sandy@cloudera.com>

Closes #559 from sryza/sandy-spark-1632 and squashes the following commits:

a6cd352 [Sandy Ryza] Only compute hashes once
04e3884 [Sandy Ryza] SPARK-1632. Remove unnecessary boxing in compares in ExternalAppendOnlyMap
2014-04-25 17:55:04 -07:00
CodingCat 027f1b85f9 SPARK-1235: manage the DAGScheduler EventProcessActor with supervisor and refactor the DAGScheduler with Akka
https://spark-project.atlassian.net/browse/SPARK-1235

In the current implementation, a running job will hang if the DAGScheduler crashes for some reason (e.g. eventProcessActor throws an exception in receive()).

The reason is that the actor automatically restarts when an exception is thrown during processing but is not captured properly (Akka behaviour), and the JobWaiters keep waiting for the completion of the tasks.

In this patch, I refactored the DAGScheduler with Akka and manage the eventProcessActor with a supervisor, so that upon the failure of an eventProcessActor, the supervisor terminates the EventProcessActor and closes the SparkContext.

Thanks to @kayousterhout and @markhamstra for the hints in JIRA.
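A hedged sketch of the supervision wiring described above (class and callback names hypothetical), using the Akka APIs of this era:

```scala
import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}

class EventProcessActor extends Actor {
  def receive = {
    case event => println(s"processing $event") // stand-in for DAGScheduler event handling
  }
}

class EventActorSupervisor(onFailure: () => Unit) extends Actor {
  // On any failure of a child, run the cleanup (e.g. stop the SparkContext)
  // and terminate the failed actor instead of letting Akka restart it.
  override val supervisorStrategy = OneForOneStrategy() {
    case _: Exception =>
      onFailure()
      SupervisorStrategy.Stop
  }
  def receive = {
    case props: Props => sender ! context.actorOf(props) // create supervised children
  }
}
```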

Author: CodingCat <zhunansjtu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>

Closes #186 from CodingCat/SPARK-1235 and squashes the following commits:

a7fb0ee [CodingCat] throw Exception on failure of creating DAG
124d82d [CodingCat] blocking the constructor until event actor is ready
baf2d38 [CodingCat] fix the issue brought by non-blocking actorOf
35c886a [CodingCat] fix bug
82d08b3 [CodingCat] calling actorOf on system to ensure it is blocking
310a579 [CodingCat] style fix
cd02d9a [Nan Zhu] small fix
561cfbc [CodingCat] recover doCheckpoint
c048d0e [CodingCat] call submitWaitingStages for every event
a9eea039 [CodingCat] address Matei's comments
ac878ab [CodingCat] typo fix
5d1636a [CodingCat] re-trigger the test.....
9dfb033 [CodingCat] remove unnecessary changes
a7a2a97 [CodingCat] add StageCancelled message
fdf3b17 [CodingCat] just to retrigger the test......
089bc2f [CodingCat] address andrew's comments
228f4b0 [CodingCat] address comments from Mark
b68c1c7 [CodingCat] refactor DAGScheduler with Akka
810efd8 [Xiangrui Meng] akka solution
2014-04-25 16:04:48 -07:00
Sean Owen df6d81425b SPARK-1607. HOTFIX: Fix syntax adapting Int result to Short
Sorry folks. This should make the change for SPARK-1607 compile again. Verified this time with the yarn build enabled.

Author: Sean Owen <sowen@cloudera.com>

Closes #556 from srowen/SPARK-1607.2 and squashes the following commits:

e3fe7a3 [Sean Owen] Fix syntax adapting Int result to Short
2014-04-25 14:17:38 -07:00
baishuo(白硕) 8aaef5c756 Update KafkaWordCount.scala
Modify the required number of args.

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #523 from baishuo/master and squashes the following commits:

0368ba9 [baishuo(白硕)] Update KafkaWordCount.scala
2014-04-25 13:18:49 -07:00
WangTao 25a276dd21 Delete the val that never used
It seems that the vals "startTime" and "endTime" are never used, so delete them.

Author: WangTao <barneystinson@aliyun.com>

Closes #553 from WangTaoTheTonic/master and squashes the following commits:

4fcb639 [WangTao] Delete the val that never used
2014-04-25 11:47:01 -07:00
Matei Zaharia a24d918c71 SPARK-1621 Upgrade Chill to 0.3.6
It registers more Scala classes, including things like Ranges that we had to register manually before. See https://github.com/twitter/chill/releases for Chill's change log.

Author: Matei Zaharia <matei@databricks.com>

Closes #543 from mateiz/chill-0.3.6 and squashes the following commits:

a1dc5e0 [Matei Zaharia] Upgrade Chill to 0.3.6 and remove our special registration of Ranges
2014-04-25 11:12:41 -07:00
Patrick Wendell dc3b640a0a SPARK-1619 Launch spark-shell with spark-submit
This simplifies the shell a bunch and passes all arguments through to spark-submit.

There is a tiny incompatibility with 0.9.1: to set the number of cores you can no longer pass `-c`, only `--cores`. However, spark-submit gives a good error message in this case, I don't think many people used `-c`, and it's a trivial change for users.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #542 from pwendell/spark-shell and squashes the following commits:

9eb3e6f [Patrick Wendell] Updating Spark docs
b552459 [Patrick Wendell] Andrew's feedback
97720fa [Patrick Wendell] Review feedback
aa2900b [Patrick Wendell] SPARK-1619 Launch spark-shell with spark-submit
2014-04-24 23:59:16 -07:00
Sean Owen 6e101f1183 SPARK-1607. Replace octal literals, removed in Scala 2.11, with hex literals
Octal literals like "0700" are deprecated in Scala 2.10, generating a warning. They have been removed entirely in 2.11. See https://issues.scala-lang.org/browse/SI-7618

This change simply replaces two uses of octals with hex literals, which seemed the next-best representation since they express a bit mask (file permissions in particular).
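For reference, the equivalent representations (the squashed commits show the final version settled on `Integer.parseInt(..., 8)` rather than hex):

```scala
// 0700 in octal (owner rwx) == 448 decimal == 0x1C0 hex
val permHex   = 0x1C0
val permOctal = Integer.parseInt("700", 8)  // what the commit settled on
assert(permHex == permOctal)
```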

Author: Sean Owen <sowen@cloudera.com>

Closes #529 from srowen/SPARK-1607 and squashes the following commits:

1ee0e67 [Sean Owen] Use Integer.parseInt(...,8) for octal literal instead of hex equivalent
0102f3d [Sean Owen] Replace octal literals, removed in Scala 2.11, with hex literals
2014-04-24 23:34:00 -07:00
Aaron Davidson 45ad7f0ca7 Call correct stop().
Oopsie in #504.

Author: Aaron Davidson <aaron@databricks.com>

Closes #527 from aarondav/stop and squashes the following commits:

8d1446a [Aaron Davidson] Call correct stop().
2014-04-24 23:22:03 -07:00
Holden Karau e03bc379ee SPARK-1242 Add aggregate to python rdd
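For illustration, the Scala `RDD.aggregate` this mirrors, computing a mean via a (sum, count) accumulator (assumes an existing SparkContext `sc`):

```scala
val rdd = sc.parallelize(1 to 100)
val (sum, count) = rdd.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value per partition
  (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge partition results
val mean = sum.toDouble / count           // 50.5
```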
Author: Holden Karau <holden@pigscanfly.ca>

Closes #139 from holdenk/add_aggregate_to_python_api and squashes the following commits:

0f39ae3 [Holden Karau] Merge in master
4879c75 [Holden Karau] CR feedback, fix issue with empty RDDs in aggregate
70b4724 [Holden Karau] Style fixes from code review
96b047b [Holden Karau] Add aggregate to python rdd
2014-04-24 23:07:54 -07:00
Sandeep 095b518253 Fix [SPARK-1078]: Remove the Unnecessary lift-json dependency
Remove the Unnecessary lift-json dependency from pom.xml

Author: Sandeep <sandeep@techaddict.me>

Closes #536 from techaddict/FIX-SPARK-1078 and squashes the following commits:

bd0fd1d [Sandeep] Fix [SPARK-1078]: Replace lift-json with json4s-jackson. Remove the Unnecessary lift-json dependency from pom.xml
2014-04-24 21:51:52 -07:00
Andrew Or 06e82d94b6 [Typo] In the maven docs: chd -> cdh
Author: Andrew Or <andrewor14@gmail.com>

Closes #548 from andrewor14/doc-typo and squashes the following commits:

3eaf4c4 [Andrew Or] chd -> cdh
2014-04-24 21:51:17 -07:00
Michael Armbrust 86ff8b1027 Generalize pattern for planning hash joins.
This will be helpful for [SPARK-1495](https://issues.apache.org/jira/browse/SPARK-1495) and other cases where we want to have custom hash join implementations but don't want to repeat the logic for finding the join keys.

Author: Michael Armbrust <michael@databricks.com>

Closes #418 from marmbrus/hashFilter and squashes the following commits:

d5cc79b [Michael Armbrust] Address @rxin 's comments.
366b6d9 [Michael Armbrust] style fixes
14560eb [Michael Armbrust] Generalize pattern for planning hash joins.
f4809c1 [Michael Armbrust] Move common functions to PredicateHelper.
2014-04-24 21:42:33 -07:00
Tathagata Das cd12dd9bde [SPARK-1617] and [SPARK-1618] Improvements to streaming ui and bug fix to socket receiver
1617: These changes expose the receiver state (active or inactive) and last error in the UI
1618: If the socket receiver cannot connect on the first attempt, it should retry after a delay. That was broken, as the thread that restarts (and hence stops) the receiver waited on Thread.join on itself!

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #540 from tdas/streaming-ui-fix and squashes the following commits:

e469434 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-ui-fix
dbddf75 [Tathagata Das] Style fix.
66df1a5 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-ui-fix
ad98bc9 [Tathagata Das] Refactored streaming listener to use ReceiverInfo.
d7f849c [Tathagata Das] Revert "Moved BatchInfo from streaming.scheduler to streaming.ui"
5c80919 [Tathagata Das] Moved BatchInfo from streaming.scheduler to streaming.ui
da244f6 [Tathagata Das] Fixed socket receiver as well as made receiver state and error visible in the streaming UI.
2014-04-24 21:34:37 -07:00
Mridul Muralidharan 968c0187a1 SPARK-1586 Windows build fixes
Unfortunately, this is not exhaustive: in particular, Hive tests still fail due to path issues.

Author: Mridul Muralidharan <mridulm80@apache.org>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #505 from mridulm/windows_fixes and squashes the following commits:

ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy apparently
cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
3267f4b [Mridul Muralidharan] Fix build failures
35b277a [Mridul Muralidharan] Fix Scalastyle failures
bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
1337abd [Mridul Muralidharan] fix classpath while running in windows
2014-04-24 20:48:33 -07:00
tmalaska d5c6ae6cc3 SPARK-1584: Upgrade Flume dependency to 1.4.0
Updated the Flume dependency in the maven pom file and the scala build file.

Author: tmalaska <ted.malaska@cloudera.com>

Closes #507 from tmalaska/master and squashes the following commits:

79492c8 [tmalaska] excluded all thrift
159c3f1 [tmalaska] fixed the flume pom file issues
5bf56a7 [tmalaska] Upgrade flume version
2014-04-24 20:31:17 -07:00
Ahir Reddy e53eb4f015 [SPARK-986]: Job cancelation for PySpark
* Additions to the PySpark API to cancel jobs
* Monitor Thread in PythonRDD to kill Python workers if a task is interrupted

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #541 from ahirreddy/python-cancel and squashes the following commits:

dfdf447 [Ahir Reddy] Changed success -> completed and made logging message clearer
6c860ab [Ahir Reddy] PR Comments
4b4100a [Ahir Reddy] Success flag
adba6ed [Ahir Reddy] Destroy python workers
27a2f8f [Ahir Reddy] Start the writer thread...
d422f7b [Ahir Reddy] Remove unnecessary vals
adda337 [Ahir Reddy] Busy wait on the context.interrupted flag, and then kill the python worker
d9e472f [Ahir Reddy] Revert "removed unnecessary vals"
5b9cae5 [Ahir Reddy] removed unnecessary vals
07b54d9 [Ahir Reddy] Fix canceling unit test
8ae9681 [Ahir Reddy] Don't interrupt worker
7722342 [Ahir Reddy] Monitor Thread for python workers
db04e16 [Ahir Reddy] Added canceling api to PySpark
2014-04-24 20:21:10 -07:00
Andrew Or ee6f7e22a4 [SPARK-1615] Synchronize accesses to the LiveListenerBus' event queue
Original poster is @zsxwing, who reported this bug in #516.

Much of SparkListenerSuite relies on LiveListenerBus's `waitUntilEmpty()` method. As the name suggests, this waits until the event queue is empty. However, the following race condition could happen:

(1) We dequeue an event
(2) The queue is empty, we return true (even though the event has not been processed)
(3) The test asserts something assuming that all listeners have finished executing (and fails)
(4) The listeners receive and process the event

This PR makes (1) and (4) atomic by synchronizing around it. To do that, however, we must avoid using `eventQueue.take`, which is blocking and will cause a deadlock if we synchronize around it. As a workaround, we use the non-blocking `eventQueue.poll` + a semaphore to provide the same semantics.

This has been a possible race condition for a long time, but for some reason we've never run into it.
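A hedged sketch of the poll-plus-semaphore pattern (names hypothetical, not the actual LiveListenerBus code):

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Semaphore}

class ToyListenerBus[E](process: E => Unit) {
  private val queue = new ConcurrentLinkedQueue[E]
  private val eventLock = new Semaphore(0)

  def post(event: E): Unit = {
    queue.offer(event)
    eventLock.release() // signal: one more event is available
  }

  new Thread("bus consumer") {
    override def run(): Unit = while (true) {
      eventLock.acquire() // block here, *outside* the lock, so no deadlock
      ToyListenerBus.this.synchronized {
        process(queue.poll()) // dequeue + dispatch form one atomic step
      }
    }
  }.start()

  def waitUntilEmpty(): Unit = {
    // Seeing the queue empty under the lock implies the last dequeued event
    // was also fully processed, because dispatch holds the same lock.
    while (ToyListenerBus.this.synchronized(!queue.isEmpty)) Thread.sleep(10)
  }
}
```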

Author: Andrew Or <andrewor14@gmail.com>

Closes #544 from andrewor14/stage-info-test-fix and squashes the following commits:

3cbe40c [Andrew Or] Merge github.com:apache/spark into stage-info-test-fix
56dbbcb [Andrew Or] Check if event is actually added before releasing semaphore
eb486ae [Andrew Or] Synchronize accesses to the LiveListenerBus' event queue
2014-04-24 20:18:15 -07:00
jerryshao 80429f3e2a [SPARK-1510] Spark Streaming metrics source for metrics system
This pulls in changes made by @jerryshao in https://github.com/apache/spark/pull/424 and merges with the master.

Author: jerryshao <saisai.shao@intel.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #545 from tdas/streaming-metrics and squashes the following commits:

034b443 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-metrics
fb3b0a5 [jerryshao] Modify according master update
21939f5 [jerryshao] Style changes according to style check error
976116b [jerryshao] Add StreamSource in StreamingContext for better monitoring through metrics system
2014-04-24 18:56:57 -07:00
Thomas Graves 44da5ab2de Spark 1489 Fix the HistoryServer view acls
This allows the view acls set by the user to be enforced by the history server. It also fixes the filters so they are applied properly.

Author: Thomas Graves <tgraves@apache.org>

Closes #509 from tgravescs/SPARK-1489 and squashes the following commits:

869c186 [Thomas Graves] change to either acls enabled or disabled
0d8333c [Thomas Graves] Add history ui policy to allow acls to either use application set, history server force acls on, or off
65148b5 [Thomas Graves] SPARK-1489 Fix the HistoryServer view acls
2014-04-24 18:38:10 -07:00
Michael Armbrust 4660991e67 [SQL] Add support for parsing indexing into arrays in SQL.
Author: Michael Armbrust <michael@databricks.com>

Closes #518 from marmbrus/parseArrayIndex and squashes the following commits:

afd2d6b [Michael Armbrust] 100 chars
c3d6026 [Michael Armbrust] Add support for parsing indexing into arrays in SQL.
2014-04-24 18:21:00 -07:00
Tathagata Das 526a518bf3 [SPARK-1592][streaming] Automatically remove streaming input blocks
The raw input data is stored as blocks in BlockManagers. Earlier they were cleared by the cleaner TTL. Now, since streaming does not require the cleaner TTL to be set, the blocks would not get cleared. This drives up Spark's memory usage, which is not even accounted for or shown in the Spark storage UI. It may cause the data blocks to spill over to disk, which eventually slows down the receiving of data (persisting to memory becomes bottlenecked by writing to disk).

The solution in this PR is to automatically remove those blocks. The mechanism to keep track of which BlockRDDs (which present the raw data blocks as an RDD) can be safely cleared already exists; just use it to explicitly remove blocks from BlockRDDs.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #512 from tdas/block-rdd-unpersist and squashes the following commits:

d25e610 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist
5f46d69 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist
2c320cd [Tathagata Das] Updated configuration with spark.streaming.unpersist setting.
2d4b2fd [Tathagata Das] Automatically removed input blocks
2014-04-24 18:18:22 -07:00
Arun Ramakrishnan 35e3d199f0 SPARK-1438 RDD.sample() make seed param optional
Copied from previous pull request https://github.com/apache/spark/pull/462.

It's probably better to let the underlying language implementation take care of the default. This was easier to do in Python, as the default value for seed in random and numpy.random is None.

On the Scala/Java side it would mean propagating an Option or null (oh no!) down the chain until where the Random is constructed. But it looks like the convention in some other methods was to use System.nanoTime, so I followed that convention.

This conflicted with the overloaded method sql.SchemaRDD.sample, which also defines default params:
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded alternative to have default params. I believe the author intended to override the RDD.sample method, not overload it, so I changed it.

If backward compatibility is important, 3 new methods can be introduced (without default params) like this:
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the Scala RDD takeSample method.
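For context, a minimal demonstration of the Scala restriction mentioned above: only one overloaded alternative of a method may define default arguments.

```scala
class Demo {
  def sample(fraction: Double, seed: Long = 1L): Unit = ()
  // Un-commenting the overload below fails to compile:
  // "multiple overloaded alternatives of method sample define default arguments"
  // def sample(fraction: Double, withReplacement: Boolean, seed: Long = 1L): Unit = ()
}
```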

Author: Arun Ramakrishnan <smartnut007@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #477 from smartnut007/master and squashes the following commits:

07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample
2014-04-24 17:27:16 -07:00
CodingCat f99af8529b SPARK-1104: kill Process in workerThread of ExecutorRunner
As reported in https://spark-project.atlassian.net/browse/SPARK-1104

By @pwendell: "Sometimes due to large shuffles executors will take a long time shutting down. In particular this can happen if large numbers of shuffle files are around (this will be alleviated by SPARK-1103, but nonetheless...).

The symptom is you have DEAD workers sitting around in the UI and the existing workers keep trying to re-register but can't because they've been assumed dead."

In this patch, I add lines to the handler of InterruptedException in the workerThread of ExecutorRunner, so that process.destroy() and process.waitFor() only block the workerThread instead of blocking the Worker actor...

---------

Analysis: process.destroy() is a blocking method, i.e. it only returns when all shutdown-hook threads return, so calling it in the Worker thread would make the Worker block for a long while.

about what will happen on the shutdown hooks when the JVM process is killed: http://www.tutorialspoint.com/java/lang/runtime_addshutdownhook.htm
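A hedged sketch of the handler described above (the command is a placeholder): the Worker actor only interrupts; the blocking destroy/waitFor happens on the worker thread itself.

```scala
val workerThread = new Thread("ExecutorRunner worker") {
  override def run(): Unit = {
    val process = new ProcessBuilder("sleep", "60").start()
    try {
      process.waitFor()
    } catch {
      case _: InterruptedException =>
        process.destroy() // blocking, but it only blocks this thread
        process.waitFor()
    }
  }
}
workerThread.start()
workerThread.interrupt() // the Worker actor returns immediately
```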

Author: CodingCat <zhunansjtu@gmail.com>

Closes #35 from CodingCat/SPARK-1104 and squashes the following commits:

85767da [CodingCat] add null checking and remove unnecessary killProce
3107aeb [CodingCat] address Aaron's comments
eb615ba [CodingCat] kill the process when the error happens
0accf2f [CodingCat] set process to null after killed it
1d511c8 [CodingCat] kill Process in workerThread
2014-04-24 15:55:18 -07:00
Sandeep a03ac222d8 Fix Scala Style
Any comments are welcome

Author: Sandeep <sandeep@techaddict.me>

Closes #531 from techaddict/stylefix-1 and squashes the following commits:

7492730 [Sandeep] Pass 4
98b2428 [Sandeep] fix rxin suggestions
b5e2e6f [Sandeep] Pass 3
05932d7 [Sandeep] fix if else styling 2
08690e5 [Sandeep] fix if else styling
2014-04-24 15:07:23 -07:00
Michael Armbrust c5c1916dd1 SPARK-1494 Don't initialize classes loaded by MIMA excludes, attempt 2
[WIP]

Looks like Scala reflection was invoking the static initializer:
```
...
	at org.apache.spark.sql.test.TestSQLContext$.<init>(TestSQLContext.scala:25)
	at org.apache.spark.sql.test.TestSQLContext$.<clinit>(TestSQLContext.scala)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:500)
	at scala.reflect.runtime.JavaMirrors$JavaMirror.tryJavaClass(JavaMirrors.scala:505)
	at scala.reflect.runtime.SymbolLoaders$PackageScope.lookupEntry(SymbolLoaders.scala:109)
...
```

Need to make sure that this doesn't change the exclusion semantics before merging.
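One standard JVM-level way to avoid eager initialization (whether the patch does exactly this is an assumption) is the three-argument `Class.forName`:

```scala
// initialize = false defers <clinit> until the class is actually used
val cls = Class.forName(
  "org.apache.spark.sql.test.TestSQLContext",
  false,
  Thread.currentThread().getContextClassLoader)
```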

Author: Michael Armbrust <michael@databricks.com>

Closes #526 from marmbrus/mima and squashes the following commits:

8168dea [Michael Armbrust] Spurious change
afba262 [Michael Armbrust] Prevent Scala reflection from running static class initializer.
2014-04-24 14:54:23 -07:00
Thomas Graves bd375094a1 Spark 1490 Add kerberos support to the HistoryServer
Here I've added the ability for the History Server to log in from a kerberos keytab file so that the history server can be run as a super user and stay up for a long period of time while reading the history files from HDFS.

Author: Thomas Graves <tgraves@apache.org>

Closes #513 from tgravescs/SPARK-1490 and squashes the following commits:

e204a99 [Thomas Graves] remove extra logging
5418daa [Thomas Graves] fix typo in config
0076b99 [Thomas Graves] Update docs
4d76545 [Thomas Graves] SPARK-1490 Add kerberos support to the HistoryServer
2014-04-24 11:16:30 -07:00
zsxwing 78a49b2532 SPARK-1611: Fix incorrect initialization order in AppendOnlyMap
JIRA: https://issues.apache.org/jira/browse/SPARK-1611
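For context, a hypothetical sketch of this class of bug (not the actual AppendOnlyMap code): Scala vals initialize in declaration order, so a val that reads a later-declared val sees its zero value.

```scala
class Sketch {
  val capacity = 2 * growThreshold // growThreshold is still 0 here
  val growThreshold = 8
}
println(new Sketch().capacity) // prints 0, not 16
```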

Author: zsxwing <zsxwing@gmail.com>

Closes #534 from zsxwing/SPARK-1611 and squashes the following commits:

96af089 [zsxwing] SPARK-1611: Fix incorrect initialization order in AppendOnlyMap
2014-04-24 11:13:40 -07:00
Sean Owen 6338a93f10 SPARK-1488. Squash more language feature warnings in new commits by importing implicitConversion
A recent commit reintroduced some of the same warnings that SPARK-1488 resolved. These are just a few more of the same changes to remove these warnings.
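The changes amount to adding the feature import at each offending use site, e.g.:

```scala
// silences the implicit-conversion language-feature warning
import scala.language.implicitConversions

implicit def intToString(i: Int): String = i.toString
```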

Author: Sean Owen <sowen@cloudera.com>

Closes #528 from srowen/SPARK-1488.2 and squashes the following commits:

62d592c [Sean Owen] More feature warnings in tests
4e2e94b [Sean Owen] Squash more language feature warnings in new commits by importing implicitConversion
2014-04-24 10:06:18 -07:00
Patrick Wendell faeb761cbe Small changes to release script 2014-04-24 10:00:34 -07:00
Takuya UESHIN 27b2821cf1 [SPARK-1610] [SQL] Fix Cast to use exact type value when cast from BooleanType to NumericType.

`Cast` from `BooleanType` to `NumericType` always uses an `Int` value, but this causes a `ClassCastException` when the cast value is used in a subsequent evaluation, like the code below:

``` scala
scala> import org.apache.spark.sql.catalyst._
import org.apache.spark.sql.catalyst._

scala> import types._
import types._

scala> import expressions._
import expressions._

scala> Add(Cast(Literal(true), ShortType), Literal(1.toShort)).eval()
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
	at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102)
	at scala.math.Numeric$ShortIsIntegral$.plus(Numeric.scala:72)
	at org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58)
	at org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58)
	at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114)
	at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58)
	at .<init>(<console>:17)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
	at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
	at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
	at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
	at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
	at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
	at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
	at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
	at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
	at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
	at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
	at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
```

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #533 from ueshin/issues/SPARK-1610 and squashes the following commits:

70f36e8 [Takuya UESHIN] Fix Cast to use exact type value when cast from BooleanType to NumericType.
2014-04-24 09:57:28 -07:00
Reynold Xin 1fdf659d2f SPARK-1601 & SPARK-1602: two bug fixes related to cancellation
This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

1. Do not put partially executed partitions into cache (in task killing).

2. The iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, which led to uninterruptible jobs.

Thanks @aarondav and @ahirreddy for reporting and helping debug.
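For fix (2), a hedged sketch of the wrapping described (helper name hypothetical):

```scala
import org.apache.spark.{InterruptibleIterator, TaskContext}

// Wrap what the cache hands back so a killed task stops consuming it:
// InterruptibleIterator checks the task's kill flag as it iterates.
def interruptibly[T](context: TaskContext, values: Iterator[T]): Iterator[T] =
  new InterruptibleIterator[T](context, values)
```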

Author: Reynold Xin <rxin@apache.org>

Closes #521 from rxin/kill and squashes the following commits:

401033f [Reynold Xin] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into kill
7a7bdd2 [Reynold Xin] Add a new line in the end of JobCancellationSuite.scala.
35cd9f7 [Reynold Xin] Fixed a bug that partially executed partitions can be put into cache (in task killing).
2014-04-24 00:27:45 -07:00
Mridul Muralidharan dd681f502e SPARK-1587 Fix thread leak
mvn test fails (intermittently) due to a thread leak, since ScalaTest runs all tests in the same JVM.

Author: Mridul Muralidharan <mridulm80@apache.org>

Closes #504 from mridulm/resource_leak_fixes and squashes the following commits:

a5d10d0 [Mridul Muralidharan] Prevent thread leaks while running tests : cleanup all threads when SparkContext.stop is invoked. Causes tests to fail
7b5e19c [Mridul Muralidharan] Prevent NPE while running tests
2014-04-23 23:20:55 -07:00
Sandeep bb68f47745 [Fix #79] Replace Breakable For Loops By While Loops
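An illustrative before/after of the transformation (values hypothetical):

```scala
import scala.util.control.Breaks._

val arr = Array(3, 1, 4, 1, 5)

// Before: breakable for loop (implemented with exceptions under the hood).
var found = -1
breakable {
  for (i <- arr.indices) {
    if (arr(i) == 4) { found = i; break() }
  }
}

// After: a plain while loop, no Breaks machinery.
var j = 0
var found2 = -1
while (j < arr.length && found2 < 0) {
  if (arr(j) == 4) found2 = j
  j += 1
}
```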
Author: Sandeep <sandeep@techaddict.me>

Closes #503 from techaddict/fix-79 and squashes the following commits:

e3f6746 [Sandeep] Style changes
07a4f6b [Sandeep] for loop to While loop
0a6d8e9 [Sandeep] Breakable for loop to While loop
2014-04-23 22:47:59 -07:00
zsxwing 6ab7578067 SPARK-1589: Fix the incorrect compare
JIRA: https://issues.apache.org/jira/browse/SPARK-1589

Author: zsxwing <zsxwing@gmail.com>

Closes #508 from zsxwing/SPARK-1589 and squashes the following commits:

570c67a [zsxwing] SPARK-1589: Fix the incorrect compare
2014-04-23 22:36:02 -07:00
Ankur Dave 1d6abe3a4b Mark all fields of EdgePartition, Graph, and GraphOps transient
These classes are only serializable to work around closure capture, so their fields should all be marked `@transient` to avoid wasteful serialization.

This PR supersedes apache/spark#519 and fixes the same bug.
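A minimal sketch of the pattern (class and field names illustrative):

```scala
// Serializable only so closures can capture the class; the payload itself
// should never be shipped, hence @transient on every field.
class EdgePartitionSketch(
    @transient val srcIds: Array[Long],
    @transient val dstIds: Array[Long])
  extends Serializable
```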

Author: Ankur Dave <ankurdave@gmail.com>

Closes #520 from ankurdave/graphx-transient and squashes the following commits:

6431760 [Ankur Dave] Mark all fields of EdgePartition, Graph, and GraphOps `@transient`
2014-04-23 22:01:13 -07:00
Aaron Davidson d485eecb72 Update Java api for setJobGroup with interruptOnCancel
Also adds a unit test.

Author: Aaron Davidson <aaron@databricks.com>

Closes #522 from aarondav/cancel2 and squashes the following commits:

565c253 [Aaron Davidson] Update Java api for setJobGroup with interruptOnCancel
65b33d8 [Aaron Davidson] Add unit test for Thread interruption on cancellation
2014-04-23 22:00:22 -07:00
Andrew Or 4b2bab1d08 [Hot Fix #469] Fix flaky test in SparkListenerSuite
The two modified tests may fail if the race condition does not go in our favor...

Author: Andrew Or <andrewor14@gmail.com>

Closes #516 from andrewor14/stage-info-test-fix and squashes the following commits:

b4b6100 [Andrew Or] Add/replace missing waitUntilEmpty() calls to listener bus
2014-04-23 21:59:33 -07:00
Matei Zaharia 640f9a0efe [SPARK-1540] Add an optional Ordering parameter to PairRDDFunctions.
In https://issues.apache.org/jira/browse/SPARK-1540 we'd like to look at Spark's API to see if we can take advantage of Comparable keys in more places, which will make external spilling more efficient. This PR is a first step towards that: it shows how to pass an Ordering when available and still continue functioning otherwise. It does this using a new implicit parameter with a default value of null.

The API is currently only in Scala -- in Java we'd have to add new versions of mapToPair and such that take a Comparator, or a new method to add a "type hint" to an RDD. We can address those later though.

Unfortunately requiring all keys to be Comparable would not work without requiring RDDs in general to contain only Comparable types. The reason is that methods such as distinct() and intersection() do a shuffle, but should be usable on RDDs of any type. So ordering will have to remain an optimization for the types that can be ordered. I think this isn't a horrible outcome though because one of the nice things about Spark's API is that it works on objects of *any* type, without requiring you to specify a schema or implement Writable or stuff like that.
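A minimal sketch of the implicit-with-null-default trick described above (class name hypothetical):

```scala
// Callers with an Ordering[K] in scope get it implicitly; all other key
// types still compile because the parameter defaults to null.
class PairFunctionsSketch[K, V](self: Seq[(K, V)])(implicit ord: Ordering[K] = null) {
  def maybeSorted: Seq[(K, V)] =
    if (ord != null) self.sortBy(_._1)(ord) else self
}
```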

Author: Matei Zaharia <matei@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@apache.org>

Closes #487 from mateiz/ordered-keys and squashes the following commits:

bd565f6 [Matei Zaharia] Pass an Ordering to only one version of groupBy because the Scala language spec doesn't allow having an optional parameter on all of them (this was only compiling in Scala 2.10 due to a bug).
4629965 [Matei Zaharia] Add tests for other versions of groupBy
3beae85 [Matei Zaharia] Added a test for implicit orderings
80b7a3b [Matei Zaharia] Add an optional Ordering parameter to PairRDDFunctions.
2014-04-23 17:03:54 -07:00
Aaron Davidson 432201c7ee SPARK-1582 Invoke Thread.interrupt() when cancelling jobs
Sometimes executor threads are blocked waiting for IO or monitors, and the current implementation of job cancellation may never recover these threads. By simply invoking Thread.interrupt() during cancellation, we can often safely unblock the threads and use them for subsequent work.

Note that this feature must remain optional for now because of a bug in HDFS where Thread.interrupt() may cause nodes to be marked as permanently dead (as the InterruptedException is reinterpreted as an IOException during communication with some node).
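Usage of the new opt-in flag, assuming an existing SparkContext `sc`:

```scala
sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel = true)
// later, from another thread: cancellation now also interrupts task threads
sc.cancelJobGroup("nightly-etl")
```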

Author: Aaron Davidson <aaron@databricks.com>

Closes #498 from aarondav/cancel and squashes the following commits:

e52b829 [Aaron Davidson] Don't use job.properties when null
82f78bb [Aaron Davidson] Update DAGSchedulerSuite
b67f472 [Aaron Davidson] Add comment on why interruptOnCancel is in setJobGroup
4cb9fd6 [Aaron Davidson] SPARK-1582 Invoke Thread.interrupt() when cancelling jobs
2014-04-23 16:52:49 -07:00
Marcelo Vanzin dd1b7a61d9 Honor default fs name when initializing event logger.
This is related to SPARK-1459 / PR #375. Without this fix,
FileLogger.createLogDir() may try to create the log dir on
HDFS, while createWriter() will try to open the log file on
the local file system, leading to interesting errors and
confusion.
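A hedged sketch of the idea (paths hypothetical): resolve the log directory against its own file system, which honors the configured default, and use that same file system for both the directory and the files.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val logDir = new Path("/tmp/spark-events")
val fs = logDir.getFileSystem(new Configuration()) // HDFS if the default fs says so
fs.mkdirs(logDir)                                  // same fs...
val out = fs.create(new Path(logDir, "app-1234"))  // ...for dir and file
out.close()
```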

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #450 from vanzin/event-file-2 and squashes the following commits:

592cdb3 [Marcelo Vanzin] Honor default fs name when initializing event logger.
2014-04-23 14:47:38 -07:00
Aaron Davidson a967b005c8 SPARK-1572 Don't kill Executor if PythonRDD fails while computing parent
Previously, the behavior was that if the parent RDD threw any exception other than IOException or FileNotFoundException (which is quite possible for Hadoop input sources), the entire Executor would crash, because the thread's default uncaught exception handler calls System.exit().

This patch avoids two related issues:

  1. Always catch exceptions in this reader thread.
  2. Don't mask readerException when Python throws an EOFError
     after worker.shutdownOutput() is called.

Author: Aaron Davidson <aaron@databricks.com>

Closes #486 from aarondav/pyspark and squashes the following commits:

fbb11e9 [Aaron Davidson] Make sure FileNotFoundExceptions are handled same as before
b9acb3e [Aaron Davidson] SPARK-1572 Don't kill Executor if PythonRDD fails while computing parent
2014-04-23 14:46:30 -07:00
zsxwing a664606613 SPARK-1583: Fix a bug that using java.util.HashMap by mistake
JIRA: https://issues.apache.org/jira/browse/SPARK-1583

Does anyone know why `java.util.HashMap` is used rather than `mutable.HashMap`? Some methods of `java.util.HashMap` are not generic, so the compiler cannot help us find similar problems.
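For illustration, the kind of mistake the compiler cannot catch with `java.util.HashMap` (keys hypothetical):

```scala
// java.util.HashMap#get takes Object, so a wrong-typed key compiles
// silently and simply never matches.
val jmap = new java.util.HashMap[String, Integer]()
jmap.put("a", 1)
jmap.get(42)      // compiles; always null

val smap = scala.collection.mutable.HashMap("a" -> 1)
// smap.get(42)   // does not compile: type mismatch
```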

Author: zsxwing <zsxwing@gmail.com>

Closes #500 from zsxwing/SPARK-1583 and squashes the following commits:

7bfd74d [zsxwing] SPARK-1583: Fix a bug that using java.util.HashMap by mistake
2014-04-23 14:12:20 -07:00
Patrick Wendell cd4ed29326 SPARK-1119 and other build improvements
1. Makes assembly and examples jar naming consistent in maven/sbt.
2. Updates make-distribution.sh to use Maven and fixes some bugs.
3. Updates the create-release script to call make-distribution script.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #502 from pwendell/make-distribution and squashes the following commits:

1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements
2014-04-23 10:19:32 -07:00
Michael Armbrust 39f85e0322 [SQL] SPARK-1571 Mistake in java example code
Author: Michael Armbrust <michael@databricks.com>

Closes #496 from marmbrus/javaBeanBug and squashes the following commits:

644fedd [Michael Armbrust] Bean methods must be public.
2014-04-22 22:19:32 -07:00