Commit graph

7800 commits

Author SHA1 Message Date
Andrew Or 4b2bab1d08 [Hot Fix #469] Fix flaky test in SparkListenerSuite
The two modified tests may fail if the race condition does not resolve in our favor...

Author: Andrew Or <andrewor14@gmail.com>

Closes #516 from andrewor14/stage-info-test-fix and squashes the following commits:

b4b6100 [Andrew Or] Add/replace missing waitUntilEmpty() calls to listener bus
2014-04-23 21:59:33 -07:00
Matei Zaharia 640f9a0efe [SPARK-1540] Add an optional Ordering parameter to PairRDDFunctions.
In https://issues.apache.org/jira/browse/SPARK-1540 we'd like to look at Spark's API to see if we can take advantage of Comparable keys in more places, which will make external spilling more efficient. This PR is a first step towards that: it shows how to pass an Ordering when available and still continue functioning otherwise. It does this using a new implicit parameter with a default value of null.

The API is currently only in Scala -- in Java we'd have to add new versions of mapToPair and such that take a Comparator, or a new method to add a "type hint" to an RDD. We can address those later though.

Unfortunately requiring all keys to be Comparable would not work without requiring RDDs in general to contain only Comparable types. The reason is that methods such as distinct() and intersection() do a shuffle, but should be usable on RDDs of any type. So ordering will have to remain an optimization for the types that can be ordered. I think this isn't a horrible outcome though because one of the nice things about Spark's API is that it works on objects of *any* type, without requiring you to specify a schema or implement Writable or stuff like that.
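
A minimal sketch of the optional-ordering idea, written in Python rather than Scala purely for illustration. The function `group_by_key` and its `key_ordering` parameter are hypothetical and not part of Spark's API; the default of `None` plays the role of the implicit parameter that defaults to null.

```python
from collections import defaultdict
from itertools import groupby


def group_by_key(pairs, key_ordering=None):
    """Group (key, value) pairs, taking a sort-based path only when an ordering is supplied.

    key_ordering is a hypothetical stand-in for an optional Ordering: a function
    mapping a key to a sortable value, or None when keys cannot be ordered.
    It is assumed to be a total order (distinct keys never map to equal values).
    """
    if key_ordering is None:
        # Fallback path: works for any hashable key, no ordering required.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return dict(groups)

    # Optimized path: sort by key, then collect each run of equal keys.
    ordered = sorted(pairs, key=lambda kv: key_ordering(kv[0]))
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=lambda kv: kv[0])
    }
```

Both paths return the same grouping; the ordering only changes how the result is computed, which is why it can stay an optimization rather than a requirement.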

Author: Matei Zaharia <matei@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@apache.org>

Closes #487 from mateiz/ordered-keys and squashes the following commits:

bd565f6 [Matei Zaharia] Pass an Ordering to only one version of groupBy because the Scala language spec doesn't allow having an optional parameter on all of them (this was only compiling in Scala 2.10 due to a bug).
4629965 [Matei Zaharia] Add tests for other versions of groupBy
3beae85 [Matei Zaharia] Added a test for implicit orderings
80b7a3b [Matei Zaharia] Add an optional Ordering parameter to PairRDDFunctions.
2014-04-23 17:03:54 -07:00
Aaron Davidson 432201c7ee SPARK-1582 Invoke Thread.interrupt() when cancelling jobs
Sometimes executor threads are blocked waiting for IO or monitors, and the current implementation of job cancellation may never recover these threads. By simply invoking Thread.interrupt() during cancellation, we can often safely unblock the threads and use them for subsequent work.

Note that this feature must remain optional for now because of a bug in HDFS where Thread.interrupt() may cause nodes to be marked as permanently dead (as the InterruptedException is reinterpreted as an IOException during communication with some node).

Author: Aaron Davidson <aaron@databricks.com>

Closes #498 from aarondav/cancel and squashes the following commits:

e52b829 [Aaron Davidson] Don't use job.properties when null
82f78bb [Aaron Davidson] Update DAGSchedulerSuite
b67f472 [Aaron Davidson] Add comment on why interruptOnCancel is in setJobGroup
4cb9fd6 [Aaron Davidson] SPARK-1582 Invoke Thread.interrupt() when cancelling jobs
2014-04-23 16:52:49 -07:00
Marcelo Vanzin dd1b7a61d9 Honor default fs name when initializing event logger.
This is related to SPARK-1459 / PR #375. Without this fix,
FileLogger.createLogDir() may try to create the log dir on
HDFS, while createWriter() will try to open the log file on
the local file system, leading to interesting errors and
confusion.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #450 from vanzin/event-file-2 and squashes the following commits:

592cdb3 [Marcelo Vanzin] Honor default fs name when initializing event logger.
2014-04-23 14:47:38 -07:00
Aaron Davidson a967b005c8 SPARK-1572 Don't kill Executor if PythonRDD fails while computing parent
Previously, the behavior was that if the parent RDD threw any exception other than IOException or FileNotFoundException (which is quite possible for Hadoop input sources), the entire Executor would crash, because the default thread uncaught exception handler calls System.exit().

This patch addresses two related issues:

  1. Always catch exceptions in this reader thread.
  2. Don't mask readerException when Python throws an EOFError
     after worker.shutdownOutput() is called.
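
The two points above can be sketched as follows. This is an illustration in Python, not the PythonRDD code: `ReaderThread` and `consume` are hypothetical names, and unlike the JVM setup described here, an uncaught exception in a Python thread does not terminate the process. The sketch only shows the pattern of recording the reader's failure and preferring it over a secondary EOFError.

```python
import threading


class ReaderThread(threading.Thread):
    """Hypothetical reader thread that records failures instead of letting them escape."""

    def __init__(self, source):
        super().__init__(daemon=True)
        self._source = source    # any iterable standing in for the parent's iterator
        self.items = []
        self.exception = None    # failure seen while reading, if any

    def run(self):
        try:
            for item in self._source:
                self.items.append(item)
        except Exception as exc:
            # Point 1: always catch here so the failure never reaches a global handler.
            self.exception = exc


def consume(source, downstream):
    """Run the reader, then hand its items to downstream on the calling thread."""
    reader = ReaderThread(source)
    reader.start()
    reader.join()
    try:
        result = downstream(reader.items)
    except EOFError:
        # Point 2: a secondary EOFError must not hide the reader's original failure.
        if reader.exception is not None:
            raise reader.exception
        raise
    if reader.exception is not None:
        raise reader.exception
    return result
```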

Author: Aaron Davidson <aaron@databricks.com>

Closes #486 from aarondav/pyspark and squashes the following commits:

fbb11e9 [Aaron Davidson] Make sure FileNotFoundExceptions are handled same as before
b9acb3e [Aaron Davidson] SPARK-1572 Don't kill Executor if PythonRDD fails while computing parent
2014-04-23 14:46:30 -07:00
zsxwing a664606613 SPARK-1583: Fix a bug that using java.util.HashMap by mistake
JIRA: https://issues.apache.org/jira/browse/SPARK-1583

Does anyone know why `java.util.HashMap` is used rather than `mutable.HashMap`? Some methods of `java.util.HashMap` are not generic, so the compiler cannot help us find similar problems.

Author: zsxwing <zsxwing@gmail.com>

Closes #500 from zsxwing/SPARK-1583 and squashes the following commits:

7bfd74d [zsxwing] SPARK-1583: Fix a bug that using java.util.HashMap by mistake
2014-04-23 14:12:20 -07:00
Patrick Wendell cd4ed29326 SPARK-1119 and other build improvements
1. Makes assembly and examples jar naming consistent in maven/sbt.
2. Updates make-distribution.sh to use Maven and fixes some bugs.
3. Updates the create-release script to call make-distribution script.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #502 from pwendell/make-distribution and squashes the following commits:

1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements
2014-04-23 10:19:32 -07:00
Michael Armbrust 39f85e0322 [SQL] SPARK-1571 Mistake in java example code
Author: Michael Armbrust <michael@databricks.com>

Closes #496 from marmbrus/javaBeanBug and squashes the following commits:

644fedd [Michael Armbrust] Bean methods must be public.
2014-04-22 22:19:32 -07:00
Michael Armbrust 8e95081333 SPARK-1494 Don't initialize classes loaded by MIMA excludes.
[WIP]  Just seeing how Jenkins likes this...

Author: Michael Armbrust <michael@databricks.com>

Closes #494 from marmbrus/mima and squashes the following commits:

6eec616 [Michael Armbrust] Force hive tests to run.
acaf682 [Michael Armbrust] Don't initialize loaded classes.
2014-04-22 22:02:42 -07:00
Michael Armbrust aa77f8a6a6 SPARK-1562 Fix visibility / annotation of Spark SQL APIs
Author: Michael Armbrust <michael@databricks.com>

Closes #489 from marmbrus/sqlDocFixes and squashes the following commits:

acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs
2014-04-22 20:02:33 -07:00
Xiangrui Meng 662c860ebc [FIX: SPARK-1376] use --arg instead of --args in SparkSubmit to avoid warning messages
Even if users use `--arg`, `SparkSubmit` still uses `--args` for child args internally, which triggers a warning message that may confuse users:

~~~
--args is deprecated. Use --arg instead.
~~~

@sryza Does it look good to you?

Author: Xiangrui Meng <meng@databricks.com>

Closes #485 from mengxr/submit-arg and squashes the following commits:

5e1b9fe [Xiangrui Meng] update test
cebbeb7 [Xiangrui Meng] use --arg instead of --args in SparkSubmit to avoid warning messages
2014-04-22 19:38:27 -07:00
Tathagata Das f3d19a9f1a [streaming][SPARK-1578] Removed requirement for TTL in StreamingContext.
Since shuffles and RDDs that are out of context are automatically cleaned by Spark core (using ContextCleaner) there is no need for setting the cleaner TTL while creating a StreamingContext.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #491 from tdas/ttl-fix and squashes the following commits:

cf01dc7 [Tathagata Das] Removed requirement for TTL in StreamingContext.
2014-04-22 19:35:13 -07:00
Andrew Or 2de573877f [Spark-1538] Fix SparkUI incorrectly hiding persisted RDDs
**Bug**: After the following command `sc.parallelize(1 to 1000).persist.map(_ + 1).count()` is run, the persisted RDD is missing from the storage tab of the SparkUI.

**Cause**: The command creates two RDDs in one stage, a `ParallelCollectionRDD` and a `MappedRDD`. However, the existing StageInfo only keeps the RDDInfo of the last RDD associated with the stage (`MappedRDD`), and so all RDD information regarding the first RDD (`ParallelCollectionRDD`) is discarded. In this case, we persist the first RDD,  but the StorageTab doesn't know about this RDD because it is not encoded in the StageInfo.

**Fix**: Record information of all RDDs in StageInfo, instead of just the last RDD (i.e. `stage.rdd`). Since stage boundaries are marked by shuffle dependencies, the solution is to traverse the last RDD's dependency tree, visiting only ancestor RDDs related through a sequence of narrow dependencies.
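
A rough sketch of that traversal, in Python for illustration only. The function `narrow_ancestors` and the `narrow_parents` mapping are hypothetical; the mapping is assumed to list only parents reached through narrow dependencies, so shuffle dependencies are simply absent. The visited set is what keeps the walk finite if the graph contains a cycle, and excluding the starting RDD from the result is a choice made for this sketch.

```python
def narrow_ancestors(rdd, narrow_parents):
    """Return every ancestor reachable from rdd through narrow dependencies only.

    narrow_parents is a hypothetical mapping: node -> list of parents linked
    by narrow (non-shuffle) dependencies.
    """
    visited = set()
    stack = list(narrow_parents.get(rdd, []))
    while stack:
        node = stack.pop()
        if node == rdd or node in visited:
            continue    # already seen, or a cycle led back to the start
        visited.add(node)
        stack.extend(narrow_parents.get(node, []))
    return visited
```

With a mapping of `{"mapped": ["parallel"], "parallel": []}`, calling `narrow_ancestors("mapped", ...)` yields the single ancestor `"parallel"`, which mirrors the two-RDD stage described in the cause above.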

---

This PR also moves RDDInfo to its own file, includes a few style fixes, and adds a unit test for constructing StageInfos.

Author: Andrew Or <andrewor14@gmail.com>

Closes #469 from andrewor14/storage-ui-fix and squashes the following commits:

07fc7f0 [Andrew Or] Add back comment that was accidentally removed (minor)
5d799fe [Andrew Or] Add comment to justify testing of getNarrowAncestors with cycles
9d0e2b8 [Andrew Or] Hide details of getNarrowAncestors from outsiders
d2bac8a [Andrew Or] Deal with cycles in RDD dependency graph + add extensive tests
2acb177 [Andrew Or] Move getNarrowAncestors to RDD.scala
bfe83f0 [Andrew Or] Backtrace RDD dependency tree to find all RDDs that belong to a Stage
2014-04-22 19:24:03 -07:00
Patrick Wendell 995fdc96bc Assorted clean-up for Spark-on-YARN.
In particular, when HADOOP_CONF_DIR is not specified.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #488 from pwendell/hadoop-cleanup and squashes the following commits:

fe95f13 [Patrick Wendell] Changes based on Andrew's feedback
18d09c1 [Patrick Wendell] Review comments from Andrew
17929cc [Patrick Wendell] Assorted clean-up for Spark-on-YARN.
2014-04-22 19:22:06 -07:00
Kan Zhang ea8cea82a0 [SPARK-1570] Fix classloading in JavaSQLContext.applySchema
I think I hit a class loading issue when running JavaSparkSQL example using spark-submit in local mode.

Author: Kan Zhang <kzhang@apache.org>

Closes #484 from kanzhang/SPARK-1570 and squashes the following commits:

feaaeba [Kan Zhang] [SPARK-1570] Fix classloading in JavaSQLContext.applySchema
2014-04-22 15:05:12 -07:00
Marcelo Vanzin 0ea0b1a2d6 Fix compilation on Hadoop 2.4.x.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #483 from vanzin/yarn-2.4 and squashes the following commits:

0fc57d8 [Marcelo Vanzin] Fix compilation on Hadoop 2.4.x.
2014-04-22 14:28:41 -07:00
Andrew Or 745e496c59 [Fix #204] Eliminate delay between binding and log checking
**Bug**: In the existing history server, there is a `spark.history.updateInterval` seconds delay before application logs show up on the UI.

**Cause**: This is because the following events happen in this order: (1) The background thread that checks for logs starts, but realizes the server has not yet bound and so waits for N seconds, (2) server binds, (3) N seconds later the background thread finds that the server has finally bound to a port, and so finally checks for application logs.

**Fix**: This PR forces the log checking thread to start immediately after binding. It also documents two relevant environment variables that are currently missing.

Author: Andrew Or <andrewor14@gmail.com>

Closes #441 from andrewor14/history-server-fix and squashes the following commits:

b2eb46e [Andrew Or] Document SPARK_PUBLIC_DNS and SPARK_HISTORY_OPTS for the history server
e8d1fbc [Andrew Or] Eliminate delay between binding and checking for logs
2014-04-22 14:27:49 -07:00
Xiangrui Meng 26d35f3fd9 [SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0
Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng <meng@databricks.com>

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
2014-04-22 11:20:47 -07:00
Tor Myklebust bf9d49b6d1 [SPARK-1281] Improve partitioning in ALS
ALS was using HashPartitioner and explicit uses of `%` together.  Further, the naked use of `%` meant that, if the number of partitions corresponded with the stride of arithmetic progressions appearing in user and product ids, users and products could be mapped into buckets in an unfair or unwise way.

This pull request:
1) Makes the Partitioner an instance variable of ALS.
2) Replaces the direct uses of `%` with calls to a Partitioner.
3) Defines an anonymous Partitioner that scrambles the bits of the object's hashCode before reducing to the number of present buckets.
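
To illustrate why a naked `%` can bucket unfairly, here is a small Python sketch. The function names are hypothetical, and the multiplicative mixing step (the 32-bit golden-ratio constant from Fibonacci hashing) is an assumption chosen for the example; it is not the bit-scrambling function used in this pull request.

```python
def naive_bucket(item_id, num_buckets):
    """Bucket by plain modulo, as a direct use of `%` would."""
    return item_id % num_buckets


def scrambled_bucket(item_id, num_buckets):
    """Mix the bits of the id before reducing it to a bucket index.

    The multiplier is an illustrative choice, not the mixing function from the PR.
    """
    mixed = ((item_id & 0xFFFFFFFF) * 2654435769) & 0xFFFFFFFF
    return (mixed >> 16) % num_buckets


ids = range(0, 1000, 10)    # an arithmetic progression with stride 10
naive = {naive_bucket(i, 10) for i in ids}
scrambled = {scrambled_bucket(i, 10) for i in ids}
print(len(naive), len(scrambled))    # distinct buckets used by each scheme
```

With ten buckets and ids that are all multiples of ten, the plain modulo sends every id to bucket 0, while the scrambled version spreads them over several buckets.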

This pull request does not make the partitioner user-configurable.

I'm not all that happy about the way I did (1).  It introduces an icky lifetime issue and dances around it by nulling something.  However, I don't know a better way to make the partitioner visible everywhere it needs to be visible.

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #407 from tmyklebu/master and squashes the following commits:

dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-04-22 11:07:30 -07:00
Xusen Yin c919798f09 fix bugs of dot in python
If `self.theta` is not transposed with `transpose()`, a

*ValueError: matrices are not aligned*

occurs. The former test case simply ignored this situation.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #463 from yinxusen/python-naive-bayes and squashes the following commits:

fcbe3bc [Xusen Yin] fix bugs of dot in python
2014-04-22 11:06:18 -07:00
Ahir Reddy 0f87e6ad43 [SPARK-1560]: Updated Pyrolite Dependency to be Java 6 compatible
Changed the Pyrolite dependency to a build which targets Java 6.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #479 from ahirreddy/java6-pyrolite and squashes the following commits:

8ea25d3 [Ahir Reddy] Updated maven build to use java 6 compatible pyrolite
dabc703 [Ahir Reddy] Updated Pyrolite dependency to be Java 6 compatible
2014-04-22 09:44:41 -07:00
CodingCat 87de29084e [HOTFIX] SPARK-1399: remove outdated comments
The original PR was merged before this mistake was found, so it is fixed here.

Sorry about that @pwendell, @andrewor14, I will be more careful next time

Author: CodingCat <zhunansjtu@gmail.com>

Closes #474 from CodingCat/hotfix_1399 and squashes the following commits:

f3a8ba9 [CodingCat] move outdated comments
2014-04-22 09:43:13 -07:00
Patrick Wendell 83084d3b7b SPARK-1496: Have jarOfClass return Option[String]
A simple change, mostly had to change a bunch of example code.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #438 from pwendell/jar-of-class and squashes the following commits:

aa010ff [Patrick Wendell] SPARK-1496: Have jarOfClass return Option[String]
2014-04-22 00:42:16 -07:00
Marcelo Vanzin ac164b79d1 [SPARK-1459] Use local path (and not complete URL) when opening local lo...
...g file.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #375 from vanzin/event-file and squashes the following commits:

f673029 [Marcelo Vanzin] [SPARK-1459] Use local path (and not complete URL) when opening local log file.
2014-04-21 23:10:53 -07:00
Andrew Or b3e5366f69 [Fix #274] Document + fix annotation usages
... so that we don't follow an unspoken set of forbidden rules for adding **@AlphaComponent**, **@DeveloperApi**, and **@Experimental** annotations in the code.

In addition, this PR
(1) removes unnecessary `:: * ::` tags,
(2) adds missing `:: * ::` tags, and
(3) removes annotations for internal APIs.

Author: Andrew Or <andrewor14@gmail.com>

Closes #470 from andrewor14/annotations-fix and squashes the following commits:

92a7f42 [Andrew Or] Document + fix annotation usages
2014-04-21 22:24:44 -07:00
Matei Zaharia fc78384704 [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs
I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is an SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.

Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/

Author: Matei Zaharia <matei@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>

Closes #457 from mateiz/better-docs and squashes the following commits:

a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
f05abc0 [Matei Zaharia] Don't include java.lang package names
995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
a14a93c [Matei Zaharia] typo
76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
2014-04-21 21:57:40 -07:00
Tathagata Das 04c37b6f74 [SPARK-1332] Improve Spark Streaming's Network Receiver and InputDStream API [WIP]
The current Network Receiver API makes it slightly complicated to write a new receiver, as one needs to create an instance of BlockGenerator as shown in SocketReceiver
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51

Exposing the BlockGenerator interface has made it harder to improve the receiving process. The API of NetworkReceiver (which was not a very stable API anyway) needs to be changed if we are to ensure future stability.

Additionally, functions like streamingContext.socketStream that create input streams return DStream objects. That makes it hard to expose functionality (say, rate limits) unique to input dstreams. They should return InputDStream or NetworkInputDStream. This is not yet implemented.

This PR is blocked on the graceful shutdown PR #247

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #300 from tdas/network-receiver-api and squashes the following commits:

ea27b38 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3a4777c [Tathagata Das] Renamed NetworkInputDStream to ReceiverInputDStream, and ActorReceiver related stuff.
838dd39 [Tathagata Das] Added more events to the StreamingListener to report errors and stopped receivers.
a75c7a6 [Tathagata Das] Address some PR comments and fixed other issues.
91bfa72 [Tathagata Das] Fixed bugs.
8533094 [Tathagata Das] Scala style fixes.
028bde6 [Tathagata Das] Further refactored receiver to allow restarting of a receiver.
43f5290 [Tathagata Das] Made functions that create input streams return InputDStream and NetworkInputDStream, for both Scala and Java.
2c94579 [Tathagata Das] Fixed graceful shutdown by removing interrupts on receiving thread.
9e37a0b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into network-receiver-api
3223e95 [Tathagata Das] Refactored the code that runs the NetworkReceiver into further classes and traits to make them more testable.
a36cc48 [Tathagata Das] Refactored the NetworkReceiver API for future stability.
2014-04-21 19:04:49 -07:00
Patrick Wendell 5a5b3346c7 Dev script: include RC name in git tag 2014-04-21 14:21:17 -07:00
CodingCat 43e4a29dac SPARK-1399: show stage failure reason in UI
https://issues.apache.org/jira/browse/SPARK-1399

refactor StageTable a bit to support additional column for failed stage

Author: CodingCat <zhunansjtu@gmail.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>

Closes #421 from CodingCat/SPARK-1399 and squashes the following commits:

2caba36 [CodingCat] remove dummy tag
77cf305 [CodingCat] create dummy element to wrap columns
3989ce2 [CodingCat] address Aaron's comments
18fc09f [Nan Zhu] fix compile error
00ea30a [Nan Zhu] address Kay's comments
16ac83d [CodingCat] set a default value of failureReason
35df3df [CodingCat] address andrew's comments
06d21a4 [CodingCat] address andrew's comments
25a6db6 [CodingCat] style fix
dc8856d [CodingCat] show stage failure reason in UI
2014-04-21 14:10:23 -07:00
Xiangrui Meng b7df31eb34 SPARK-1539: RDDPage.scala contains RddPage class
SPARK-1386 changed RDDPage to RddPage but didn't change the filename. I tried sbt/sbt publish-local. Inside the spark-core jar, the unit name is RDDPage.class and hence I got the following error:

~~~
[error] (run-main) java.lang.NoClassDefFoundError: org/apache/spark/ui/storage/RddPage
java.lang.NoClassDefFoundError: org/apache/spark/ui/storage/RddPage
	at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59)
	at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:52)
	at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:215)
	at MovieLensALS$.main(MovieLensALS.scala:38)
	at MovieLensALS.main(MovieLensALS.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.storage.RddPage
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59)
	at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:52)
	at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:215)
	at MovieLensALS$.main(MovieLensALS.scala:38)
	at MovieLensALS.main(MovieLensALS.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
~~~

This can be fixed after renaming RddPage to RDDPage, or renaming RDDPage.scala to RddPage.scala. I chose the former since the name `RDD` is common in Spark code.

Author: Xiangrui Meng <meng@databricks.com>

Closes #454 from mengxr/rddpage-fix and squashes the following commits:

f75e544 [Xiangrui Meng] rename RddPage to RDDPage
2014-04-21 12:48:02 -07:00
Andrew Or af46f1fd02 [Hot Fix] Ignore org.apache.spark.ui.UISuite tests
#446 faced a connection refused exception from these tests, causing them to time out and fail after a long time. For now, let's disable these tests.

(We recently disabled the corresponding test in streaming in 7863ecca35. These tests are very similar).

Author: Andrew Or <andrewor14@gmail.com>

Closes #466 from andrewor14/ignore-ui-tests and squashes the following commits:

6f5a362 [Andrew Or] Ignore org.apache.spark.ui.UISuite tests
2014-04-21 12:37:43 -07:00
Patrick Wendell fb98488fc8 Clean up and simplify Spark configuration
Over time, as we've added more deployment modes, things have gotten a bit unwieldy with user-facing configuration options in Spark. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch but it makes the following improvements:

1. Improved `spark-env.sh.template` which was missing a lot of things users now set in that file.
2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces config variables spark.executor.extraJavaOpts, spark.executor.extraLibraryPath, and spark.executor.extraClassPath.
3. Adds ability to set these same variables for the driver using `spark-submit`.
4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This will allow setting both SparkConf options and other system properties utilized by `spark-submit`.
5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #299 from pwendell/config-cleanup and squashes the following commits:

127f301 [Patrick Wendell] Improvements to testing
a006464 [Patrick Wendell] Moving properties file template.
b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf
0086939 [Patrick Wendell] Minor style fixes
af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs
b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide
af0adf7 [Patrick Wendell] Automatically add user jar
a56b125 [Patrick Wendell] Responses to Tom's review
d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a762901 [Patrick Wendell] Fixing test failures
ffa00fe [Patrick Wendell] Review feedback
fda0301 [Patrick Wendell] Note
308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN
e83cd8f [Patrick Wendell] Changes to allow re-use of test applications
be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set
c2a2909 [Patrick Wendell] Test compile fixes
4ee6f9d [Patrick Wendell] Making YARN doc changes consistent
afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors.
b08893b [Patrick Wendell] Additional improvements.
ace4ead [Patrick Wendell] Responses to review feedback.
b72d183 [Patrick Wendell] Review feedback for spark env file
46555c1 [Patrick Wendell] Review feedback and import clean-ups
437aed1 [Patrick Wendell] Small fix
761ebcd [Patrick Wendell] Library path and classpath for drivers
7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script
5b0ba8e [Patrick Wendell] Don't ship executor envs
84cc5e5 [Patrick Wendell] Small clean-up
1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings
4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH
6eaf7d0 [Patrick Wendell] executorJavaOpts
0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN
ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS
2014-04-21 10:26:33 -07:00
Michael Armbrust 3a390bfd80 REPL cleanup.
Author: Michael Armbrust <michael@databricks.com>

Closes #451 from marmbrus/replCleanup and squashes the following commits:

088526a [Michael Armbrust] REPL cleanup.
2014-04-19 17:33:37 -07:00
Tor Myklebust 25fc31884b [SPARK-1535] ALS: Avoid the garbage-creating ctor of DoubleMatrix
`new DoubleMatrix(double[])` creates a garbage `double[]` of the same length as its argument and immediately throws it away.  This pull request avoids that constructor in the ALS code.

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #442 from tmyklebu/foo2 and squashes the following commits:

2784fc5 [Tor Myklebust] Mention that this is probably fixed as of jblas 1.2.4; repunctuate.
a09904f [Tor Myklebust] Helper function for wrapping Array[Double]'s with DoubleMatrix's.
2014-04-19 15:10:18 -07:00
Michael Armbrust 10d04213ff Add insertInto and saveAsTable to Python API.
Author: Michael Armbrust <michael@databricks.com>

Closes #447 from marmbrus/pythonInsert and squashes the following commits:

c7ab692 [Michael Armbrust] Keep docstrings < 72 chars.
ff62870 [Michael Armbrust] Add insertInto and saveAsTable to Python API.
2014-04-19 15:08:54 -07:00
Michael Armbrust 5d0f58b2eb Use scala deprecation instead of java.
This gets rid of a warning when compiling core (since we were depending on a deprecated interface with a non-deprecated function).  I also tested with javac, and this does the right thing when compiling java code.

Author: Michael Armbrust <michael@databricks.com>

Closes #452 from marmbrus/scalaDeprecation and squashes the following commits:

f628b4d [Michael Armbrust] Use scala deprecation instead of java.
2014-04-19 15:06:04 -07:00
Reynold Xin 28238c81d9 README update
Author: Reynold Xin <rxin@apache.org>

Closes #443 from rxin/readme and squashes the following commits:

16853de [Reynold Xin] Updated SBT and Scala instructions.
3ac3ceb [Reynold Xin] README update
2014-04-18 22:34:39 -07:00
zsxwing 2089e0e7e7 SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and save...
...AsNewAPIHadoopDataset

`writer.close` should be put in the `finally` block to avoid potential resource leaks.
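
The pattern described above can be sketched in a few lines. This is a minimal illustration in Python, not the Scala code touched by the commit; `RecordWriter` and `save_records` are hypothetical names used only here.

```python
class RecordWriter:
    """Hypothetical stand-in for a Hadoop record writer."""

    def __init__(self):
        self.closed = False
        self.records = []

    def write(self, key, value):
        # Simulate a write that can fail partway through a partition.
        if value is None:
            raise ValueError("cannot write a null value")
        self.records.append((key, value))

    def close(self):
        self.closed = True


def save_records(writer, records):
    """Write all records, closing the writer even if a write fails."""
    try:
        for key, value in records:
            writer.write(key, value)
    finally:
        # Runs on both the success path and the exception path,
        # so the underlying resource is never left open.
        writer.close()
```

Without the `finally` block, an exception raised by `write` would skip `close` and leave the writer open.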

JIRA: https://issues.apache.org/jira/browse/SPARK-1482

Author: zsxwing <zsxwing@gmail.com>

Closes #400 from zsxwing/SPARK-1482 and squashes the following commits:

06b197a [zsxwing] SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
2014-04-18 17:49:22 -07:00
Michael Armbrust c399baa0fc SPARK-1456 Remove view bounds on Ordered in favor of a context bound on Ordering.
This doesn't require creating new Ordering objects per row.  Additionally, [view bounds are going to be deprecated](https://issues.scala-lang.org/browse/SI-7629), so we should get rid of them while APIs are still flexible.

Author: Michael Armbrust <michael@databricks.com>

Closes #410 from marmbrus/viewBounds and squashes the following commits:

c574221 [Michael Armbrust] fix example.
812008e [Michael Armbrust] Update Java API.
1b9b85c [Michael Armbrust] Update scala doc.
35798a8 [Michael Armbrust] Remove view bounds on Ordered in favor of a context bound on Ordering.
2014-04-18 12:04:13 -07:00
Reynold Xin 81a152c54b Fixed broken pyspark shell.
Author: Reynold Xin <rxin@apache.org>

Closes #444 from rxin/pyspark and squashes the following commits:

fc11356 [Reynold Xin] Made the PySpark shell version checking compatible with Python 2.6.
571830b [Reynold Xin] Fixed broken pyspark shell.
2014-04-18 10:10:13 -07:00
CodingCat 3c7a9bae96 SPARK-1523: improve the readability of code in AkkaUtil
This was split out of https://github.com/apache/spark/pull/85, as suggested by @rxin

compare

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L122

and

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L117

The first one uses get and then toLong; the second one uses getLong. It is better to make them consistent.

A very small fix.

Author: CodingCat <zhunansjtu@gmail.com>

Closes #434 from CodingCat/SPARK-1523 and squashes the following commits:

0e86f3f [CodingCat] improve the readability of code in AkkaUtil
2014-04-18 10:05:00 -07:00
Sean Owen 8aa1f4c4f6 SPARK-1357 (addendum). More Experimental items in MLlib
Per discussion, this is my suggestion to make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0. See what you think of this much.

Author: Sean Owen <sowen@cloudera.com>

Closes #372 from srowen/SPARK-1357Addendum and squashes the following commits:

17cf1ea [Sean Owen] Remove (another) blank line after ":: Experimental ::"
6800e4c [Sean Owen] Remove blank line after ":: Experimental ::"
b3a88d2 [Sean Owen] Make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0
2014-04-18 10:04:02 -07:00
Xiangrui Meng aa17f022c5 [SPARK-1520] remove fastutil from dependencies
A quick fix for https://issues.apache.org/jira/browse/SPARK-1520

By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6.

With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven.
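
A rough way to check an archive against this limit is to count its entries. The sketch below uses Python's standard `zipfile` module and is not part of the Spark build; the helper names and the 65535 threshold (the largest value a 16-bit count field can hold) are assumptions made for illustration.

```python
import io
import zipfile

# Assumption: a classic (non-zip64) archive records its entry count in a
# 16-bit field, so 65535 is used here as the largest count that fits.
CLASSIC_ZIP_MAX_ENTRIES = 0xFFFF


def count_entries(archive):
    """Return (total entries, file entries) for a path or file-like object."""
    with zipfile.ZipFile(archive) as zf:
        infos = zf.infolist()
    # Directory entries count toward the total but are not files.
    files = sum(1 for info in infos if not info.is_dir())
    return len(infos), files


def fits_classic_zip(archive, limit=CLASSIC_ZIP_MAX_ENTRIES):
    """True if the entry count alone does not force the zip64 format."""
    total, _ = count_entries(archive)
    return total <= limit


def build_sample_archive():
    """Build a tiny in-memory archive: one directory entry and two files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("org/", "")
        zf.writestr("org/A.class", b"a")
        zf.writestr("org/B.class", b"b")
    buffer.seek(0)
    return buffer
```

Counting entries and files separately mirrors the distinction drawn above between total entries and files, since directory entries also count toward the limit.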

Author: Xiangrui Meng <meng@databricks.com>

Closes #437 from mengxr/remove-fastutil and squashes the following commits:

00f9beb [Xiangrui Meng] remove fastutil from dependencies
2014-04-18 10:03:15 -07:00
Cheng Lian 89f47434e2 Reuses Row object in ExistingRdd.productToRowRdd()
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #432 from liancheng/reuseRow and squashes the following commits:

9e6d083 [Cheng Lian] Simplified code with BufferedIterator
52acec9 [Cheng Lian] Reuses Row object in ExistingRdd.productToRowRdd()
2014-04-18 10:02:27 -07:00
CodingCat e31c8ffca6 SPARK-1483: Rename minSplits to minPartitions in public APIs
https://issues.apache.org/jira/browse/SPARK-1483

From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz
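
To illustrate why a parameter name is part of the public API when callers can pass it by keyword, here is a minimal Python sketch. `text_file` is a hypothetical function, not the real PySpark signature, and the deprecation shim shows one possible way to keep old callers working; it does not claim to be what this commit does.

```python
import warnings


def text_file(path, min_partitions=None, **legacy):
    """Hypothetical loader that accepts the old keyword with a warning."""
    if "min_splits" in legacy:
        old_value = legacy.pop("min_splits")
        if min_partitions is not None:
            raise TypeError("pass min_partitions or min_splits, not both")
        warnings.warn(
            "min_splits is deprecated; use min_partitions",
            DeprecationWarning,
            stacklevel=2,
        )
        min_partitions = old_value
    if legacy:
        raise TypeError("unexpected keyword arguments: %s" % sorted(legacy))
    # Assumed default of 2 partitions, chosen only for the illustration.
    return (path, 2 if min_partitions is None else min_partitions)
```

A caller written as `text_file("data.txt", min_splits=3)` would break outright after a plain rename, which is why the name change is visible to users and not only to maintainers.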

Author: CodingCat <zhunansjtu@gmail.com>

Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:

4b60541 [CodingCat] deprecate defaultMinSplits
ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
2014-04-18 10:01:16 -07:00
Patrick Wendell 7863ecca35 HOTFIX: Ignore streaming UI test
This is currently causing many builds to hang.

https://issues.apache.org/jira/browse/SPARK-1530

Author: Patrick Wendell <pwendell@gmail.com>

Closes #440 from pwendell/uitest-fix and squashes the following commits:

9a143dc [Patrick Wendell] Ignore streaming UI test
2014-04-17 17:33:24 -07:00
Patrick Wendell 6c746ba3a9 FIX: Don't build Hive in assembly unless running Hive tests.
This will make the tests more stable when not running SQL tests.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #439 from pwendell/hive-tests and squashes the following commits:

88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests.
2014-04-17 17:24:00 -07:00
Thomas Graves 0058b5d2c7 SPARK-1408 Modify Spark on Yarn to point to the history server when app ...
...finishes

Note this is dependent on https://github.com/apache/spark/pull/204 to have a working history server, but there are no code dependencies.

This also fixes SPARK-1288 (yarn stable finishApplicationMaster incomplete). While I was in there, I also made sure the diagnostic message is passed properly.

Author: Thomas Graves <tgraves@apache.org>

Closes #362 from tgravescs/SPARK-1408 and squashes the following commits:

ec89705 [Thomas Graves] Fix typo.
446122d [Thomas Graves] Make config yarn specific
f5d5373 [Thomas Graves] SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes
2014-04-17 16:36:37 -05:00
Marcelo Vanzin 69047506bf [SPARK-1395] Allow "local:" URIs to work on Yarn.
This only works for the three paths defined in the environment
(SPARK_JAR, SPARK_YARN_APP_JAR and SPARK_LOG4J_CONF).

Tested by running SparkPi with local: and file: URIs against Yarn cluster (no "upload" shows up in logs in the local case).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #303 from vanzin/yarn-local and squashes the following commits:

82219c1 [Marcelo Vanzin] [SPARK-1395] Allow "local:" URIs to work on Yarn.
2014-04-17 10:29:38 -05:00
AbhishekKr bb76eae1b5 [python alternative] pyspark require Python2, failing if system default is Py3 from shell.py
Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py
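
A startup check of this kind can be sketched as follows. This is an assumed illustration, not the code added to shell.py; `require_major_version` is a hypothetical helper, and the optional `version_info` argument exists only so the check can be exercised without switching interpreters.

```python
import sys


def require_major_version(required, version_info=None):
    """Raise if the interpreter's major version differs from `required`."""
    info = sys.version_info if version_info is None else version_info
    # Index access works on every interpreter; the named attribute
    # (version_info.major) only exists from Python 2.7 onward.
    major = info[0]
    if major != required:
        raise RuntimeError(
            "Python %d is required, but found Python %d" % (required, major)
        )
    return major
```

Reading the major version by index is one way to keep such a check working on Python 2.6, a concern an earlier commit in this log also mentions for the PySpark shell.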

Author: AbhishekKr <abhikumar163@gmail.com>

Closes #399 from abhishekkr/pyspark_shell and squashes the following commits:

134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py
2014-04-16 19:05:40 -07:00