Commit graph

6490 commits

Author SHA1 Message Date
Patrick Wendell a1cd185122 Merge pull request #496 from pwendell/master
Fix bug in worker clean-up in UI

Introduced in d5a96fec (/cc @aarondav).

This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
2014-01-22 19:37:29 -08:00
Patrick Wendell 034dce2a7e Merge pull request #447 from CodingCat/SPARK-1027
fix for SPARK-1027

fix for SPARK-1027  (https://spark-project.atlassian.net/browse/SPARK-1027)

FIXES

1. change sparkhome from String to Option(String) in ApplicationDesc

2. remove sparkhome parameter in LaunchExecutor message

3. adjust involved files
2014-01-22 18:58:02 -08:00
Patrick Wendell 6285513147 Fix bug in worker clean-up in UI
Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
2014-01-22 18:19:52 -08:00
CodingCat 2b3c461451 refactor sparkHome to val
clean code
2014-01-22 20:20:46 -05:00
Patrick Wendell 3184facdc5 Merge pull request #495 from srowen/GraphXCommonsMathDependency
Fix graphx Commons Math dependency

`graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`. However the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts. But, I can tell you this isn't true as of a month or so ago. Building versus recent Hadoop would fail. (That's how we noticed.)

The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where newer 3.x releases are. Drop-in replacement, but different artifact and package name. Changing this only usage to `commons-math3` works, tests pass, and isn't surprising that it does, so is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.)

It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.
2014-01-22 15:45:04 -08:00
Sean Owen 4476398f7d Also add graphx commons-math3 dependeny in sbt build 2014-01-22 22:40:41 +00:00
Patrick Wendell a1238bb5fc Merge pull request #492 from skicavs/master
fixed job name and usage information for the JavaSparkPi example
2014-01-22 14:32:59 -08:00
Sean Owen fd0c5b8c57 Depend on Commons Math explicitly instead of accidentally getting it from Hadoop (which stops working in 2.2.x) and also use the newer commons-math3 2014-01-22 22:25:49 +00:00
Patrick Wendell 576c4a4c50 Merge pull request #478 from sryza/sandy-spark-1033
SPARK-1033. Ask for cores in Yarn container requests

Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.
2014-01-22 14:10:07 -08:00
Matei Zaharia 5bcfd79811 Merge pull request #493 from kayousterhout/double_add
Fixed bug where task set managers are added to queue twice

@mateiz can you verify that this is a bug and wasn't intentional? (90a04dab8d (diff-7fa4f84a961750c374f2120ca70e96edR551))

This bug leads to a small performance hit because task
set managers will get offered each rejected resource
offer twice, but doesn't lead to any incorrect functionality.

Thanks to @hdc1112 for pointing this out.
2014-01-22 14:05:48 -08:00
Matei Zaharia d009b17d13 Merge pull request #315 from rezazadeh/sparsesvd
Sparse SVD

# Singular Value Decomposition
Given an *m x n* matrix *A*, compute matrices *U, S, V* such that

*A = U * S * V^T*

There is no restriction on m, but we require n^2 doubles to fit in memory.
Further, n should be less than m.

The decomposition is computed by first computing *A^TA = V S^2 V^T*,
computing svd locally on that (since n x n is small),
from which we recover S and V.
Then we compute U via easy matrix multiplication
as *U =  A * V * S^-1*

Only singular vectors associated with the largest k singular values
If there are k such values, then the dimensions of the return will be:

* *S* is *k x k* and diagonal, holding the singular values on diagonal.
* *U* is *m x k* and satisfies U^T*U = eye(k).
* *V* is *n x k* and satisfies V^TV = eye(k).

All input and output is expected in sparse matrix format, 0-indexed
as tuples of the form ((i,j),value) all in RDDs.

# Testing
Tests included. They test:
- Decomposition promise (A = USV^T)
- For small matrices, output is compared to that of jblas
- Rank 1 matrix test included
- Full Rank matrix test included
- Middle-rank matrix forced via k included

# Example Usage

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixyEntry

// Load and parse the data file
val data = sc.textFile("mllib/data/als/test.data").map { line =>
      val parts = line.split(',')
      MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4

// recover top 1 singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)

println("singular values = " + decomposed.S.data.toArray.mkString)

# Documentation
Added to docs/mllib-guide.md
2014-01-22 14:01:30 -08:00
Kay Ousterhout 19da82c50f Fixed bug where task set managers are added to queue twice
This bug leads to a small performance hit because task
set managers will get offered each rejected resource
offer twice, but doesn't lead to any incorrect functionality.
2014-01-22 09:52:12 -08:00
Kevin Mader 36f9a64ec9 fixed job name and usage information for the JavaSparkPi example 2014-01-22 15:58:23 +01:00
Henry Saputra 90ea9d5a8f Replace the code to check for Option != None with Option.isDefined call in Scala code.
This hopefully will make the code cleaner.
2014-01-21 23:22:10 -08:00
Reynold Xin 749f842827 Merge pull request #489 from ash211/patch-6
Clarify spark.default.parallelism

It's the task count across the cluster, not per worker, per machine, per core, or anything else.
2014-01-21 14:53:49 -08:00
Andrew Ash 069bb94206 Clarify spark.default.parallelism
It's the task count across the cluster, not per worker, per machine, per core, or anything else.
2014-01-21 14:49:35 -08:00
Reynold Xin f8544981a6 Merge pull request #469 from ajtulloch/use-local-spark-context-in-tests-for-mllib
[MLlib] Use a LocalSparkContext trait in test suites

Replaces the 9 instances of

```scala
class XXXSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll() {
    sc = new SparkContext("local", "test")
  }

  override def afterAll() {
    sc.stop()
    System.clearProperty("spark.driver.port")
  }
```

with

```scala
class XXXSuite extends FunSuite with LocalSparkContext {
```
2014-01-21 10:49:54 -08:00
Andrew Tulloch 3a067b4a76 Fixed import order 2014-01-21 13:36:53 +00:00
Sandy Ryza adf42611f1 Incorporate Tom's comments - update doc and code to reflect that core requests may not always be honored 2014-01-21 00:38:02 -08:00
Patrick Wendell 77b986f661 Merge pull request #480 from pwendell/0.9-fixes
Handful of 0.9 fixes

This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate.

@mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in #477 and follow up discussion with him.
2014-01-21 00:09:42 -08:00
Patrick Wendell a9bcc980b6 Style clean-up 2014-01-21 00:05:28 -08:00
Patrick Wendell c67d3d8beb Merge pull request #484 from tdas/run-example-fix
Made run-example respect SPARK_JAVA_OPTS and SPARK_MEM.

bin/run-example scripts was not passing Java properties set through the SPARK_JAVA_OPTS to the example. This is important for examples like Twitter** as the Twitter authentication information must be set through java properties. Hence added the same JAVA_OPTS code in run-example as it is in bin/spark-class script.

Also added SPARK_MEM, in case someone wants to run the example with different amounts of memory. This can be removed if it is not tune with the intended semantics of the run-example scripts.

@matei Please check this soon I want this to go in 0.9-rc4
2014-01-20 23:34:35 -08:00
Tathagata Das 65869f843d Removed SPARK_MEM from run-examples. 2014-01-20 23:15:28 -08:00
Patrick Wendell a917a87e02 Adding small code comment 2014-01-20 23:11:45 -08:00
Reynold Xin 6b4eed779b Merge pull request #449 from CrazyJvm/master
SPARK-1028 : fix "set MASTER automatically fails" bug.

spark-shell intends to set MASTER automatically if we do not provide the option when we start the shell , but there's a problem.
The condition is "if [[ "x" != "x$SPARK_MASTER_IP" && "y" != "y$SPARK_MASTER_PORT" ]];" we sure will set SPARK_MASTER_IP explicitly, the SPARK_MASTER_PORT option, however, we probably do not set just using spark default port 7077. So if we do not set SPARK_MASTER_PORT, the condition will never be true. We should just use default port if users do not set port explicitly I think.
2014-01-20 22:35:45 -08:00
Patrick Wendell 0367981d47 Merge pull request #482 from tdas/streaming-example-fix
Added StreamingContext.awaitTermination to streaming examples

StreamingContext.start() currently starts a non-daemon thread which prevents termination of a Spark Streaming program even if main function has exited. Since the expected behavior of a streaming program is to run until explicitly killed, this was sort of fine when spark streaming applications are launched from the command line. However, when launched in Yarn-standalone mode, this did not work as the driver effectively got terminated when the main function exits. So SparkStreaming examples did not work on Yarn.

This addition to the examples ensures that the examples work on Yarn and also ensures that everyone learns that StreamingContext.awaitTermination() being necessary for SparkStreaming programs to wait.

The true bug-fix of making sure all threads by Spark Streaming are daemon threads is left for post-0.9.
2014-01-20 22:25:50 -08:00
Reynold Xin 7373ffb5e7 Merge pull request #483 from pwendell/gitignore
Restricting /lib to top level directory in .gitignore

This patch was proposed by Sean Mackrory.
2014-01-20 21:44:29 -08:00
Tathagata Das e0b741d07a Made run-example respect SPARK_JAVA_OPTS and SPARK_MEM. 2014-01-20 20:48:59 -08:00
Patrick Wendell e437069dce Restricting /lib to top level directory in .gitignore
This patch was proposed by Sean Mackrory.
2014-01-20 20:39:30 -08:00
Tathagata Das 2e95174c45 Added StreamingContext.awaitTermination to streaming examples. 2014-01-20 20:25:04 -08:00
Patrick Wendell d46df96de3 Avoid matching attempt files in the checkpoint 2014-01-20 20:03:23 -08:00
Patrick Wendell de526ad527 Remove shuffle files if they are still present on a machine. 2014-01-20 19:11:22 -08:00
Patrick Wendell f84400e86c Fixing speculation bug 2014-01-20 19:05:03 -08:00
Patrick Wendell c324ac10ee Force use of LZF when spilling data 2014-01-20 19:00:48 -08:00
Patrick Wendell 1b299142a8 Bug fix for reporting of spill output 2014-01-20 18:34:00 -08:00
Patrick Wendell 54867e9566 Minor fixes 2014-01-20 18:33:21 -08:00
Patrick Wendell cdb003e376 Removing docs on akka options 2014-01-20 16:40:58 -08:00
Sandy Ryza 3e85b87d90 SPARK-1033. Ask for cores in Yarn container requests 2014-01-20 14:42:32 -08:00
CodingCat 29f4b6a2d9 fix for SPARK-1027
change TestClient & Worker to Some("xxx")

kill manager if it is started

remove unnecessary .get when fetch "SPARK_HOME" values
2014-01-20 02:50:30 -05:00
CodingCat f9a95d6736 executor creation failed should not make the worker restart 2014-01-20 02:50:30 -05:00
Patrick Wendell 792d9084e2 Merge pull request #470 from tgravescs/fix_spark_examples_yarn
Only log error on missing jar to allow spark examples to jar.

Right now to run the spark examples on Yarn you have to use the --addJars option and put the jar in hdfs.  To make that nicer  so the user doesn't have to specify the --addJars option change it to simply log an error instead of throwing.
2014-01-19 11:33:11 -08:00
Patrick Wendell 256a3553c4 Merge pull request #458 from tdas/docs-update
Updated java API docs for streaming, along with very minor changes in the code examples.

Docs updated for:
Scala: StreamingContext, DStream, PairDStreamFunctions
Java: JavaStreamingContext, JavaDStream, JavaPairDStream

Example updated:
JavaQueueStream: Not use deprecated method
ActorWordCount: Use the public interface the right way.
2014-01-19 10:29:54 -08:00
Thomas Graves dd56b2125e update comment 2014-01-19 12:21:39 -06:00
Thomas Graves ceb79a3931 Only log error on missing jar to allow spark examples to jar. 2014-01-19 12:16:58 -06:00
Andrew Tulloch 720836a761 LocalSparkContext for MLlib 2014-01-19 17:51:00 +00:00
Yinan Li 584323c6b1 Addressed comments from Reynold
Signed-off-by: Yinan Li <liyinan926@gmail.com>
2014-01-18 21:28:17 -08:00
Patrick Wendell fe8a3546f4 Merge pull request #459 from srowen/UpdaterL2Regularization
Correct L2 regularized weight update with canonical form

Per thread on the user@ mailing list, and comments from Ameet, I believe the weight update for L2 regularization needs to be corrected. See http://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAH3_EVMetuQuhj3__NdUniDLc4P-FMmmrmxw9TS14or8nT4BNQ%40mail.gmail.com%3E
2014-01-18 16:29:23 -08:00
Patrick Wendell 73dfd42fba Merge pull request #437 from mridulm/master
Minor api usability changes

- Expose checkpoint directory - since it is autogenerated now
- null check for jars
- Expose SparkHadoopUtil : so that configuration creation is abstracted even from user code to avoid duplication of functionality already in spark.
2014-01-18 16:23:56 -08:00
Patrick Wendell 4c16f79ce4 Merge pull request #426 from mateiz/py-ml-tests
Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+)

We disabled these earlier because Jenkins didn't have these versions.
2014-01-18 16:21:43 -08:00
Patrick Wendell bf5699543b Merge pull request #462 from mateiz/conf-file-fix
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911ce9a)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-01-18 16:20:00 -08:00