Commit graph

6313 commits

Author SHA1 Message Date
Reynold Xin f16c21e22f Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined
Replace the check for None Option with isDefined and isEmpty in Scala code

Propose to replace the Scala check for Option "!= None" with Option.isDefined and "=== None" with Option.isEmpty.

I think this, using method call if possible then operator function plus argument, will make the Scala code easier to read and understand.

Pass compile and tests.
2014-01-27 14:24:06 -08:00
Sean Owen f67ce3e229 Merge pull request #460 from srowen/RandomInitialALSVectors
Choose initial user/item vectors uniformly on the unit sphere

...rather than within the unit square to possibly avoid bias in the initial state and improve convergence.

The current implementation picks the N vector elements uniformly at random from [0,1). This means they all point into one quadrant of the vector space. As N gets just a little large, the vector tend strongly to point into the "corner", towards (1,1,1...,1). The vectors are not unit vectors either.

I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This gets you uniform random choices on the unit sphere which is more what's of interest here. It has worked a little better for me in the past.

This is pretty minor but wanted to warm up suggesting a few tweaks to ALS.
Please excuse my Scala, pretty new to it.

Author: Sean Owen <sowen@cloudera.com>

== Merge branch commits ==

commit 492b13a7469e5a4ed7591ee8e56d8bd7570dfab6
Author: Sean Owen <sowen@cloudera.com>
Date:   Mon Jan 27 08:05:25 2014 +0000

    Style: spaces around binary operators

commit ce2b5b5a4fefa0356875701f668f01f02ba4d87e
Author: Sean Owen <sowen@cloudera.com>
Date:   Sun Jan 19 22:50:03 2014 +0000

    Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460

commit b6f7a8a61643a8209e8bc662e8e81f2d15c710c7
Author: Sean Owen <sowen@cloudera.com>
Date:   Sat Jan 18 15:54:42 2014 +0000

    Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence
2014-01-27 11:15:51 -08:00
Reynold Xin c40619d487 Merge pull request #504 from JoshRosen/SPARK-1025
Fix PySpark hang when input files are deleted (SPARK-1025)

This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.
2014-01-25 22:41:30 -08:00
Reynold Xin c66a2ef1c2 Merge pull request #511 from JoshRosen/SPARK-1040
Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)

This fixes [SPARK-1040](https://spark-project.atlassian.net/browse/SPARK-1040), an issue where JavaPairRDD.collectAsMap() could sometimes fail with ClassCastException.  I applied the same fix to the Spark Streaming Java APIs.  The commit message describes the fix in more detail.

I also increased the verbosity of JUnit test output under SBT to make it easier to verify that the Java tests are actually running.
2014-01-25 22:36:07 -08:00
Josh Rosen 740e865f40 Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)
This fixes an issue where collectAsMap() could
fail when called on a JavaPairRDD that was derived
by transforming a non-JavaPairRDD.

The root problem was that we were creating the
JavaPairRDD's ClassTag by casting a
ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]].
To fix this, I cast a ClassTag[Tuple2[_, _]]
instead, since this actually produces a ClassTag
of the appropriate type because ClassTags don't
capture type parameters:

scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res8: Boolean = true

scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res9: Boolean = false
2014-01-25 16:41:12 -08:00
Josh Rosen 531d9d7576 Increase JUnit test verbosity under SBT.
Upgrade junit-interface plugin from 0.9 to 0.10.

I noticed that the JavaAPISuite tests didn't
appear to display any output locally or under
Jenkins, making it difficult to know whether they
were running.  This change increases the verbosity
to more closely match the ScalaTest tests.
2014-01-25 16:32:44 -08:00
Patrick Wendell 05be704774 Merge pull request #505 from JoshRosen/SPARK-1026
Deprecate mapPartitionsWithSplit in PySpark (SPARK-1026)

This commit deprecates `mapPartitionsWithSplit` in PySpark (see [SPARK-1026](https://spark-project.atlassian.net/browse/SPARK-1026) and removes the remaining references to it from the docs.
2014-01-23 20:53:18 -08:00
Josh Rosen 4cebb79c9f Deprecate mapPartitionsWithSplit in PySpark.
Also, replace the last reference to it in the docs.

This fixes SPARK-1026.
2014-01-23 20:01:36 -08:00
Patrick Wendell 3d6e754193 Merge pull request #503 from pwendell/master
Fix bug on read-side of external sort when using Snappy.

This case wasn't handled correctly and this patch fixes it.
2014-01-23 19:47:00 -08:00
Patrick Wendell ff44732171 Minor fix 2014-01-23 19:23:12 -08:00
Patrick Wendell c3196171f3 Merge pull request #502 from pwendell/clone-1
Remove Hadoop object cloning and warn users making Hadoop RDD's.

The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in various file formats,
including Avro and another custom format.

This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.
2014-01-23 19:11:59 -08:00
Patrick Wendell cad3002fea Merge pull request #501 from JoshRosen/cartesian-rdd-fixes
Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034

This pull request fixes two bugs in PySpark's `cartesian()` method:

- [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception
- [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result

The JIRAs have more details describing the fixes.
2014-01-23 19:08:34 -08:00
Patrick Wendell 268ecbd231 Minor changes after auditing diff from earlier version 2014-01-23 18:30:11 -08:00
Josh Rosen f83068497b Fix for SPARK-1025: PySpark hang on missing files. 2014-01-23 18:24:51 -08:00
Patrick Wendell c58d4ea3d4 Response to Matei's review 2014-01-23 18:12:40 -08:00
Patrick Wendell 0213b4032a Fix bug on read-side of external sort when using Snappy.
This case wasn't handled correctly and this patch fixes it.
2014-01-23 18:04:55 -08:00
Patrick Wendell 7101017803 Remove Hadoop object cloning and warn users making Hadoop RDD's.
The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in verious file formats,
including Avro and another custom format.

This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.
2014-01-23 17:39:23 -08:00
Josh Rosen 61569906cc Fix SPARK-978: ClassCastException in PySpark cartesian. 2014-01-23 15:09:19 -08:00
Josh Rosen 0035dbbc81 Fix SPARK-1034: Py4JException on PySpark Cartesian Result 2014-01-23 13:05:59 -08:00
Josh Rosen fad6aacfb0 Merge pull request #406 from eklavya/master
Extending Java API coverage

Hi,

I have added three new methods to JavaRDD.

Please review and merge.
2014-01-23 11:14:15 -08:00
Reynold Xin a2b47dae66 Merge pull request #499 from jianpingjwang/dev1
Replace commons-math with jblas in SVDPlusPlus
2014-01-23 10:48:26 -08:00
eklavya 60e7457266 fixed ClassTag in mapPartitions 2014-01-23 17:40:36 +05:30
Jianping J Wang 19a01c1b1d Add jblas dependency 2014-01-23 19:54:01 +08:00
Jianping J Wang a5a513e25e Add jblas dependency 2014-01-23 19:48:39 +08:00
Jianping J Wang cc0fd33177 Replace commons-math with jblas 2014-01-23 19:44:30 +08:00
Patrick Wendell a1cd185122 Merge pull request #496 from pwendell/master
Fix bug in worker clean-up in UI

Introduced in d5a96fec (/cc @aarondav).

This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
2014-01-22 19:37:29 -08:00
Patrick Wendell 034dce2a7e Merge pull request #447 from CodingCat/SPARK-1027
fix for SPARK-1027

fix for SPARK-1027  (https://spark-project.atlassian.net/browse/SPARK-1027)

FIXES

1. change sparkhome from String to Option(String) in ApplicationDesc

2. remove sparkhome parameter in LaunchExecutor message

3. adjust involved files
2014-01-22 18:58:02 -08:00
Patrick Wendell 6285513147 Fix bug in worker clean-up in UI
Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
2014-01-22 18:19:52 -08:00
CodingCat 2b3c461451 refactor sparkHome to val
clean code
2014-01-22 20:20:46 -05:00
Patrick Wendell 3184facdc5 Merge pull request #495 from srowen/GraphXCommonsMathDependency
Fix graphx Commons Math dependency

`graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`. However the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts. But, I can tell you this isn't true as of a month or so ago. Building versus recent Hadoop would fail. (That's how we noticed.)

The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where newer 3.x releases are. Drop-in replacement, but different artifact and package name. Changing this only usage to `commons-math3` works, tests pass, and isn't surprising that it does, so is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.)

It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.
2014-01-22 15:45:04 -08:00
Sean Owen 4476398f7d Also add graphx commons-math3 dependeny in sbt build 2014-01-22 22:40:41 +00:00
Patrick Wendell a1238bb5fc Merge pull request #492 from skicavs/master
fixed job name and usage information for the JavaSparkPi example
2014-01-22 14:32:59 -08:00
Sean Owen fd0c5b8c57 Depend on Commons Math explicitly instead of accidentally getting it from Hadoop (which stops working in 2.2.x) and also use the newer commons-math3 2014-01-22 22:25:49 +00:00
Patrick Wendell 576c4a4c50 Merge pull request #478 from sryza/sandy-spark-1033
SPARK-1033. Ask for cores in Yarn container requests

Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.
2014-01-22 14:10:07 -08:00
Matei Zaharia 5bcfd79811 Merge pull request #493 from kayousterhout/double_add
Fixed bug where task set managers are added to queue twice

@mateiz can you verify that this is a bug and wasn't intentional? (90a04dab8d (diff-7fa4f84a961750c374f2120ca70e96edR551))

This bug leads to a small performance hit because task
set managers will get offered each rejected resource
offer twice, but doesn't lead to any incorrect functionality.

Thanks to @hdc1112 for pointing this out.
2014-01-22 14:05:48 -08:00
Matei Zaharia d009b17d13 Merge pull request #315 from rezazadeh/sparsesvd
Sparse SVD

# Singular Value Decomposition
Given an *m x n* matrix *A*, compute matrices *U, S, V* such that

*A = U * S * V^T*

There is no restriction on m, but we require n^2 doubles to fit in memory.
Further, n should be less than m.

The decomposition is computed by first computing *A^TA = V S^2 V^T*,
computing svd locally on that (since n x n is small),
from which we recover S and V.
Then we compute U via easy matrix multiplication
as *U =  A * V * S^-1*

Only singular vectors associated with the largest k singular values
If there are k such values, then the dimensions of the return will be:

* *S* is *k x k* and diagonal, holding the singular values on diagonal.
* *U* is *m x k* and satisfies U^T*U = eye(k).
* *V* is *n x k* and satisfies V^TV = eye(k).

All input and output is expected in sparse matrix format, 0-indexed
as tuples of the form ((i,j),value) all in RDDs.

# Testing
Tests included. They test:
- Decomposition promise (A = USV^T)
- For small matrices, output is compared to that of jblas
- Rank 1 matrix test included
- Full Rank matrix test included
- Middle-rank matrix forced via k included

# Example Usage

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixyEntry

// Load and parse the data file
val data = sc.textFile("mllib/data/als/test.data").map { line =>
      val parts = line.split(',')
      MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4

// recover top 1 singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)

println("singular values = " + decomposed.S.data.toArray.mkString)

# Documentation
Added to docs/mllib-guide.md
2014-01-22 14:01:30 -08:00
Kay Ousterhout 19da82c50f Fixed bug where task set managers are added to queue twice
This bug leads to a small performance hit because task
set managers will get offered each rejected resource
offer twice, but doesn't lead to any incorrect functionality.
2014-01-22 09:52:12 -08:00
Kevin Mader 36f9a64ec9 fixed job name and usage information for the JavaSparkPi example 2014-01-22 15:58:23 +01:00
Henry Saputra 90ea9d5a8f Replace the code to check for Option != None with Option.isDefined call in Scala code.
This hopefully will make the code cleaner.
2014-01-21 23:22:10 -08:00
Reynold Xin 749f842827 Merge pull request #489 from ash211/patch-6
Clarify spark.default.parallelism

It's the task count across the cluster, not per worker, per machine, per core, or anything else.
2014-01-21 14:53:49 -08:00
Andrew Ash 069bb94206 Clarify spark.default.parallelism
It's the task count across the cluster, not per worker, per machine, per core, or anything else.
2014-01-21 14:49:35 -08:00
Reynold Xin f8544981a6 Merge pull request #469 from ajtulloch/use-local-spark-context-in-tests-for-mllib
[MLlib] Use a LocalSparkContext trait in test suites

Replaces the 9 instances of

```scala
class XXXSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll() {
    sc = new SparkContext("local", "test")
  }

  override def afterAll() {
    sc.stop()
    System.clearProperty("spark.driver.port")
  }
```

with

```scala
class XXXSuite extends FunSuite with LocalSparkContext {
```
2014-01-21 10:49:54 -08:00
Andrew Tulloch 3a067b4a76 Fixed import order 2014-01-21 13:36:53 +00:00
Sandy Ryza adf42611f1 Incorporate Tom's comments - update doc and code to reflect that core requests may not always be honored 2014-01-21 00:38:02 -08:00
Patrick Wendell 77b986f661 Merge pull request #480 from pwendell/0.9-fixes
Handful of 0.9 fixes

This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate.

@mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in #477 and follow up discussion with him.
2014-01-21 00:09:42 -08:00
Patrick Wendell a9bcc980b6 Style clean-up 2014-01-21 00:05:28 -08:00
Patrick Wendell c67d3d8beb Merge pull request #484 from tdas/run-example-fix
Made run-example respect SPARK_JAVA_OPTS and SPARK_MEM.

bin/run-example scripts was not passing Java properties set through the SPARK_JAVA_OPTS to the example. This is important for examples like Twitter** as the Twitter authentication information must be set through java properties. Hence added the same JAVA_OPTS code in run-example as it is in bin/spark-class script.

Also added SPARK_MEM, in case someone wants to run the example with different amounts of memory. This can be removed if it is not tune with the intended semantics of the run-example scripts.

@matei Please check this soon I want this to go in 0.9-rc4
2014-01-20 23:34:35 -08:00
Tathagata Das 65869f843d Removed SPARK_MEM from run-examples. 2014-01-20 23:15:28 -08:00
Patrick Wendell a917a87e02 Adding small code comment 2014-01-20 23:11:45 -08:00
Reynold Xin 6b4eed779b Merge pull request #449 from CrazyJvm/master
SPARK-1028 : fix "set MASTER automatically fails" bug.

spark-shell intends to set MASTER automatically if we do not provide the option when we start the shell , but there's a problem.
The condition is "if [[ "x" != "x$SPARK_MASTER_IP" && "y" != "y$SPARK_MASTER_PORT" ]];" we sure will set SPARK_MASTER_IP explicitly, the SPARK_MASTER_PORT option, however, we probably do not set just using spark default port 7077. So if we do not set SPARK_MASTER_PORT, the condition will never be true. We should just use default port if users do not set port explicitly I think.
2014-01-20 22:35:45 -08:00