Commit graph

6530 commits

Author SHA1 Message Date
Reynold Xin 353ac6b4fa Ignore RateLimitedOutputStreamSuite for now.
This test has been flaky. We can re-enable it after @tdas has a chance to look at it.

Author: Reynold Xin <rxin@apache.org>

Closes #54 from rxin/ratelimit and squashes the following commits:

1a12198 [Reynold Xin] Ignore RateLimitedOutputStreamSuite for now.
2014-03-02 14:27:19 -08:00
Aaron Davidson 46bcb9551e SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID
Previously, ZooKeeperPersistenceEngine would crash the whole Master process if
there was stored data from a prior Spark version. Now, we just delete these files.

Author: Aaron Davidson <aaron@databricks.com>

Closes #4 from aarondav/zookeeper2 and squashes the following commits:

fa8b40f [Aaron Davidson] SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID
2014-03-02 01:00:42 -08:00
Patrick Wendell 1fd2bfd3dd Remove remaining references to incubation
This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #51 from pwendell/tlp and squashes the following commits:

d553b1b [Patrick Wendell] Remove remaining references to incubation
2014-03-02 01:00:16 -08:00
Binh Nguyen b70823c91d Update io.netty from 4.0.13.Final to 4.0.17.Final
This update contains a lot of bug fixes and some performance improvements.
It is also binary compatible with the current 4.0.13.Final.

For more information: http://netty.io/news/2014/02/25/4-0-17-Final.html

Author: Binh Nguyen <ngbinh@gmail.com>

Author: Binh Nguyen <ngbinh@gmail.com>

Closes #41 from ngbinh/master and squashes the following commits:

a9498f4 [Binh Nguyen] update io.netty to 4.0.17.Final
2014-03-02 00:48:50 -08:00
Michael Armbrust 012bd5fbc9 Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
This allows developers to pass options (such as -D) to sbt. I also modified the SparkBuild to ensure Spark-specific properties are propagated to forked test JVMs.

Author: Michael Armbrust <michael@databricks.com>

Closes #14 from marmbrus/sbtScripts and squashes the following commits:

c008b18 [Michael Armbrust] Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
2014-03-02 00:35:23 -08:00
DB Tsai 6fc76e49c1 Initialized the regVal for first iteration in SGD optimizer
Ported from https://github.com/apache/incubator-spark/pull/633

In runMiniBatchSGD, the regVal (for the first iteration) should be initialized
as the sum of squared weights for an L2 update; for an L1 update, the analogous term (sum of absolute weights) is used.

This may not matter much for SGD itself, since the updater doesn't take the loss
as a parameter when computing the new weights, but it does give us the correct loss history.
For the LBFGS optimizer we implemented, however, a correct loss that includes regVal is
crucial for finding the new weights.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #40 from dbtsai/dbtsai-smallRegValFix and squashes the following commits:

77d47da [DB Tsai] In runMiniBatchSGD, the regVal (for 1st iter) should be initialized as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed.
2014-03-02 00:31:59 -08:00
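A minimal sketch of the idea in the commit above (the helper name and signature are illustrative, not the MLlib code; it uses the conventional L2 and L1 regularization terms):

```scala
// Illustrative helper, not the actual MLlib code: compute the regularization term
// of the initial weights before the first iteration so the loss history is correct
// from iteration one.
def initialRegVal(weights: Array[Double], regParam: Double, useL2: Boolean): Double =
  if (useL2) 0.5 * regParam * weights.map(w => w * w).sum // L2: based on squared weights
  else regParam * weights.map(math.abs).sum               // L1: based on absolute weights
```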
CodingCat 3a8b698e96 [SPARK-1100] prevent Spark from overwriting directory silently
Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100).

The current saveAsTextFile/saveAsSequenceFile silently overwrites the output directory if it already exists. This behaviour is not desirable because

overwriting the data silently is not user-friendly

if the partition number of the two write operations differs, the output directory will contain results generated by both runs

My fix includes:

add some new APIs with a flag for users to decide whether to overwrite the directory:
if the flag is set to true, the output directory is deleted first and the new data is then written, so the output directory never contains results from multiple runs;

if the flag is set to false, Spark will throw an exception if the output directory already exists

changed the Java API part accordingly

the default behaviour is overwriting

Two questions

should we deprecate the old APIs without such a flag?

I noticed that Spark Streaming also calls these APIs; I assume we don't need to change the related part in streaming? @tdas

Author: CodingCat <zhunansjtu@gmail.com>

Closes #11 from CodingCat/SPARK-1100 and squashes the following commits:

6a4e3a3 [CodingCat] code clean
ef2d43f [CodingCat] add new test cases and code clean
ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat
ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory
2014-03-01 17:27:54 -08:00
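A hypothetical sketch of the flag described in the commit above (prepareOutputDir and its signature are invented for illustration; only the Hadoop FileSystem calls are real):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper, not the actual RDD API: honour an explicit overwrite flag
// instead of silently clobbering a previous run's output.
def prepareOutputDir(dir: String, overwrite: Boolean, conf: Configuration): Unit = {
  val path = new Path(dir)
  val fs = FileSystem.get(path.toUri, conf)
  if (fs.exists(path)) {
    if (overwrite) fs.delete(path, true) // recursively remove the stale directory first
    else throw new RuntimeException(s"Output directory $dir already exists")
  }
}
```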
CodingCat fe195ae113 [SPARK-1150] fix repo location in create script (re-open)
reopen for https://spark-project.atlassian.net/browse/SPARK-1150

Author: CodingCat <zhunansjtu@gmail.com>

Closes #52 from CodingCat/script_fixes and squashes the following commits:

fc05a71 [CodingCat] fix repo location in create script
2014-03-01 17:24:53 -08:00
Patrick Wendell ec992e1822 Revert "[SPARK-1150] fix repo location in create script"
This reverts commit 9aa0957118.
2014-03-01 17:15:38 -08:00
Mark Grover 9aa0957118 [SPARK-1150] fix repo location in create script
https://spark-project.atlassian.net/browse/SPARK-1150

fix the repo location in create_release script

Author: Mark Grover <mark@apache.org>

Closes #48 from CodingCat/script_fixes and squashes the following commits:

01f4bf7 [Mark Grover] Fixing some nitpicks
d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY
2014-03-01 16:21:22 -08:00
Kay Ousterhout 556c56689b [SPARK-979] Randomize order of offers.
This commit randomizes the order of resource offers to avoid scheduling
all tasks on the same small set of machines.

This is a much simpler solution to SPARK-979 than #7.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #27 from kayousterhout/randomize and squashes the following commits:

435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.
2014-03-01 11:24:22 -08:00
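A minimal sketch of the approach (the WorkerOffer case class and method shape are illustrative, not the scheduler's actual code):

```scala
import scala.util.Random

// Illustrative only: shuffle the resource offers before assigning tasks so work
// is not always packed onto the machines that happen to appear first in the list.
case class WorkerOffer(executorId: String, host: String, cores: Int)

def randomizedOffers(offers: Seq[WorkerOffer]): Seq[WorkerOffer] =
  Random.shuffle(offers)
```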
Thomas Graves 4ba3f70a4e SPARK-1151: Update dev merge script to use spark.git instead of incubator-spark
Author: Thomas Graves <tgraves@apache.org>

Closes #47 from tgravescs/fix_merge_script and squashes the following commits:

8209ab1 [Thomas Graves] Update dev merge script to use spark.git instead of incubator-spark
2014-02-28 18:28:33 -08:00
Sandy Ryza 46dff34458 SPARK-1051. On YARN, executors don't doAs submitting user
This reopens https://github.com/apache/incubator-spark/pull/538 against the new repo

Author: Sandy Ryza <sandy@cloudera.com>

Closes #29 from sryza/sandy-spark-1051 and squashes the following commits:

708ce49 [Sandy Ryza] SPARK-1051. doAs submitting user in YARN
2014-02-28 12:43:01 -06:00
Sandy Ryza 5f419bf9f4 SPARK-1032. If Yarn app fails before registering, app master stays around long after

This reopens https://github.com/apache/incubator-spark/pull/648 against the new repo.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #28 from sryza/sandy-spark-1032 and squashes the following commits:

5953f50 [Sandy Ryza] SPARK-1032. If Yarn app fails before registering, app master stays around long after
2014-02-28 09:40:47 -06:00
Kay Ousterhout edf8a56ab7 Remove BlockFetchTracker trait
This trait seems to have been created a while ago when there
were multiple implementations; now that there's just one, I think it
makes sense to merge it into the BlockFetcherIterator trait.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #39 from kayousterhout/remove_tracker and squashes the following commits:

8173939 [Kay Ousterhout] Remote BlockFetchTracker.
2014-02-27 21:52:55 -08:00
Reynold Xin 40e080a68a Removed reference to incubation in Spark user docs.
Author: Reynold Xin <rxin@apache.org>

Closes #2 from rxin/docs and squashes the following commits:

08bbd5f [Reynold Xin] Removed reference to incubation in Spark user docs.
2014-02-27 21:13:22 -08:00
Patrick Wendell c42557be32 [HOTFIX] Patching maven build after #6 (SPARK-1121).
That patch removed the Maven avro declaration but didn't remove the
actual dependency in core. /cc @scrapcodes

Author: Patrick Wendell <pwendell@gmail.com>

Closes #37 from pwendell/master and squashes the following commits:

0ef3008 [Patrick Wendell] [HOTFIX] Patching maven build after #6 (SPARK-1121).
2014-02-27 15:06:20 -08:00
Sean Owen 12bbca2065 SPARK 1084.1 (resubmitted)
(Ported from https://github.com/apache/incubator-spark/pull/637 )

Author: Sean Owen <sowen@cloudera.com>

Closes #31 from srowen/SPARK-1084.1 and squashes the following commits:

6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it
f35b833 [Sean Owen] Fix two misc javadoc problems
254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit
5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates
007762b [Sean Owen] Remove dead scaladoc links
b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>
2014-02-27 11:12:21 -08:00
Raymond Liu aace2c097e Show Master status on UI page
For standalone HA mode, a status is useful to identify the current master; it is already available in JSON format too.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #24 from colorant/status and squashes the following commits:

df630b3 [Raymond Liu] Show Master status on UI page
2014-02-26 23:51:32 -08:00
CodingCat 345df5f4a9 [SPARK-1089] fix the regression problem on ADD_JARS in 0.9
https://spark-project.atlassian.net/browse/SPARK-1089

copied from JIRA, reported by @ash211

"Using the ADD_JARS environment variable with spark-shell used to add the jar to both the shell and the various workers. Now it only adds to the workers and importing a custom class in the shell is broken.
The workaround is to add custom jars to both ADD_JARS and SPARK_CLASSPATH.
We should fix ADD_JARS so it works properly again.
See various threads on the user list:
https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201402.mbox/%3CCAJbo4neMLiTrnm1XbyqomWmp0m+EUcg4yE-txuRGSVKOb5KLeA@mail.gmail.com%3E
(another one that doesn't appear in the archives yet titled "ADD_JARS not working on 0.9")"

The reason for this bug is two-fold:

in the current implementation of SparkILoop.scala, settings.classpath is not set properly when the process() method is invoked

the weird behaviour of Scala 2.10 (I personally think it is a bug):

if we simply set the value of a PathSettings object (like settings.classpath), isDefault is not set correctly (this flag indicates whether the variable has been modified), so the PathResolver loads the default CLASSPATH environment variable value to calculate the path (see https://github.com/scala/scala/blob/2.10.x/src/compiler/scala/tools/util/PathResolver.scala#L215)

what we have to do is to set this flag manually (e3991d97dd/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala (L884))

Author: CodingCat <zhunansjtu@gmail.com>

Closes #13 from CodingCat/SPARK-1089 and squashes the following commits:

8af81e7 [CodingCat] impose non-null settings
9aa2125 [CodingCat] code cleaning
ce36676 [CodingCat] code cleaning
e045582 [CodingCat] fix the regression problem on ADD_JARS in 0.9
2014-02-26 23:42:15 -08:00
Prashant Sharma 6ccd6c55bd SPARK-1121 Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #6 from ScrapCodes/SPARK-1121/avro-dep-fix and squashes the following commits:

9b29e34 [Prashant Sharma] Review feedback on PR
46ed2ad [Prashant Sharma] SPARK-1121-Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set
2014-02-26 23:40:49 -08:00
Xiangrui Meng 5a3ad107c0 SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom
If the seed is zero, XORShift generates all zeros, which would produce unexpected results.

JIRA: https://spark-project.atlassian.net/browse/SPARK-1129

Author: Xiangrui Meng <meng@databricks.com>

Closes #645 from mengxr/xor and squashes the following commits:

1b086ab [Xiangrui Meng] use MurmurHash3 to set seed in XORShiftRandom
45c6f16 [Xiangrui Meng] minor style change
51f4050 [Xiangrui Meng] use a predefined seed when seed is zero in XORShiftRandom
2014-02-26 23:22:30 -08:00
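A small sketch of the seed-scrambling idea from the commit above (the helper is illustrative, not the actual XORShiftRandom code): hashing the seed with MurmurHash3 means a literal 0 never becomes the all-zero internal state.

```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Illustrative helper: derive the internal state from a hash of the seed bytes so
// that seed == 0 does not produce an all-zero XORShift state (which would emit
// nothing but zeros).
def scrambleSeed(seed: Long): Long = {
  val bytes = ByteBuffer.allocate(8).putLong(seed).array()
  val lo = MurmurHash3.bytesHash(bytes)
  val hi = MurmurHash3.bytesHash(bytes.reverse)
  (hi.toLong << 32) | (lo.toLong & 0xFFFFFFFFL)
}
```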
Kay Ousterhout 71f69d66ce Remove references to ClusterScheduler (SPARK-1140)
ClusterScheduler was renamed to TaskSchedulerImpl; this commit
updates comments and tests accordingly.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #9 from kayousterhout/cluster_scheduler_death and squashes the following commits:

d6fd119 [Kay Ousterhout] Remove references to ClusterScheduler.
2014-02-26 22:52:42 -08:00
Jyotiska NK 26450351af Updated link for pyspark examples in docs
Author: Jyotiska NK <jyotiska123@gmail.com>

Closes #22 from jyotiska/pyspark_docs and squashes the following commits:

426136c [Jyotiska NK] Updated link for pyspark examples
2014-02-26 21:37:04 -08:00
Prashant Sharma 0e40e2b126 Deprecated and added a few Java API methods for the corresponding Scala APIs.
PR [402](https://github.com/apache/incubator-spark/pull/402) from incubator repo.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits:

11d0c2b [Prashant Sharma] Integer -> java.lang.Integer
737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs.
3ddc8bb [Prashant Sharma] Deprected *With functions in scala and added a few missing Java APIs
2014-02-26 21:17:44 -08:00
Reynold Xin 84f7ca1381 Removed reference to incubation in README.md.
Author: Reynold Xin <rxin@apache.org>

Closes #1 from rxin/readme and squashes the following commits:

b3a77cd [Reynold Xin] Removed reference to incubation in README.md.
2014-02-26 16:52:26 -08:00
Bouke van der Bijl 12738c1aec SPARK-1115: Catch depickling errors
This surrounds the complete worker code in a try/except block so we catch any error that arrives; an example would be depickling failing for some reason.

@JoshRosen

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #644 from bouk/catch-depickling-errors and squashes the following commits:

f0f67cc [Bouke van der Bijl] Lol indentation
0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
2014-02-26 14:51:21 -08:00
Matei Zaharia c86eec5843 SPARK-1135: fix broken anchors in docs
A recent PR that added Java vs Scala tabs for streaming also
inadvertently added some bad code to a document.ready handler, breaking
our other handler that manages scrolling to anchors correctly with the
floating top bar. As a result the section title ended up always being
hidden below the top bar. This removes the unnecessary JavaScript code.

Author: Matei Zaharia <matei@databricks.com>

Closes #3 from mateiz/doc-links and squashes the following commits:

e2a3488 [Matei Zaharia] SPARK-1135: fix broken anchors in docs
2014-02-26 11:20:16 -08:00
William Benton fbedc8eff2 SPARK-1078: Replace lift-json with json4s-jackson.
The aim of the Json4s project is to provide a common API for
Scala JSON libraries.  It is Apache-licensed, easier for
downstream distributions to package, and mostly API-compatible
with lift-json.  Furthermore, the Jackson-backed implementation
parses faster than lift-json on all but the smallest inputs.

Author: William Benton <willb@redhat.com>

Closes #582 from willb/json4s and squashes the following commits:

7ca62c4 [William Benton] Replace lift-json with json4s-jackson.
2014-02-26 10:09:50 -08:00
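A small usage sketch of the replacement library (assumes json4s-jackson on the classpath; the JSON document is made up): the AST and extraction style are largely compatible with the lift-json API being removed.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Parse into the json4s AST, pull out a field, and render it back out.
implicit val formats: Formats = DefaultFormats

val ast: JValue = parse("""{"id": "worker-1", "cores": 8, "state": "ALIVE"}""")
val cores: Int = (ast \ "cores").extract[Int]
println(compact(render(ast)))
```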
Sandy Ryza b8a1871953 SPARK-1053. Don't require SPARK_YARN_APP_JAR
It looks like this just requires taking out the checks.

I verified that, with the patch, I was able to run spark-shell through yarn without setting the environment variable.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #553 from sryza/sandy-spark-1053 and squashes the following commits:

b037676 [Sandy Ryza] SPARK-1053.  Don't require SPARK_YARN_APP_JAR
2014-02-26 10:00:02 -06:00
Raymond Liu c852201ce9 For SPARK-1082, Use Curator for ZK interaction in standalone cluster
Author: Raymond Liu <raymond.liu@intel.com>

Closes #611 from colorant/curator and squashes the following commits:

7556aa1 [Raymond Liu] Address review comments
af92e1f [Raymond Liu] Fix coding style
964f3c2 [Raymond Liu] Ignore NodeExists exception
6df2966 [Raymond Liu] Rewrite zookeeper client code with curator
2014-02-24 23:20:38 -08:00
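An illustrative sketch of the Curator-based client mentioned above (connection string, path, and retry settings are made up; this is not the actual Spark code):

```scala
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.zookeeper.KeeperException.NodeExistsException

// Curator handles the connection management and retries that previously had to be
// hand-rolled against the raw ZooKeeper client.
val zk: CuratorFramework =
  CuratorFrameworkFactory.newClient("localhost:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()

// Idempotent create: ignore NodeExists, as the commit above does.
try {
  zk.create().creatingParentsIfNeeded().forPath("/spark/master_status")
} catch {
  case _: NodeExistsException => // node already present, nothing to do
}
```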
Semih Salihoglu 1f4c7f7ecc Graph primitives2
Hi guys,

I'm following Joey and Ankur's suggestions to add collectEdges and pickRandomVertex. I'm also adding the tests for collectEdges and refactoring one method getCycleGraph in GraphOpsSuite.scala.

Thank you,

semih

Author: Semih Salihoglu <semihsalihoglu@gmail.com>

Closes #580 from semihsalihoglu/GraphPrimitives2 and squashes the following commits:

937d3ec [Semih Salihoglu] - Fixed the scalastyle errors.
a69a152 [Semih Salihoglu] - Adding collectEdges and pickRandomVertices. - Adding tests for collectEdges. - Refactoring a getCycle utility function for GraphOpsSuite.scala.
41265a6 [Semih Salihoglu] - Adding collectEdges and pickRandomVertex. - Adding tests for collectEdges. - Recycling a getCycle utility test file.
2014-02-24 22:42:30 -08:00
Andrew Ash a4f4fbc8fa Include reference to twitter/chill in tuning docs
Author: Andrew Ash <andrew@andrewash.com>

Closes #647 from ash211/doc-tuning and squashes the following commits:

b87de0a [Andrew Ash] Include reference to twitter/chill in tuning docs
2014-02-24 21:13:38 -08:00
Bryn Keller 4d88030486 For OutputFormats that are Configurable, call setConf before sending data to them.
[SPARK-1108] This allows us to use, e.g. HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw NullPointerException because the output table name hasn't been configured.

Note this bug also affects branch-0.9

Author: Bryn Keller <bryn.keller@intel.com>

Closes #638 from xoltar/SPARK-1108 and squashes the following commits:

7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review
7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured
2014-02-24 17:35:22 -08:00
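A minimal sketch of the idea (the helper is illustrative, not the actual PairRDDFunctions change): if the instantiated OutputFormat implements Hadoop's Configurable, hand it the job Configuration before any records are written.

```scala
import org.apache.hadoop.conf.{Configurable, Configuration}

// Illustrative helper: give Configurable output formats (e.g. HBase's
// TableOutputFormat) their Configuration before use, avoiding the NPE described above.
def configureIfNeeded[F](format: F, conf: Configuration): F = {
  format match {
    case c: Configurable => c.setConf(conf)
    case _ => // plain OutputFormats need nothing extra
  }
  format
}
```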
Matei Zaharia d8d190efd2 Merge pull request #641 from mateiz/spark-1124-master
SPARK-1124: Fix infinite retries of reduce stage when a map stage failed

In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null". See https://spark-project.atlassian.net/browse/SPARK-1124 for an example.

This PR also cleans up code style slightly where there was a variable named "s" and some weird map manipulation.
2014-02-24 16:58:57 -08:00
Matei Zaharia 0187cef0f2 Fix removal from shuffleToMapStage to search for a key-value pair with
our stage instead of using our shuffleID.
2014-02-24 13:14:56 -08:00
Matei Zaharia cd32d5e4de SPARK-1124: Fix infinite retries of reduce stage when a map stage failed
In the previous code, if you had a failing map stage and then tried to
run reduce stages on it repeatedly, the first reduce stage would fail
correctly, but the later ones would mistakenly believe that all map
outputs are available and start failing infinitely with fetch failures
from "null".
2014-02-23 23:48:32 -08:00
Sean Owen c0ef3afa82 SPARK-1071: Tidy logging strategy and use of log4j
Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger.

Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j.

This leaves the same behavior in Spark. It means that downstream users who want to use something other than log4j should:

- Exclude dependencies on log4j, slf4j-log4j12 from Spark
- Include dependency on log4j-over-slf4j
- Include dependency on another logger X, and another slf4j-X
- Recreate any log config that Spark does, that is needed, in the other logger's config

That sounds about right.

Here are the key changes:

- Include the jcl-over-slf4j shim everywhere by depending on it in core.
- Exclude dependencies on commons-logging from third-party libraries.
- Include the jul-to-slf4j shim everywhere by depending on it in core.
- Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings
- Added missing slf4j-log4j12 binding to GraphX, Bagel module tests

And minor/incidental changes:

- Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2
- (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
- (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me)

Author: Sean Owen <sowen@cloudera.com>

Closes #570 from srowen/SPARK-1071 and squashes the following commits:

52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core.
77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j
2014-02-23 11:40:55 -08:00
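A build-definition sketch (sbt syntax) of the downstream recipe in the commit above; the coordinates are the standard SLF4J/Logback artifacts, but the versions and the choice of Logback as the replacement backend are illustrative assumptions.

```scala
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.0.0")
    .exclude("org.slf4j", "slf4j-log4j12")      // drop Spark's log4j binding
    .exclude("log4j", "log4j"),
  "org.slf4j" % "log4j-over-slf4j" % "1.7.5",    // route log4j calls into SLF4J
  "ch.qos.logback" % "logback-classic" % "1.1.2" // the replacement backend
)
```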
CodingCat 437b62fcb0 [SPARK-1041] remove dead code in start script, remind user to set that in spark-env.sh
the lines in start-master.sh and start-slave.sh no longer work

in EC2, the hostname has changed, e.g.

ubuntu@ip-172-31-36-93:~$ hostname
ip-172-31-36-93

also, the URL to fetch the public DNS name has changed, e.g.

ubuntu@ip-172-31-36-93:~$ wget -q -O - http://instance-data.ec2.internal/latest/meta-data/public-hostname
ubuntu@ip-172-31-36-93:~$  (returns nothing)

since we have the spark-ec2 project, we don't need such EC2-specific lines here; instead, users only need to set this in spark-env.sh

Author: CodingCat <zhunansjtu@gmail.com>

Closes #588 from CodingCat/deadcode_in_sbin and squashes the following commits:

e4236e0 [CodingCat] remove dead code in start script, remind user set that in spark-env.sh
2014-02-22 20:21:15 -08:00
Punya Biswal 29ac7ea52f Migrate Java code to Scala or move it to src/main/java
These classes can't be migrated:
  StorageLevels: impossible to create static fields in Scala
  JavaSparkContextVarargsWorkaround: incompatible varargs
  JavaAPISuite: should test Java APIs in pure Java (for sanity)

Author: Punya Biswal <pbiswal@palantir.com>

Closes #605 from punya/move-java-sources and squashes the following commits:

25b00b2 [Punya Biswal] Remove redundant type param; reformat
853da46 [Punya Biswal] Use factory method rather than constructor
e5d53d9 [Punya Biswal] Migrate Java code to Scala or move it to src/main/java
2014-02-22 17:53:48 -08:00
CodingCat 1aa4f8af72 [SPARK-1055] fix the SCALA_VERSION and SPARK_VERSION in docker file
As reported in https://spark-project.atlassian.net/browse/SPARK-1055

"The used Spark version in the .../base/Dockerfile is stale on 0.8.1 and should be updated to 0.9.x to match the release."

Author: CodingCat <zhunansjtu@gmail.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>

Closes #634 from CodingCat/SPARK-1055 and squashes the following commits:

cb7330e [Nan Zhu] Update Dockerfile
adf8259 [CodingCat] fix the SCALA_VERSION and SPARK_VERSION in docker file
2014-02-22 15:39:25 -08:00
jyotiska 722199fab0 doctest updated for mapValues, flatMapValues in rdd.py
Updated doctests for mapValues and flatMapValues in rdd.py

Author: jyotiska <jyotiska123@gmail.com>

Closes #621 from jyotiska/python_spark and squashes the following commits:

716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
2014-02-22 15:10:31 -08:00
jyotiska 3ff077d489 Fixed minor typo in worker.py
Fixed minor typo in worker.py

Author: jyotiska <jyotiska123@gmail.com>

Closes #630 from jyotiska/pyspark_code and squashes the following commits:

ee44201 [jyotiska] typo fixed in worker.py
2014-02-22 10:09:50 -08:00
Xiangrui Meng aaec7d4a80 SPARK-1117: update accumulator docs
The current doc hints that Spark doesn't support accumulators of type `Long`, which is wrong.

JIRA: https://spark-project.atlassian.net/browse/SPARK-1117

Author: Xiangrui Meng <meng@databricks.com>

Closes #631 from mengxr/acc and squashes the following commits:

45ecd25 [Xiangrui Meng] update accumulator docs
2014-02-21 22:44:45 -08:00
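A hedged sketch of a Long accumulator (the file path and app name are made up; the implicit LongAccumulatorParam supplies the zero value and addInPlace):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Long-typed accumulators work out of the box, contrary to what the old docs implied.
val sc = new SparkContext("local", "accumulator-demo")
val bytesRead = sc.accumulator(0L)
sc.textFile("README.md").foreach(line => bytesRead += line.length.toLong)
println(s"bytes read: ${bytesRead.value}")
```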
Andrew Or fefd22f4c3 [SPARK-1113] External spilling - fix Int.MaxValue hash code collision bug
The original poster of this bug is @guojc, who opened a PR that preceded this one at https://github.com/apache/incubator-spark/pull/612.

ExternalAppendOnlyMap uses key hash code to order the buffer streams from which spilled files are read back into memory. When a buffer stream is empty, the default hash code for that stream is equal to Int.MaxValue. This is, however, a perfectly legitimate candidate for a key hash code. When reading from a spilled map containing such a key, a hash collision may occur, in which case we attempt to read from an empty stream and throw NoSuchElementException.

The fix is to maintain the invariant that empty buffer streams are never added back to the merge queue to be considered. This guarantees that we never read from an empty buffer stream, ever again.

This PR also includes two new tests for hash collisions.

Author: Andrew Or <andrewor14@gmail.com>

Closes #624 from andrewor14/spilling-bug and squashes the following commits:

9e7263d [Andrew Or] Slightly optimize next()
2037ae2 [Andrew Or] Move a few comments around...
cf95942 [Andrew Or] Remove default value of Int.MaxValue for minKeyHash
c11f03b [Andrew Or] Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap
21c1a39 [Andrew Or] Add hash collision tests to ExternalAppendOnlyMapSuite
2014-02-21 20:05:39 -08:00
Sean Owen c8a4c9b1f6 MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features
There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's computed as the sum of matrices; an f x f matrix is created for each of n user/item rows in a partition. In `ALS.scala:214`:

```
        factors.flatMapValues{ case factorArray =>
          factorArray.map{ vector =>
            val x = new DoubleMatrix(vector)
            x.mmul(x.transpose())
          }
        }.reduceByKeyLocally((a, b) => a.addi(b))
         .values
         .reduce((a, b) => a.addi(b))
```

Completely correct, but there's a subtle but quite large memory problem here. map() is going to create all of these matrices in memory at once, when they don't need to ever all exist at the same time.
For example, if a partition has n = 100000 rows, and f = 200, then this intermediate product requires 32GB of heap. The computation will never work unless you can cough up workers with (more than) that much heap.

Fortunately there's a trivial change that fixes it; just add `.view` in there.

Author: Sean Owen <sowen@cloudera.com>

Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the following commits:

062cda9 [Sean Owen] Update style per review comments
e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not simultaneously allocating lots of matrices
2014-02-21 12:46:12 -08:00
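The snippet from the commit above with the one-token fix applied (reusing the names from that snippet; DoubleMatrix is the jblas class): `.view` makes the per-row f x f outer products lazy, so they are summed one at a time instead of all being materialized by map() at once.

```scala
factors.flatMapValues { case factorArray =>
  factorArray.view.map { vector =>          // .view: build each outer product lazily
    val x = new DoubleMatrix(vector)
    x.mmul(x.transpose())
  }
}.reduceByKeyLocally((a, b) => a.addi(b))
 .values
 .reduce((a, b) => a.addi(b))
```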
Patrick Wendell 45b15e27a8 SPARK-1111: URL Validation Throws Error for HDFS URLs
Fixes an error where HDFS URLs cause an exception. Should be merged into master and 0.9.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #625 from pwendell/url-validation and squashes the following commits:

d14bfe3 [Patrick Wendell] SPARK-1111: URL Validation Throws Error for HDFS URL's
2014-02-21 11:11:55 -08:00
Ahir Reddy 59b1379594 SPARK-1114: Allow PySpark to use existing JVM and Gateway
Patch to allow PySpark to use an existing JVM and Gateway. Changes the PySpark implementation of SparkConf to take an existing SparkConf JVM handle, and changes the PySpark SparkContext to allow subclass-specific context initialization.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:

a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
2014-02-20 21:20:39 -08:00
Aaron Davidson 3fede4831e Super minor: Add require for mergeCombiners in combineByKey
We changed the behavior in 0.9.0 from requiring that mergeCombiners be null when mapSideCombine was false to requiring that mergeCombiners *never* be null, for external sorting. This patch adds a require() so that the behavior change produces an explicit error message rather than an NPE.

Author: Aaron Davidson <aaron@databricks.com>

Closes #623 from aarondav/master and squashes the following commits:

520b80c [Aaron Davidson] Super minor: Add require for mergeCombiners in combineByKey
2014-02-20 16:46:13 -08:00
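A minimal sketch of the guard (the method signature is illustrative, not the real RDD method): fail fast with a clear message instead of a NullPointerException later during external sorting.

```scala
// Illustrative shape of the check; only the require() line is the point here.
def combineByKey[K, V, C](createCombiner: V => C,
                          mergeValue: (C, V) => C,
                          mergeCombiners: (C, C) => C): Unit = {
  require(mergeCombiners != null, "mergeCombiners must be defined")
  // ... build the combiners ...
}
```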
Sean Owen 9e63f80e75 MLLIB-22. Support negative implicit input in ALS
I'm back with another less trivial suggestion for ALS:

In ALS for implicit feedback, input values are treated as weights on squared errors in a loss function (or rather, the weight is a simple function of the input r, like c = 1 + alpha*r). The paper on which it's based assumes that the input is positive. Indeed, if the input is negative, it will create a negative weight on squared errors, which causes things to go haywire. The optimization will try to make the error in a cell as large as possible, and the result is silently bogus.

There is a good use case for negative input values though. Implicit feedback is usually collected from signals of positive interaction like a view, a like, or a buy, but it can equally come from "not interested" signals. The natural representation is negative values.

The algorithm can be extended quite simply to provide a sound interpretation of these values: negative values should encourage the factorization to come up with 0 for cells with large negative input values, just as much as positive values encourage it to come up with 1.

The implications for the algorithm are simple:
* the confidence function value must not be negative, and so can become 1 + alpha*|r|
* the matrix P should have a value 1 where the input R is _positive_, not merely where it is non-zero. Actually, that's what the paper already says, it's just that we can't assume P = 1 when a cell in R is specified anymore, since it may be negative

This in turn entails just a few lines of code change in `ALS.scala`:
* `rs(i)` becomes `abs(rs(i))`
* When constructing `userXy(us(i))`, it's implicitly only adding where P is 1. That had been true for any us(i) that is iterated over, before, since these are exactly the ones for which P is 1. But now P is zero where rs(i) <= 0, and should not be added

I think it's a safe change because:
* It doesn't change any existing behavior (unless you're using negative values, in which case results are already borked)
* It's the simplest direct extension of the paper's algorithm
* (I've used it to good effect in production FWIW)

Tests included.

I tweaked minor things en route:
* `ALS.scala` javadoc writes "R = Xt*Y" when the paper and rest of code defines it as "R = X*Yt"
* RMSE in the ALS tests uses a confidence-weighted mean, but the denominator is not actually sum of weights

Excuse my Scala style; I'm sure it needs tweaks.

Author: Sean Owen <sowen@cloudera.com>

Closes #500 from srowen/ALSNegativeImplicitInput and squashes the following commits:

cf902a9 [Sean Owen] Support negative implicit input in ALS
953be1c [Sean Owen] Make weighted RMSE in ALS test actually weighted; adjust comment about R = X*Yt
2014-02-19 23:44:53 -08:00
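A tiny sketch of the two changes described above (function names are illustrative): the confidence weight uses |r| so it can never go negative, and the preference P is 1 only where the raw input is strictly positive.

```scala
// Illustrative helpers mirroring the description in the commit message.
def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * math.abs(r)
def preference(r: Double): Double = if (r > 0) 1.0 else 0.0
```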