Commit graph

2826 commits

Andrew Ash c0795cf481 Worker registration logging fix
Author: Andrew Ash <andrew@andrewash.com>

Closes #608 from ash211/patch-7 and squashes the following commits:

bd85f2a [Andrew Ash] Worker registration logging fix
2014-02-17 09:51:55 -08:00
Punya Biswal 5af4477c2b Add subtractByKey to the JavaPairRDD wrapper
Author: Punya Biswal <pbiswal@palantir.com>

Closes #600 from punya/subtractByKey-java and squashes the following commits:

e961913 [Punya Biswal] Hide implicit ClassTags from Java API
c5d317b [Punya Biswal] Add subtractByKey to the JavaPairRDD wrapper
2014-02-16 18:55:59 -08:00
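A minimal usage sketch of the semantics the new wrapper exposes (editorial, shown via the Scala API the Java wrapper delegates to; assumes an existing SparkContext `sc`):

    import org.apache.spark.SparkContext._

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val right = sc.parallelize(Seq(("b", 99)))
    // Keeps only pairs whose key has no match in `right`.
    left.subtractByKey(right).collect() // Array((a,1), (c,3))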
Bijay Bisht 73cfdcfe71 fix for https://spark-project.atlassian.net/browse/SPARK-1052
Author: Bijay Bisht <bijay.bisht@gmail.com>

Closes #568 from bijaybisht/SPARK-1052 and squashes the following commits:

da70395 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 - comments incorporated
fdb1d94 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052

(cherry picked from commit e797c1abd9)
Signed-off-by: Aaron Davidson <aaron@databricks.com>
2014-02-16 16:54:03 -08:00
CodingCat 1cad381387 [SPARK-1092] print warning information if the user uses SPARK_MEM to regulate executor memory usage
https://spark-project.atlassian.net/browse/SPARK-1092?jql=project%20%3D%20SPARK

Print warning information if the user sets SPARK_MEM to regulate the memory usage of executors.

----
OUTDATED:

Currently, users usually set SPARK_MEM to control the memory usage of driver programs (in spark-class):

JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"

If they didn't set spark.executor.memory, the value of this environment variable also affects the memory usage of executors, because of the following lines in SparkContext:

private[spark] val executorMemory = conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_MEM")))
  .map(Utils.memoryStringToMb)
  .getOrElse(512)

Also, since SPARK_MEM has been proposed for deprecation in SPARK-929 (https://spark-project.atlassian.net/browse/SPARK-929) and the corresponding PR (https://github.com/apache/incubator-spark/pull/104), we should remove this line.

Author: CodingCat <zhunansjtu@gmail.com>

Closes #602 from CodingCat/clean_spark_mem and squashes the following commits:

302bb28 [CodingCat] print warning information if the user uses SPARK_MEM to regulate executor memory usage
2014-02-16 12:25:38 -08:00
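A self-contained sketch of the shape of the check (editorial; the real patch logs through Spark's Logging trait, and the wording here is illustrative):

    import org.apache.spark.SparkConf

    def warnOnSparkMem(conf: SparkConf): Unit = {
      // Warn only when SPARK_MEM would actually take effect, i.e. when
      // spark.executor.memory was not set explicitly.
      if (conf.getOption("spark.executor.memory").isEmpty &&
          Option(System.getenv("SPARK_MEM")).isDefined) {
        System.err.println("Warning: SPARK_MEM is deprecated; " +
          "please set spark.executor.memory instead.")
      }
    }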
Xiangrui Meng 7e29e02791 Merge pull request #591 from mengxr/transient-new.
SPARK-1076: [Fix #578] add @transient to some vals

I'll try to be more careful next time.

Author: Xiangrui Meng <meng@databricks.com>

Closes #591 and squashes the following commits:

2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD; add @transient to seed in PartitionwiseSampledRDD
2014-02-12 16:26:25 -08:00
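For context, a minimal illustration of what @transient buys (editorial; the class name is hypothetical): the annotated field is skipped by Java serialization, so it is not shipped inside task closures:

    import org.apache.spark.rdd.RDD

    // `prev` is only needed on the driver to compute partitions; marking it
    // @transient keeps the parent RDD out of the serialized task closure.
    class ExampleRDDWrapper[T](@transient val prev: RDD[T]) extends Serializable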
Xiangrui Meng 2bea0709f9 Merge pull request #589 from mengxr/index.
SPARK-1076: Convert Int to Long to avoid overflow

Patch for PR #578.

Author: Xiangrui Meng <meng@databricks.com>

Closes #589 and squashes the following commits:

98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow
2014-02-12 10:47:52 -08:00
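The hazard in miniature (editorial example): per-partition counts summed in Int wrap past 2^31 - 1, so cast to Long before accumulating:

    val counts: Array[Int] = Array.fill(4)(Int.MaxValue / 2)
    val wrong = counts.sum               // Int overflow: wraps to -4
    val right = counts.map(_.toLong).sum // 4294967292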
Xiangrui Meng e733d655df Merge pull request #578 from mengxr/rank.
SPARK-1076: zipWithIndex and zipWithUniqueId to RDD

Assigning ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assigning ranks in parallel.

The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one Spark job.

https://spark-project.atlassian.net/browse/SPARK-1076

== update ==
Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field.

Author: Xiangrui Meng <meng@databricks.com>

Closes #578 and squashes the following commits:

52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex; changed assignUniqueIds to zipWithUniqueId; minor updates
756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately; moved counting iterator size to Utils; do not count items in the last partition and skip counting if there is only one partition
630868c [Xiangrui Meng] newline
21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD
2014-02-12 00:42:42 -08:00
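A usage sketch of the merged API (editorial; assumes an existing SparkContext `sc`):

    val rdd = sc.parallelize(Seq("a", "b", "c"), 2)
    rdd.zipWithIndex().collect()    // Array((a,0), (b,1), (c,2))
    // Ids are unique but not necessarily consecutive, which is what lets
    // this variant skip the counting job.
    rdd.zipWithUniqueId().collect()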
Raymond Liu 68b2c0d02d Merge pull request #583 from colorant/zookeeper.
Minor fix for ZooKeeperPersistenceEngine to use configured working dir

Author: Raymond Liu <raymond.liu@intel.com>

Closes #583 and squashes the following commits:

91b0609 [Raymond Liu] Minor fix for ZooKeeperPersistenceEngine to use configured working dir
2014-02-11 22:39:48 -08:00
Holden Karau b0dab1bb9f Merge pull request #571 from holdenk/switchtobinarysearch.
SPARK-1072 Use binary search when needed in RangePartitioner

Author: Holden Karau <holden@pigscanfly.ca>

Closes #571 and squashes the following commits:

f31a2e1 [Holden Karau] Switch to using CollectionsUtils in Partitioner
4c7a0c3 [Holden Karau] Add CollectionsUtil as suggested by aarondav
7099962 [Holden Karau] Add the binary search to only init once
1bef01d [Holden Karau] CR feedback
a21e097 [Holden Karau] Use binary search if we have more than 1000 elements inside of RangePartitioner
2014-02-11 14:48:59 -08:00
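The core idea, sketched (editorial; Spark's actual CollectionsUtils generates type-specialized variants): map a key to a partition with binary search over the sorted range bounds instead of a linear scan:

    def partitionFor(key: Int, bounds: Array[Int]): Int = {
      val i = java.util.Arrays.binarySearch(bounds, key)
      if (i >= 0) i else -i - 1 // insertion point when the key is absent
    }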
Patrick Wendell d6a9bdc097 Revert "Merge pull request #560 from pwendell/logging. Closes #560."
This reverts commit b6d40b7823.
2014-02-09 23:35:06 -08:00
Prashant Sharma 919bd7f669 Merge pull request #567 from ScrapCodes/style2.
SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2

Continuation of PR #557

With this, all Scala style errors are fixed across the code base!

The reason for creating a separate PR was to avoid interrupting an already-reviewed, ready-to-merge PR. I hope this gets reviewed and merged soon.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #567 and squashes the following commits:

3b1ec30 [Prashant Sharma] scala style fixes
2014-02-09 22:17:52 -08:00
qqsun8819 afc8f3cb9a Merge pull request #551 from qqsun8819/json-protocol.
[SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself

This is a PR for SPARK-1038. Two major changes:
1. Add some fields to JsonProtocol that are new and important to standalone-related data structures.
2. Use Diff in liftweb.json to verify the stringified JSON output, to detect someone changing a type T to Option[T].

Author: qqsun8819 <jin.oyj@alibaba-inc.com>

Closes #551 and squashes the following commits:

fdf0b4e [qqsun8819] [SPARK-1038] 1. Change code style to be more readable per rxin's review 2. Change the hard-coded submitdate string to a date object's toString for more flexibility
095a26f [qqsun8819] [SPARK-1038] Modified according to pwendell's review: use hard-coded JSON strings for JSON data validation. Each test uses its own JSON string
0524e41 [qqsun8819] Merge remote-tracking branch 'upstream/master' into json-protocol
d203d5c [qqsun8819] [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself
2014-02-09 13:57:29 -08:00
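A sketch of the testing idea with lift-json's Diff (editorial; the JSON fields are hypothetical):

    import net.liftweb.json._

    val expected = parse("""{"id":"app-1","memory":512}""")
    val actual   = parse("""{"id":"app-1","memory":512}""")
    // An empty diff means the generated JSON matches the hard-coded document;
    // a T-to-Option[T] change would surface as a changed or deleted field.
    val Diff(changed, added, deleted) = expected diff actual
    assert(changed == JNothing && added == JNothing && deleted == JNothing)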
Patrick Wendell b69f8b2a01 Merge pull request #557 from ScrapCodes/style. Closes #557.
SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

== Merge branch commits ==

commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4
Author: Prashant Sharma <scrapcodes@gmail.com>
Date:   Sun Feb 9 17:39:07 2014 +0530

    scala style fixes

commit f91709887a8e0b608c5c2b282db19b8a44d53a43
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Fri Jan 24 11:22:53 2014 -0800

    Adding scalastyle snapshot
2014-02-09 10:09:19 -08:00
CodingCat b6dba10ae5 Merge pull request #556 from CodingCat/JettyUtil. Closes #556.
[SPARK-1060] startJettyServer should explicitly use IP information

https://spark-project.atlassian.net/browse/SPARK-1060

In the current implementation, the webserver in Master/Worker is started with

val (srv, bPort) = JettyUtils.startJettyServer("0.0.0.0", port, handlers)

inside startJettyServer:

val server = new Server(currentPort) // here, the Server will take "0.0.0.0" as the hostname, i.e. it will always bind to the IP address of the first NIC

This can cause binding to the wrong IP. For example, if the host has two NICs, N1 and N2, and the user specifies SPARK_LOCAL_IP as N2's IP address, the web server will nevertheless, for the reason stated above, always bind to N1's address.

Author: CodingCat <zhunansjtu@gmail.com>

== Merge branch commits ==

commit 6c6d9a8ccc9ec4590678a3b34cb03df19092029d
Author: CodingCat <zhunansjtu@gmail.com>
Date:   Thu Feb 6 14:53:34 2014 -0500

    startJettyServer should explicitly use IP information
2014-02-08 23:39:17 -08:00
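The gist of the fix, sketched against the plain Jetty API (editorial; the host value is an example):

    import java.net.InetSocketAddress
    import org.eclipse.jetty.server.Server

    // Binding to an explicit address honors SPARK_LOCAL_IP instead of
    // letting Jetty choose the interface for a bare port.
    val server = new Server(new InetSocketAddress("10.0.1.2", 8081))
    server.start()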
Patrick Wendell b6d40b7823 Merge pull request #560 from pwendell/logging. Closes #560.
[WIP] SPARK-1067: Default log4j initialization causes errors for those not using log4j

To fix this - we add a check when initializing log4j.

Author: Patrick Wendell <pwendell@gmail.com>

== Merge branch commits ==

commit ffdce513877f64b6eed6d36138c3e0003d392889
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Fri Feb 7 15:22:29 2014 -0800

    Logging fix
2014-02-08 23:35:31 -08:00
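A sketch of the check (editorial, using the standard log4j 1.x idiom): treat "root logger has no appenders" as "the user configured nothing" and only then install Spark's defaults:

    import org.apache.log4j.{Level, LogManager}

    val rootLogger = LogManager.getRootLogger
    val log4jInitialized = rootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // Load Spark's default log4j properties here instead of erroring out.
      rootLogger.setLevel(Level.INFO)
    }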
Qiuzhuang Lian f0ce736fad Merge pull request #561 from Qiuzhuang/master. Closes #561.
Kill drivers in postStop() for Worker.

JIRA SPARK-1068: https://spark-project.atlassian.net/browse/SPARK-1068

Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com>

== Merge branch commits ==

commit 9c19ce63637eee9369edd235979288d3d9fc9105
Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com>
Date:   Sat Feb 8 16:07:39 2014 +0800

    Kill drivers in postStop() for Worker.
     JIRA SPARK-1068: https://spark-project.atlassian.net/browse/SPARK-1068
2014-02-08 12:59:48 -08:00
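The shape of the fix as a simplified, self-contained Akka sketch (editorial; `drivers` and the process handling are stand-ins for the Worker's real state):

    import akka.actor.Actor

    class WorkerLike extends Actor {
      private val drivers = scala.collection.mutable.Map.empty[String, Process]
      def receive = { case _ => }
      // postStop runs when the actor terminates, so supervised driver
      // processes get cleaned up rather than leaked.
      override def postStop() {
        drivers.values.foreach(_.destroy())
      }
    }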
Andrew Ash 3a9d82cc9e Merge pull request #506 from ash211/intersection. Closes #506.
SPARK-1062 Add rdd.intersection(otherRdd) method

Author: Andrew Ash <andrew@andrewash.com>

== Merge branch commits ==

commit 5d9982b171b9572649e9828f37ef0b43f0242912
Author: Andrew Ash <andrew@andrewash.com>
Date:   Thu Feb 6 18:11:45 2014 -0800

    Minor fixes

    - style: (v,null) => (v, null)
    - mention the shuffle in Javadoc

commit b86d02f14e810902719cef893cf6bfa18ff9acb0
Author: Andrew Ash <andrew@andrewash.com>
Date:   Sun Feb 2 13:17:40 2014 -0800

    Overload .intersection() for numPartitions and custom Partitioner

commit bcaa34911fcc6bb5bc5e4f9fe46d1df73cb71c09
Author: Andrew Ash <andrew@andrewash.com>
Date:   Sun Feb 2 13:05:40 2014 -0800

    Better naming of parameters in intersection's filter

commit b10a6af2d793ec6e9a06c798007fac3f6b860d89
Author: Andrew Ash <andrew@andrewash.com>
Date:   Sat Jan 25 23:06:26 2014 -0800

    Follow spark code format conventions of tab => 2 spaces

commit 965256e4304cca514bb36a1a36087711dec535ec
Author: Andrew Ash <andrew@andrewash.com>
Date:   Fri Jan 24 00:28:01 2014 -0800

    Add rdd.intersection(otherRdd) method
2014-02-06 22:39:08 -08:00
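The technique the commits describe, as a standalone sketch (editorial; Spark's rdd.intersection is equivalent in spirit): key both sides by value, cogroup, and keep the keys seen on both sides. Note the shuffle mentioned in the Javadoc fix:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def intersection[T: ClassTag](a: RDD[T], b: RDD[T]): RDD[T] =
      a.map(v => (v, null)).cogroup(b.map(v => (v, null)))
        .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }
        .keys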
Andrew Or 1896c6e7c9 Merge pull request #533 from andrewor14/master. Closes #533.
External spilling - generalize batching logic

The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way.

Author: Andrew Or <andrewor14@gmail.com>

== Merge branch commits ==

commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82
Author: Andrew Or <andrewor14@gmail.com>
Date:   Wed Feb 5 12:09:32 2014 -0800

    Also privatize fields

commit 090544a87a0767effd0c835a53952f72fc8d24f0
Author: Andrew Or <andrewor14@gmail.com>
Date:   Wed Feb 5 10:58:23 2014 -0800

    Privatize methods

commit 13920c918efe22e66a1760b14beceb17a61fd8cc
Author: Andrew Or <andrewor14@gmail.com>
Date:   Tue Feb 4 16:34:15 2014 -0800

    Update docs

commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3
Author: Andrew Or <andrewor14@gmail.com>
Date:   Tue Feb 4 13:44:24 2014 -0800

    Typo: phyiscal -> physical

commit 287ef44e593ad72f7434b759be3170d9ee2723d2
Author: Andrew Or <andrewor14@gmail.com>
Date:   Tue Feb 4 13:38:32 2014 -0800

    Avoid reading the entire batch into memory; also simplify streaming logic

    Additionally, address formatting comments.

commit 3df700509955f7074821e9aab1e74cb53c58b5a5
Merge: a531d2e 164489d
Author: Andrew Or <andrewor14@gmail.com>
Date:   Mon Feb 3 18:27:49 2014 -0800

    Merge branch 'master' of github.com:andrewor14/incubator-spark

commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8
Author: Andrew Or <andrewor14@gmail.com>
Date:   Mon Feb 3 18:18:04 2014 -0800

    Relax assumptions on compressors and serializers when batching

    This commit introduces an intermediate layer of an input stream on the batch level.
    This guards against interference from higher level streams (i.e. compression and
    deserialization streams), especially pre-fetching, without specifically targeting
    particular libraries (Kryo) and forcing shuffle spill compression to use LZF.

commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af
Author: Andrew Or <andrewor14@gmail.com>
Date:   Mon Feb 3 18:18:04 2014 -0800

    Relax assumptions on compressors and serializers when batching

    This commit introduces an intermediate layer of an input stream on the batch level.
    This guards against interference from higher level streams (i.e. compression and
    deserialization streams), especially pre-fetching, without specifically targeting
    particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
2014-02-06 22:05:53 -08:00
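A sketch of the batching idea (editorial, using Guava's ByteStreams.limit as an assumed helper; not the actual patch): give each on-disk batch its own bounded stream, so a compression or deserialization stream layered on top cannot pre-fetch past the batch boundary:

    import java.io.{BufferedInputStream, File, FileInputStream, InputStream}
    import com.google.common.io.ByteStreams

    def batchStream(file: File, offset: Long, length: Long): InputStream = {
      val in = new FileInputStream(file)
      in.skip(offset) // seek to the start of this batch
      // The bounded view ends exactly at the batch boundary.
      ByteStreams.limit(new BufferedInputStream(in), length)
    }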
Kay Ousterhout 0b448df6ac Merge pull request #450 from kayousterhout/fetch_failures. Closes #450.
Only run ResubmitFailedStages event after a fetch fails

Previously, the ResubmitFailedStages event was called every
200 milliseconds, leading to a lot of unnecessary event processing
and clogged DAGScheduler logs.

Author: Kay Ousterhout <kayousterhout@gmail.com>

== Merge branch commits ==

commit e603784b3a562980e6f1863845097effe2129d3b
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Feb 5 11:34:41 2014 -0800

    Re-add check for empty set of failed stages

commit d258f0ef50caff4bbb19fb95a6b82186db1935bf
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Jan 15 23:35:41 2014 -0800

    Only run ResubmitFailedStages event after a fetch fails

    Previously, the ResubmitFailedStages event was called every
    200 milliseconds, leading to a lot of unnecessary event processing
    and clogged DAGScheduler logs.
2014-02-06 16:15:24 -08:00
Kay Ousterhout 18ad59e2c6 Merge pull request #321 from kayousterhout/ui_kill_fix. Closes #321.
Inform DAG scheduler about all started/finished tasks.

Previously, the DAG scheduler was not always informed
when tasks started and finished. The simplest example here
is for speculated tasks: the DAGScheduler was only told about
the first attempt of a task, meaning that SparkListeners were
also not told about multiple task attempts, so users can't see
what's going on with speculation in the UI.  The DAGScheduler
also wasn't always told about finished tasks, so in the UI, some
tasks will never be shown as finished (this occurs, for example,
if a task set gets killed).

The other problem is that the fairness accounting was wrong
-- the number of running tasks in a pool was decreased when a
task set was considered done, even if all of its tasks hadn't
yet finished.

Author: Kay Ousterhout <kayousterhout@gmail.com>

== Merge branch commits ==

commit c8d547d0f7a17f5a193bef05f5872b9f475675c5
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Jan 15 16:47:33 2014 -0800

    Addressed Reynold's review comments.

    Always use a TaskEndReason (remove the option), and explicitly
    signal when we don't know the reason. Also, always tell
    DAGScheduler (and associated listeners) about started tasks, even
    when they're speculated.

commit 3fee1e2e3c06b975ff7f95d595448f38cce97a04
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Jan 8 22:58:13 2014 -0800

    Fixed broken test and improved logging

commit ff12fcaa2567c5d02b75a1d5db35687225bcd46f
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Sun Dec 29 21:08:20 2013 -0800

    Inform DAG scheduler about all finished tasks.

    Previously, the DAG scheduler was not always informed
    when tasks finished. For example, when a task set was
    aborted, the DAG scheduler was never told when the tasks
    in that task set finished. The DAG scheduler was also
    never told about the completion of speculated tasks.
    This led to confusion with SparkListeners because information
    about the completion of those tasks was never passed on to
    the listeners (so in the UI, for example, some tasks will never
    be shown as finished).

    The other problem is that the fairness accounting was wrong
    -- the number of running tasks in a pool was decreased when a
    task set was considered done, even if all of its tasks hadn't
    yet finished.
2014-02-06 16:10:48 -08:00
Sandy Ryza 446403b637 Merge pull request #554 from sryza/sandy-spark-1056. Closes #554.
SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone.

Author: Sandy Ryza <sandy@cloudera.com>

== Merge branch commits ==

commit 1f2443d902a26365a5c23e4af9077e1539ed2eab
Author: Sandy Ryza <sandy@cloudera.com>
Date:   Thu Feb 6 15:03:50 2014 -0800

    SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone
2014-02-06 15:41:16 -08:00
Kay Ousterhout 79c95527a7 Merge pull request #545 from kayousterhout/fix_progress. Closes #545.
Fix off-by-one error with task progress info log.

Author: Kay Ousterhout <kayousterhout@gmail.com>

== Merge branch commits ==

commit 29798fc685c4e7e3eb3bf91c75df7fa8ec94a235
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Feb 5 13:40:01 2014 -0800

    Fix off-by-one error with task progress info log.
2014-02-05 23:38:12 -08:00
CodingCat 18c4ee71e2 Merge pull request #549 from CodingCat/deadcode_master. Closes #549.
remove actorToWorker in master.scala, which is actually not used

actorToWorker is actually not used in the code, so just remove it.

Author: CodingCat <zhunansjtu@gmail.com>

== Merge branch commits ==

commit 52656c2d4bbf9abcd8bef65d454badb9cb14a32c
Author: CodingCat <zhunansjtu@gmail.com>
Date:   Thu Feb 6 00:28:26 2014 -0500

    remove actorToWorker in master.scala, which is actually not used
2014-02-05 22:08:47 -08:00
Stevo Slavić 0c05cd374d Merge pull request #535 from sslavic/patch-2. Closes #535.
Fixed typo in scaladoc

Author: Stevo Slavić <sslavic@gmail.com>

== Merge branch commits ==

commit 0a77f789e281930f4168543cc0d3b3ffbf5b3764
Author: Stevo Slavić <sslavic@gmail.com>
Date:   Tue Feb 4 15:30:27 2014 +0100

    Fixed typo in scaladoc
2014-02-04 09:45:46 -08:00
Xiangrui Meng 23af00f9e0 Merge pull request #528 from mengxr/sample. Closes #528.
Refactor RDD sampling and add randomSplit to RDD (update)

Replace SampledRDD by PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sample with/without replacement can be easily integrated via BernoulliSampler and PoissonSampler. The benefits are:

1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513
2) Stratified sampling and importance sampling can be implemented in the same manner as well.

Unit tests are included for samplers and RDD.randomSplit.

This should perform better than my previous pull request, where the BernoulliSampler creates many Iterator instances:
https://github.com/apache/incubator-spark/pull/513

Author: Xiangrui Meng <meng@databricks.com>

== Merge branch commits ==

commit e8ce957e5f0a600f2dec057924f4a2ca6adba373
Author: Xiangrui Meng <meng@databricks.com>
Date:   Mon Feb 3 12:21:08 2014 -0800

    more docs to PartitionwiseSampledRDD

commit fbb4586d0478ff638b24bce95f75ff06f713d43b
Author: Xiangrui Meng <meng@databricks.com>
Date:   Mon Feb 3 00:44:23 2014 -0800

    move XORShiftRandom to util.random and use it in BernoulliSampler

commit 987456b0ee8612fd4f73cb8c40967112dc3c4c2d
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 11:06:59 2014 -0800

    relax assertions in SortingSuite because the RangePartitioner has large variance in this case

commit 3690aae416b2dc9b2f9ba32efa465ba7948477f4
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 09:56:28 2014 -0800

    test split ratio of RDD.randomSplit

commit 8a410bc933a60c4d63852606f8bbc812e416d6ae
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 09:25:22 2014 -0800

    add a test to ensure seed distribution and minor style update

commit ce7e866f674c30ab48a9ceb09da846d5362ab4b6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 18:06:22 2014 -0800

    minor style change

commit 750912b4d77596ed807d361347bd2b7e3b9b7a74
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 18:04:54 2014 -0800

    fix some long lines

commit c446a25c38d81db02821f7f194b0ce5ab4ed7ff5
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 17:59:59 2014 -0800

    add complement to BernoulliSampler and minor style changes

commit dbe2bc2bd888a7bdccb127ee6595840274499403
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 17:45:08 2014 -0800

    switch to partition-wise sampling for better performance

commit a1fca5232308feb369339eac67864c787455bb23
Merge: ac712e4 cf6128f
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 16:33:09 2014 -0800

    Merge branch 'sample' of github.com:mengxr/incubator-spark into sample

commit cf6128fb672e8c589615adbd3eaa3cbdb72bd461
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 14:40:07 2014 -0800

    set SampledRDD deprecated in 1.0

commit f430f847c3df91a3894687c513f23f823f77c255
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 14:38:59 2014 -0800

    update code style

commit a8b5e2021a9204e318c80a44d00c5c495f1befb6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 12:56:27 2014 -0800

    move package random to util.random

commit ab0fa2c4965033737a9e3a9bf0a59cbb0df6a6f5
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 12:50:35 2014 -0800

    add Apache headers and update code style

commit 985609fe1a55655ad11966e05a93c18c138a403d
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 11:49:25 2014 -0800

    add new lines

commit b21bddf29850a2c006a868869b8f91960a029322
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 11:46:35 2014 -0800

    move samplers to random.IndependentRandomSampler and add tests

commit c02dacb4a941618e434cefc129c002915db08be6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Jan 25 15:20:24 2014 -0800

    add RandomSampler

commit 8ff7ba3c5cf1fc338c29ae8b5fa06c222640e89c
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 24 13:23:22 2014 -0800

    init impl of IndependentlySampledRDD
2014-02-03 13:02:09 -08:00
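A usage sketch of the merged API (editorial; assumes an existing SparkContext `sc`):

    val data = sc.parallelize(1 to 1000)
    // Each split is sampled partition-wise with the same family of samplers.
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 17L)
    val tenPercent = data.sample(false, 0.1, 17) // without replacement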
Aaron Davidson 1625d8c446 Merge pull request #530 from aarondav/cleanup. Closes #530.
Remove explicit conversion to PairRDDFunctions in cogroup()

As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however.

Author: Aaron Davidson <aaron@databricks.com>

== Merge branch commits ==

commit aa4a63f1bfd5b5178fe67364dd7ce4d84c357996
Author: Aaron Davidson <aaron@databricks.com>
Date:   Sun Feb 2 23:48:04 2014 -0800

    Remove explicit conversion to PairRDDFunctions in cogroup()

    As SparkContext._ is already imported, using the implicit conversion
    appears to make the code much cleaner. Perhaps there was some sinister
reason for doing the conversion explicitly, however.
2014-02-03 11:25:39 -08:00
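The before/after in miniature (editorial sketch):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.{PairRDDFunctions, RDD}

    def cogroupBoth(rdd: RDD[(Int, String)], other: RDD[(Int, String)]) = {
      val explicit    = new PairRDDFunctions(rdd).cogroup(other) // before
      val viaImplicit = rdd.cogroup(other)                       // after
      (explicit, viaImplicit)
    }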
Erik Selin 0ff38c2220 Merge pull request #494 from tyro89/worker_registration_issue
Issue with failed worker registrations

I've been going through the spark source after having some odd issues with workers dying and not coming back. After some digging (I'm very new to scala and spark) I believe I've found a worker registration issue. It looks to me like a failed registration follows the same code path as a successful registration, which ends up with workers believing they are connected (since they received a `RegisteredWorker` event) even though they are not registered on the Master.

This is a quick fix that I hope addresses this issue (assuming I didn't completely misread the code and I'm about to look like a silly person :P)

I'm opening this pr now to start a chat with you guys while I do some more testing on my side :)

Author: Erik Selin <erik.selin@jadedpixel.com>

== Merge branch commits ==

commit 973012f8a2dcf1ac1e68a69a2086a1b9a50f401b
Author: Erik Selin <erik.selin@jadedpixel.com>
Date:   Tue Jan 28 23:36:12 2014 -0500

    break logwarning into two lines to respect line character limit.

commit e3754dc5b94730f37e9806974340e6dd93400f85
Author: Erik Selin <erik.selin@jadedpixel.com>
Date:   Tue Jan 28 21:16:21 2014 -0500

    add log warning when worker registration fails due to attempt to re-register on same address.

commit 14baca241fa7823e1213cfc12a3ff2a9b865b1ed
Author: Erik Selin <erik.selin@jadedpixel.com>
Date:   Wed Jan 22 21:23:26 2014 -0500

    address code style comment

commit 71c0d7e6f59cd378d4e24994c21140ab893954ee
Author: Erik Selin <erik.selin@jadedpixel.com>
Date:   Wed Jan 22 16:01:42 2014 -0500

    Make a failed registration not persist, not send a `RegisteredWorker` event, and not run `schedule`, but rather send a `RegisterWorkerFailed` message to the worker attempting to register.
2014-01-29 12:44:54 -08:00
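The shape of the fix as a simplified, self-contained sketch (editorial; the real Master tracks addresses and considerably more state):

    import akka.actor.Actor

    case class RegisterWorker(id: String)
    case object RegisteredWorker
    case class RegisterWorkerFailed(reason: String)

    class MasterLike extends Actor {
      private val workers = scala.collection.mutable.Set.empty[String]
      def receive = {
        case RegisterWorker(id) if workers.contains(id) =>
          // Previously both paths replied RegisteredWorker; now a duplicate
          // registration gets an explicit failure and is not scheduled.
          sender ! RegisterWorkerFailed("attempted to re-register worker " + id)
        case RegisterWorker(id) =>
          workers += id
          sender ! RegisteredWorker
      }
    }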
Josh Rosen 1381fc72f7 Switch from MUTF8 to UTF8 in PySpark serializers.
This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
2014-01-28 20:20:08 -08:00
Reynold Xin 84670f2715 Merge pull request #466 from liyinan926/file-overwrite-new
Allow files added through SparkContext.addFile() to be overwritten

This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.
2014-01-27 17:08:35 -08:00
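A usage sketch (editorial; treat the exact configuration key as an assumption from this era's docs):

    import org.apache.spark.SparkConf

    // Allow files added via SparkContext.addFile() to be overwritten when
    // their contents change, instead of throwing an exception.
    val conf = new SparkConf().set("spark.files.overwrite", "true")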
Reynold Xin f16c21e22f Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined
Replace the check for None Option with isDefined and isEmpty in Scala code

Propose to replace the Scala check for Option "!= None" with Option.isDefined and "=== None" with Option.isEmpty.

I think using a method call where possible, rather than an operator function plus an argument, will make the Scala code easier to read and understand.

Pass compile and tests.
2014-01-27 14:24:06 -08:00
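The convention in miniature (editorial):

    val opt: Option[Int] = Some(1)
    val before = opt != None   // reads like a reference-equality check
    val after  = opt.isDefined // preferred; likewise opt.isEmpty over == None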
Reynold Xin c40619d487 Merge pull request #504 from JoshRosen/SPARK-1025
Fix PySpark hang when input files are deleted (SPARK-1025)

This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.
2014-01-25 22:41:30 -08:00
Josh Rosen 740e865f40 Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)
This fixes an issue where collectAsMap() could
fail when called on a JavaPairRDD that was derived
by transforming a non-JavaPairRDD.

The root problem was that we were creating the
JavaPairRDD's ClassTag by casting a
ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]].
To fix this, I cast a ClassTag[Tuple2[_, _]]
instead, since this actually produces a ClassTag
of the appropriate type because ClassTags don't
capture type parameters:

scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res8: Boolean = true

scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
res9: Boolean = false
2014-01-25 16:41:12 -08:00
Patrick Wendell 3d6e754193 Merge pull request #503 from pwendell/master
Fix bug on read-side of external sort when using Snappy.

This case wasn't handled correctly and this patch fixes it.
2014-01-23 19:47:00 -08:00
Patrick Wendell ff44732171 Minor fix 2014-01-23 19:23:12 -08:00
Patrick Wendell c3196171f3 Merge pull request #502 from pwendell/clone-1
Remove Hadoop object cloning and warn users making Hadoop RDD's.

The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in various file formats,
including Avro and another custom format.

This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.
2014-01-23 19:11:59 -08:00
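The documented workaround, sketched (editorial; the path and types are examples): Hadoop record readers reuse Writable objects, so copy records before caching them:

    import org.apache.spark.SparkContext._
    import org.apache.hadoop.io.Text

    val pairs = sc.sequenceFile[Text, Text]("hdfs://namenode/data")
      .map { case (k, v) => (k.toString, v.toString) } // materialize copies
      .cache()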
Patrick Wendell cad3002fea Merge pull request #501 from JoshRosen/cartesian-rdd-fixes
Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034

This pull request fixes two bugs in PySpark's `cartesian()` method:

- [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception
- [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result

The JIRAs have more details describing the fixes.
2014-01-23 19:08:34 -08:00
Patrick Wendell 268ecbd231 Minor changes after auditing diff from earlier version 2014-01-23 18:30:11 -08:00
Josh Rosen f83068497b Fix for SPARK-1025: PySpark hang on missing files. 2014-01-23 18:24:51 -08:00
Patrick Wendell c58d4ea3d4 Response to Matei's review 2014-01-23 18:12:40 -08:00
Patrick Wendell 0213b4032a Fix bug on read-side of external sort when using Snappy.
This case wasn't handled correctly and this patch fixes it.
2014-01-23 18:04:55 -08:00
Patrick Wendell 7101017803 Remove Hadoop object cloning and warn users making Hadoop RDD's.
The code introduced in #359 used Hadoop's WritableUtils.clone() to
duplicate objects when reading from Hadoop files. Some users have
reported exceptions when cloning data in various file formats,
including Avro and another custom format.

This patch removes that functionality to ensure stability for the
0.9 release. Instead, it puts a clear warning in the documentation
that copying may be necessary for Hadoop data sets.
2014-01-23 17:39:23 -08:00
Josh Rosen 61569906cc Fix SPARK-978: ClassCastException in PySpark cartesian. 2014-01-23 15:09:19 -08:00
Josh Rosen 0035dbbc81 Fix SPARK-1034: Py4JException on PySpark Cartesian Result 2014-01-23 13:05:59 -08:00
Josh Rosen fad6aacfb0 Merge pull request #406 from eklavya/master
Extending Java API coverage

Hi,

I have added three new methods to JavaRDD.

Please review and merge.
2014-01-23 11:14:15 -08:00
eklavya 60e7457266 fixed ClassTag in mapPartitions 2014-01-23 17:40:36 +05:30
Patrick Wendell a1cd185122 Merge pull request #496 from pwendell/master
Fix bug in worker clean-up in UI

Introduced in d5a96fec (/cc @aarondav).

This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
2014-01-22 19:37:29 -08:00
Patrick Wendell 034dce2a7e Merge pull request #447 from CodingCat/SPARK-1027
fix for SPARK-1027

fix for SPARK-1027  (https://spark-project.atlassian.net/browse/SPARK-1027)

FIXES

1. change sparkHome from String to Option[String] in ApplicationDesc

2. remove the sparkHome parameter from the LaunchExecutor message

3. adjust involved files
2014-01-22 18:58:02 -08:00
Patrick Wendell 6285513147 Fix bug in worker clean-up in UI
Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
2014-01-22 18:19:52 -08:00
CodingCat 2b3c461451 refactor sparkHome to val
clean code
2014-01-22 20:20:46 -05:00
Kay Ousterhout 19da82c50f Fixed bug where task set managers are added to queue twice
This bug leads to a small performance hit because task
set managers will get offered each rejected resource
offer twice, but doesn't lead to any incorrect functionality.
2014-01-22 09:52:12 -08:00
Henry Saputra 90ea9d5a8f Replace the code to check for Option != None with Option.isDefined call in Scala code.
This hopefully will make the code cleaner.
2014-01-21 23:22:10 -08:00
Patrick Wendell a9bcc980b6 Style clean-up 2014-01-21 00:05:28 -08:00
Patrick Wendell a917a87e02 Adding small code comment 2014-01-20 23:11:45 -08:00
Patrick Wendell d46df96de3 Avoid matching attempt files in the checkpoint 2014-01-20 20:03:23 -08:00
Patrick Wendell de526ad527 Remove shuffle files if they are still present on a machine. 2014-01-20 19:11:22 -08:00
Patrick Wendell f84400e86c Fixing speculation bug 2014-01-20 19:05:03 -08:00
Patrick Wendell c324ac10ee Force use of LZF when spilling data 2014-01-20 19:00:48 -08:00
Patrick Wendell 1b299142a8 Bug fix for reporting of spill output 2014-01-20 18:34:00 -08:00
Patrick Wendell 54867e9566 Minor fixes 2014-01-20 18:33:21 -08:00
Patrick Wendell cdb003e376 Removing docs on akka options 2014-01-20 16:40:58 -08:00
CodingCat 29f4b6a2d9 fix for SPARK-1027
change TestClient & Worker to Some("xxx")

kill manager if it is started

remove unnecessary .get when fetching "SPARK_HOME" values
2014-01-20 02:50:30 -05:00
CodingCat f9a95d6736 executor creation failed should not make the worker restart 2014-01-20 02:50:30 -05:00
Thomas Graves dd56b2125e update comment 2014-01-19 12:21:39 -06:00
Thomas Graves ceb79a3931 Only log an error on a missing jar to allow the spark examples to run. 2014-01-19 12:16:58 -06:00
Yinan Li 584323c6b1 Addressed comments from Reynold
Signed-off-by: Yinan Li <liyinan926@gmail.com>
2014-01-18 21:28:17 -08:00
Patrick Wendell 73dfd42fba Merge pull request #437 from mridulm/master
Minor api usability changes

- Expose checkpoint directory - since it is autogenerated now
- null check for jars
- Expose SparkHadoopUtil, so that configuration creation is abstracted even from user code, to avoid duplicating functionality already in Spark.
2014-01-18 16:23:56 -08:00
Patrick Wendell bf5699543b Merge pull request #462 from mateiz/conf-file-fix
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911ce9a)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-01-18 16:20:00 -08:00
Yinan Li fd833e7ab1 Allow files added through SparkContext.addFile() to be overwritten
This is useful for the cases when a file needs to be refreshed and downloaded
by the executors periodically.

Signed-off-by: Yinan Li <liyinan926@gmail.com>
2014-01-18 15:26:59 -08:00
Patrick Wendell 5316bcac3c Use renamed shuffle spill config in CoGroupedRDD.scala 2014-01-18 11:58:42 -08:00
Mridul Muralidharan b690e11d9c Address review comment 2014-01-17 18:28:55 +05:30
Patrick Wendell d4fd89e3c8 Merge pull request #438 from ScrapCodes/clone-records-java-api
Clone records java api
2014-01-16 23:17:30 -08:00
Prashant Sharma fcb4fc653d adding clone records field to equivalent java apis 2014-01-17 11:16:03 +05:30
Mridul Muralidharan edd82c58a2 Use method, not variable 2014-01-16 17:26:42 +05:30
Mridul Muralidharan 1a0da89277 Address review comments 2014-01-16 17:23:25 +05:30
Reynold Xin c06a307ca2 Merge pull request #445 from kayousterhout/exec_lost
Fail rather than hanging if a task crashes the JVM.

Prior to this commit, if a task crashes the JVM, the task (and
all other tasks running on that executor) is marked as KILLED rather
than FAILED.  As a result, the TaskSetManager will retry the task
indefinitely rather than failing the job after maxFailures. Eventually,
this makes the job hang, because the Standalone Scheduler removes
the application after 10 workers have failed, and then the app is left
in a state where it's disconnected from the master and waiting to reconnect.
This commit fixes that problem by marking tasks as FAILED rather than
killed when an executor is lost.

The downside of this commit is that if task A fails because another
task running on the same executor caused the VM to crash, the failure
will incorrectly be counted as a failure of task A. This should not
be an issue because we typically set maxFailures to 3, and it is
unlikely that a task will be co-located with a JVM-crashing task
multiple times.
2014-01-15 23:47:25 -08:00
Kay Ousterhout a268d63411 Fail rather than hanging if a task crashes the JVM.
Prior to this commit, if a task crashes the JVM, the task (and
all other tasks running on that executor) is marked as KILLED rather
than FAILED.  As a result, the TaskSetManager will retry the task
indefinitely rather than failing the job after maxFailures. This
commit fixes that problem by marking tasks as FAILED rather than
killed when an executor is lost.

The downside of this commit is that if task A fails because another
task running on the same executor caused the VM to crash, the failure
will incorrectly be counted as a failure of task A. This should not
be an issue because we typically set maxFailures to 3, and it is
unlikely that a task will be co-located with a JVM-crashing task
multiple times.
2014-01-15 16:03:40 -08:00
Patrick Wendell 59f475c79f Merge pull request #442 from pwendell/standalone
Workers should use working directory as spark home if it's not specified

If users don't set SPARK_HOME in their environment file when launching an application, the standalone cluster should default to the spark home of the worker.
2014-01-15 13:55:14 -08:00
Patrick Wendell 00a3f7eec5 Workers should use working directory as spark home if it's not specified 2014-01-15 11:05:36 -08:00
Mridul Muralidharan 0aea33d39e Expose method and class - so that we can use it from user code (particularly since the checkpoint directory is autogenerated now) 2014-01-15 12:44:44 +05:30
Tathagata Das 0e15bd7827 Merge remote-tracking branch 'apache/master' into filestream-fix 2014-01-14 22:21:20 -08:00
Tathagata Das 1f4718c480 Changed SparkConf to not be serializable. And also fixed unit-test log paths in log4j.properties of external modules. 2014-01-14 22:20:14 -08:00
Reynold Xin 74b46acdc5 Merge pull request #428 from pwendell/writeable-objects
Don't clone records for text files
2014-01-14 14:59:13 -08:00
Reynold Xin d601a76d1f Merge pull request #427 from pwendell/deprecate-aggregator
Deprecate rather than remove old combineValuesByKey function
2014-01-14 14:52:24 -08:00
Patrick Wendell b1b22b7a13 Style fix 2014-01-14 13:56:27 -08:00
Patrick Wendell 8ea2cd56e4 Adding fix covering combineCombinersByKey as well 2014-01-14 13:52:23 -08:00
Patrick Wendell b683608c9f Deprecate rather than remove old combineValuesByKey function 2014-01-14 12:15:10 -08:00
Patrick Wendell 6f965a46a9 Don't clone records for text files 2014-01-14 11:57:53 -08:00
Reynold Xin f12e506c9e Fixed a typo in JavaSparkContext's API doc. 2014-01-14 11:42:28 -08:00
Reynold Xin 1b5623fd0b Maintain Serializable API compatibility by reverting back to java.io.Serializable for Broadcast and Accumulator. 2014-01-14 11:30:59 -08:00
Reynold Xin 55db77416b Added license header for package.scala in the Java API package. 2014-01-14 11:20:12 -08:00
Reynold Xin f8c12e9457 Added package doc for the Java API. 2014-01-14 11:16:25 -08:00
Reynold Xin 6a12b9ebc5 Updated API doc for Accumulable and Accumulator. 2014-01-14 11:16:08 -08:00
Reynold Xin 71b3007dbd Broadcast variable visibility change & doc update.
Note that previously Broadcast class was accidentally marked as private[spark]. It needs to be public
for broadcast variables to work. Also exposing the broadcast variable id.
2014-01-14 11:15:21 -08:00
Patrick Wendell 23034798d7 Add missing header files 2014-01-14 01:17:13 -08:00
Saurabh Rawat 1442cd5d50 Modifications as suggested in PR feedback:
- more variants of mapPartitions added to JavaRDDLike
- move setGenerator to JavaRDDLike
- clean up
2014-01-14 14:19:02 +05:30
Patrick Wendell 0984647aae Enable compression by default for spills 2014-01-13 23:25:25 -08:00
Patrick Wendell 4a805aff5e Merge pull request #367 from ankurdave/graphx
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
- [x] - Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
- [x] - Mention future Bagel support in docs
- [ ] - Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again.
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
2014-01-13 22:58:38 -08:00
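A brief usage sketch of the merged API (editorial; the path is a placeholder, and an existing SparkContext `sc` is assumed):

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs://namenode/edges.txt")
    // Run PageRank to a tolerance of 0.001 and inspect per-vertex ranks.
    val ranks = graph.pageRank(0.001).vertices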
Patrick Wendell 945fe7a37e Merge pull request #408 from pwendell/external-serializers
Improvements to external sorting

1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 22:56:12 -08:00
Patrick Wendell 68641bce61 Merge pull request #413 from rxin/scaladoc
Adjusted visibility of various components and documentation for 0.9.0 release.
2014-01-13 22:54:13 -08:00
Patrick Wendell 0ca0d4d657 Merge pull request #401 from andrewor14/master
External sorting - Add number of bytes spilled to Web UI

Additionally, update test suite for external sorting to induce spilling.
2014-01-13 22:32:21 -08:00