Commit graph

6601 commits

Author SHA1 Message Date
witgo cc2655a237 Fix SPARK-1256: Master web UI and Worker web UI return a 404 error
Author: witgo <witgo@qq.com>

Closes #150 from witgo/SPARK-1256 and squashes the following commits:

08044a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1256
c99b030 [witgo] Fix SPARK-1256
2014-03-18 21:57:47 -07:00
Xiangrui Meng f9d8a83c00 [SPARK-1266] persist factors in implicit ALS
In implicit ALS computation, the user or product factor is used twice in each iteration. Caching can certainly help accelerate the computation. I saw the running time decrease by ~70% for implicit ALS on the MovieLens data.

I also made the following changes:

1. Change `YtYb` type from `Broadcast[Option[DoubleMatrix]]` to `Option[Broadcast[DoubleMatrix]]`, so we don't need to broadcast None in explicit computation.

2. Mark methods `computeYtY`, `unblockFactors`, `updateBlock`, and `updateFeatures` private. Users do not need those methods.

3. Materialize the final matrix factors before returning the model. It allows us to clean up other cached RDDs before returning the model. I do not have a better solution here, so I use `RDD.count()`.
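
For illustration, a hedged Scala sketch of the persist-then-count pattern described above (the helper and names are hypothetical, not the actual ALS internals):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: cache a factor RDD that is reused within an iteration,
// materialize the final factors with count(), and only then unpersist the
// intermediate cached RDDs.
def finalizeFactors(factors: RDD[(Int, Array[Double])],
                    intermediates: Seq[RDD[_]]): RDD[(Int, Array[Double])] = {
  factors.persist(StorageLevel.MEMORY_AND_DISK)
  factors.count()                       // forces materialization of the factors
  intermediates.foreach(_.unpersist())  // safe to drop cached inputs afterwards
  factors
}
```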

JIRA: https://spark-project.atlassian.net/browse/SPARK-1266

Author: Xiangrui Meng <meng@databricks.com>

Closes #165 from mengxr/als and squashes the following commits:

c9676a6 [Xiangrui Meng] add a comment about the last products.persist
d3a88aa [Xiangrui Meng] change implicitPrefs match to if ... else ...
63862d6 [Xiangrui Meng] persist factors in implicit ALS
2014-03-18 17:20:42 -07:00
Xiangrui Meng e108b9ab94 [SPARK-1260]: faster construction of features with intercept
The current implementation uses `Array(1.0, features: _*)` to construct a new array with intercept. This is not efficient for big arrays because `Array.apply` uses a for loop that iterates over the arguments. `Array.+:` is a better choice here.
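
As a hedged illustration of the difference (a standalone snippet, not the actual MLlib code):

```scala
val features = Array(0.5, 1.2, -0.3)

// Slower for large arrays: Array.apply iterates over its varargs one by one.
val withInterceptSlow = Array(1.0, features: _*)

// Faster per the description above: +: prepends without the varargs loop.
val withInterceptFast = 1.0 +: features
```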

Also, I don't see a reason to set the initial weights to ones, so I set them to zeros.

JIRA: https://spark-project.atlassian.net/browse/SPARK-1260

Author: Xiangrui Meng <meng@databricks.com>

Closes #161 from mengxr/sgd and squashes the following commits:

b5cfc53 [Xiangrui Meng] set default weights to zeros
a1439c2 [Xiangrui Meng] faster construction of features with intercept
2014-03-18 15:14:13 -07:00
Matei Zaharia 79e547fe5a Update copyright year in NOTICE to 2014
Author: Matei Zaharia <matei@databricks.com>

Closes #174 from mateiz/update-notice and squashes the following commits:

47fc1a5 [Matei Zaharia] Update copyright year in NOTICE to 2014
2014-03-18 14:34:31 -07:00
CodingCat 2fa26ec02f SPARK-1102: Create a saveAsNewAPIHadoopDataset method
https://spark-project.atlassian.net/browse/SPARK-1102

Create a saveAsNewAPIHadoopDataset method

By @mateiz: "Right now RDDs can only be saved as files using the new Hadoop API, not as "datasets" with no filename and just a JobConf. See http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ for an example of how you have to give a bogus filename. For the old Hadoop API, we have saveAsHadoopDataset."
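
A hedged usage sketch (the output format and path below are illustrative; a connector such as the MongoDB one in the linked post would instead carry its target in the Configuration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "saveAsNewAPIHadoopDataset example")
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// All output settings live in the Configuration; the save call itself takes
// no filename argument.
val job = Job.getInstance(new Configuration())
job.setOutputFormatClass(classOf[TextOutputFormat[String, Int]])
job.setOutputKeyClass(classOf[String])
job.setOutputValueClass(classOf[java.lang.Integer])
FileOutputFormat.setOutputPath(job, new Path("/tmp/dataset-out"))

pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)
```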

Author: CodingCat <zhunansjtu@gmail.com>

Closes #12 from CodingCat/SPARK-1102 and squashes the following commits:

6ba0c83 [CodingCat] add test cases for saveAsHadoopDataSet (new&old API)
a8d11ba [CodingCat] style fix.........
95a6929 [CodingCat] code clean
7643c88 [CodingCat] change the parameter type back to Configuration
a8583ee [CodingCat] Create a saveAsNewAPIHadoopDataset method
2014-03-18 11:06:18 -07:00
Patrick Wendell e7423d4040 Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
This reverts commit ca4bf8c572.

Jetty 9 requires JDK7, which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the meantime, to ease some pain, let's revert this. Sorry for not catching this during the initial review. cc/ @rxin

Author: Patrick Wendell <pwendell@gmail.com>

Closes #167 from pwendell/jetty-revert and squashes the following commits:

811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
2014-03-18 00:46:03 -07:00
Dan McClary e3681f26fa Spark 1246 add min max to stat counter
Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.

Author: Dan McClary <dan.mcclary@gmail.com>

Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:

fd3fd4b [Dan McClary] fixed  error, updated test
82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
1e7056d [Dan McClary] added underscore to getBucket
37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
29981f2 [Dan McClary] fixed indentation on doctest comment
eaf89d9 [Dan McClary] added correct doctest for histogram
4916016 [Dan McClary] added histogram method, added max and min to statscounter
2014-03-18 00:45:47 -07:00
Diana Carroll 087eedca32 [Spark-1261] add instructions for running python examples to doc overview page
Author: Diana Carroll <dcarroll@cloudera.com>

Closes #162 from dianacarroll/SPARK-1261 and squashes the following commits:

14ac602 [Diana Carroll] typo in python example text
5121e3e [Diana Carroll] Add explanation of how to run Python examples to main doc overview page
2014-03-17 17:35:51 -07:00
Patrick Wendell 796977acdb SPARK-1244: Throw exception if map output status exceeds frame size
This is a very small change on top of @andrewor14's patch in #147.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #152 from pwendell/akka-frame and squashes the following commits:

e5fb3ff [Patrick Wendell] Reversing test order
393af4c [Patrick Wendell] Small improvement suggested by Andrew Or
8045103 [Patrick Wendell] Breaking out into two tests
2b4e085 [Patrick Wendell] Consolidate Executor use of akka frame size
c9b6109 [Andrew Or] Simplify test + make access to akka frame size more modular
281d7c9 [Andrew Or] Throw exception on spark.akka.frameSize exceeded + Unit tests
2014-03-17 14:03:32 -07:00
CodingCat dc9654638f SPARK-1240: handle the case of empty RDD when takeSample
https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle the empty-RDD case when takeSample is run.

In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; also in sample(), I add a check for invalid fraction values.

In the test case, I also add several lines covering this case.
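
A hedged sketch of the guard described above (a standalone illustration, not the actual patch):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Return an empty array for an empty RDD and validate the request before
// delegating to sample(); num and seed behave as in takeSample.
def takeSampleSafe[T: ClassTag](rdd: RDD[T], withReplacement: Boolean,
                                num: Int, seed: Long): Array[T] = {
  require(num >= 0, s"Negative number of elements requested: $num")
  val total = rdd.count()
  if (total == 0L || num == 0) {
    Array.empty[T]            // empty RDD (or nothing requested): no sampling
  } else {
    val fraction = math.min(num.toDouble / total, 1.0)
    rdd.sample(withReplacement, fraction, seed).take(num)
  }
}
```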

Author: CodingCat <zhunansjtu@gmail.com>

Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:

fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty red
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample
2014-03-16 22:14:59 -07:00
Reynold Xin f5486e9f75 SPARK-1255: Allow user to pass Serializer object instead of class name for shuffle.
This is more general than simply passing a string name and leaves more room for performance optimizations.

Note that this is technically an API breaking change in the following two ways:
1. The shuffle serializer specification in ShuffleDependency now requires an object instead of a String (the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark.
2. Serializers in Spark are now required to be serializable.

Author: Reynold Xin <rxin@apache.org>

Closes #149 from rxin/serializer and squashes the following commits:

5acaccd [Reynold Xin] Properly call serializer's constructors.
2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency.
7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.
2014-03-16 09:57:21 -07:00
Sean Owen 97e4459e1e SPARK-1254. Consolidate, order, and harmonize repository declarations in Maven/SBT builds
This suggestion addresses a few minor suboptimalities with how repositories are handled.

1) Use HTTPS consistently to access repos, instead of HTTP

2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.)

3) Update SBT build to match Maven build in this regard

4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one.

Author: Sean Owen <sowen@cloudera.com>

Closes #145 from srowen/SPARK-1254 and squashes the following commits:

42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
2014-03-15 16:44:34 -07:00
Michael Armbrust e19044cb10 Fix serialization of MutablePair. Also provide an interface for easy updating.
Author: Michael Armbrust <michael@databricks.com>

Closes #141 from marmbrus/mutablePair and squashes the following commits:

f5c4783 [Michael Armbrust] Change function name to update
8bfd973 [Michael Armbrust] Fix serialization of MutablePair.  Also provide an interface for easy updating.
2014-03-14 11:40:26 -07:00
Tianshuo Deng 181b130a0c [bugfix] wrong client arg, should use executor-cores
The client arg is wrong; it should be executor-cores. It causes the executor to fail to start when executor-cores is specified.

Author: Tianshuo Deng <tdeng@twitter.com>

Closes #138 from tsdeng/bugfix_wrong_client_args and squashes the following commits:

304826d [Tianshuo Deng] wrong client arg, should use executor-cores
2014-03-13 20:27:36 -07:00
Reynold Xin ca4bf8c572 SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.
Author: Reynold Xin <rxin@apache.org>

Closes #113 from rxin/jetty9 and squashes the following commits:

867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file.
d7c97ca [Reynold Xin] Return the correctly bound port.
d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.
2014-03-13 12:16:04 -07:00
Sandy Ryza 698373211e SPARK-1183. Don't use "worker" to mean executor
Author: Sandy Ryza <sandy@cloudera.com>

Closes #120 from sryza/sandy-spark-1183 and squashes the following commits:

5066a4a [Sandy Ryza] Remove "worker" in a couple comments
0bd1e46 [Sandy Ryza] Remove --am-class from usage
bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha
607539f [Sandy Ryza] Address review comments
74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
2014-03-13 12:11:33 -07:00
Xiangrui Meng e4e8d8f395 [SPARK-1237, 1238] Improve the computation of YtY for implicit ALS
Computing YtY can be implemented using BLAS's DSPR operations instead of generating y_i y_i^T and then combining them. The latter generates many k-by-k matrices. On the MovieLens data, this change improves performance by 10-20%. The algorithm remains the same, as verified by computing the RMSE on the MovieLens data.

To compare the results, I also added an option to set a random seed in ALS.
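
A hedged, plain-Scala stand-in for the DSPR idea (not the actual MLlib code): accumulate YtY as a packed upper-triangular matrix via rank-1 updates, instead of materializing a dense k-by-k matrix y_i y_i^T per row.

```scala
// `packed` holds the upper triangle of a symmetric k-by-k matrix in
// column-major packed storage (length k * (k + 1) / 2), as DSPR expects.
def addOuterProductUpper(k: Int, y: Array[Double], packed: Array[Double]): Unit = {
  var idx = 0
  var j = 0
  while (j < k) {        // column j
    var i = 0
    while (i <= j) {     // rows 0..j of the upper triangle
      packed(idx) += y(i) * y(j)
      idx += 1
      i += 1
    }
    j += 1
  }
}
```

Summing these updates over all factor vectors yields YtY without allocating a k-by-k matrix per row.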

JIRA:
1. https://spark-project.atlassian.net/browse/SPARK-1237
2. https://spark-project.atlassian.net/browse/SPARK-1238

Author: Xiangrui Meng <meng@databricks.com>

Closes #131 from mengxr/als and squashes the following commits:

ed00432 [Xiangrui Meng] minor changes
d984623 [Xiangrui Meng] minor changes
2fc1641 [Xiangrui Meng] remove commented code
4c7cde2 [Xiangrui Meng] allow specifying a random seed in ALS
200bef0 [Xiangrui Meng] optimize computeYtY and updateBlock
2014-03-13 00:43:19 -07:00
Patrick Wendell 4ea23db0ef SPARK-1019: pyspark RDD take() throws an NPE
Author: Patrick Wendell <pwendell@gmail.com>

Closes #112 from pwendell/pyspark-take and squashes the following commits:

daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE
2014-03-12 23:16:59 -07:00
CodingCat 6bd2eaa4a5 hot fix for PR105 - change to Java annotation
Author: CodingCat <zhunansjtu@gmail.com>

Closes #133 from CodingCat/SPARK-1160-2 and squashes the following commits:

6607155 [CodingCat] hot fix for PR105 - change to Java annotation
2014-03-12 19:49:18 -07:00
jianghan 31a704004f Fix example bug: compile error
Author: jianghan <jianghan@xiaomi.com>

Closes #132 from pooorman/master and squashes the following commits:

54afbe0 [jianghan] Fix example bug: compile error
2014-03-12 19:46:12 -07:00
CodingCat 9032f7c0d5 SPARK-1160: Deprecate toArray in RDD
https://spark-project.atlassian.net/browse/SPARK-1160

reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python."

In this patch, I deprecated the method and changed the source files that use it by replacing toArray with collect() directly.
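
For user code, the replacement is mechanical; a minimal sketch, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 5)

// Deprecated by this patch:
// val values = rdd.toArray()

// Preferred, equivalent call:
val values: Array[Int] = rdd.collect()
```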

Author: CodingCat <zhunansjtu@gmail.com>

Closes #105 from CodingCat/SPARK-1060 and squashes the following commits:

286f163 [CodingCat] deprecate in JavaRDDLike
ee17b4e [CodingCat] add message and since
2ff7319 [CodingCat] deprecate toArray in RDD
2014-03-12 17:43:12 -07:00
Prashant Sharma b8afe30520 SPARK-1162 Added top in python.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:

ece1fa4 [Prashant Sharma] Added top in python.
2014-03-12 15:57:44 -07:00
liguoqiang 5d1ec64e79 Fix #SPARK-1149 Bad partitioners can cause Spark to hang
Author: liguoqiang <liguoqiang@rd.tuan800.com>

Closes #44 from witgo/SPARK-1149 and squashes the following commits:

3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149
8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
3dad595 [liguoqiang] review comment
e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149
b0d5c07 [liguoqiang] review comment
d0a6005 [liguoqiang] review comment
3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
ac006a3 [liguoqiang] code Formatting
3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149
adc443e [liguoqiang] partitions check  bugfix
928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner
db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149
1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149
3348619 [liguoqiang] Optimize performance for partitions check
61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149
e68210a [liguoqiang] add partition index check to submitJob
3a65903 [liguoqiang] make the code more readable
6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang
2014-03-12 13:00:04 -07:00
Thomas Graves b5162f4426 [SPARK-1233] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_M...
...APREDUCE_APPLICATION_CLASSPATH

Author: Thomas Graves <tgraves@apache.org>

Closes #129 from tgravescs/SPARK-1233 and squashes the following commits:

85ff5a6 [Thomas Graves] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH
2014-03-12 11:25:41 -07:00
Thomas Graves c8c59b326e [SPARK-1232] Fix the hadoop 0.23 yarn build
Author: Thomas Graves <tgraves@apache.org>

Closes #127 from tgravescs/SPARK-1232 and squashes the following commits:

c05cfd4 [Thomas Graves] Fix the hadoop 0.23 yarn build
2014-03-12 10:32:01 -07:00
prabinb af7f2f1090 Spark-1163, Added missing Python RDD functions
Author: prabinb <prabin.banka@imaginea.com>

Closes #92 from prabinb/python-api-rdd and squashes the following commits:

51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
2014-03-11 23:57:05 -07:00
Sandy Ryza 2409af9dcf SPARK-1064
This reopens PR 649 from incubator-spark against the new repo

Author: Sandy Ryza <sandy@cloudera.com>

Closes #102 from sryza/sandy-spark-1064 and squashes the following commits:

270e490 [Sandy Ryza] Handle different application classpath variables in different versions
88b04e0 [Sandy Ryza] SPARK-1064. Make it possible to run on YARN without bundling Hadoop jars in Spark assembly
2014-03-11 22:39:17 -07:00
Patrick Wendell 16788a6542 SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues...
This patch removes Ganglia integration from the default build. It
allows users willing to link against LGPL code to use Ganglia
by adding build flags or linking against a new Spark artifact called
spark-ganglia-lgpl.

This brings Spark in line with the Apache policy on LGPL code
enumerated here:

https://www.apache.org/legal/3party.html#options-optional

Author: Patrick Wendell <pwendell@gmail.com>

Closes #108 from pwendell/ganglia and squashes the following commits:

326712a [Patrick Wendell] Responding to review feedback
5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.
2014-03-11 11:16:59 -07:00
Sandy Ryza 2a2c9645e4 SPARK-1211. In ApplicationMaster, set spark.master system property to "y...
...arn-cluster"

Author: Sandy Ryza <sandy@cloudera.com>

Closes #118 from sryza/sandy-spark-1211 and squashes the following commits:

d4001c7 [Sandy Ryza] SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster"
2014-03-10 17:42:33 -07:00
Patrick Wendell 2a5161708f SPARK-1205: Clean up callSite/origin/generator.
This patch removes the `generator` field and simplifies + documents
the tracking of callsites.

There are two places where we care about call sites, when a job is
run and when an RDD is created. This patch retains both of those
features but does a slight refactoring and renaming to make things
less confusing.

There was another feature of an RDD called the `generator`, which was by
default the user class in which the RDD was created. This is used
exclusively in the JobLogger. It has been subsumed by the ability to name
a job group. The job logger can later be refactored to read the job group
directly (this will require some work), but for now this just preserves
the default logged value of the user class. I'm not sure any users ever
used the ability to override this.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #106 from pwendell/callsite and squashes the following commits:

fc1d009 [Patrick Wendell] Compile fix
e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite
62e77ef [Patrick Wendell] Review feedback
576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.
2014-03-10 16:28:41 -07:00
Prashant Sharma a59419c27e SPARK-1168, Added foldByKey to pyspark.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:

db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
2014-03-10 13:37:11 -07:00
jyotiska f5518989b6 [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted)
Author: jyotiska <jyotiska123@gmail.com>

Closes #34 from jyotiska/pyspark_code and squashes the following commits:

c9439be [jyotiska] replaced dict with namedtuple
a6bf4cd [jyotiska] added callsite info for context.py
2014-03-10 13:34:49 -07:00
Prabin Banka e1e09e0ef6 SPARK-977 Added Python RDD.zip function
was raised earlier as a part of  apache/incubator-spark#486

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #76 from prabinb/python-api-zip and squashes the following commits:

b1a31a0 [Prabin Banka] Added Python RDD.zip function
2014-03-10 13:27:00 -07:00
Chen Chao 5d98cfc1c8 maintain arbitrary state data for each key
RT

Author: Chen Chao <crazyjvm@gmail.com>

Closes #114 from CrazyJvm/patch-1 and squashes the following commits:

dcb0df5 [Chen Chao] maintain arbitrary state data for each key
2014-03-09 22:42:12 -07:00
Patrick Wendell b9be160951 SPARK-782 Clean up for ASM dependency.
This makes two changes.

1) Spark uses the shaded version of asm that is (conveniently) published
   with Kryo.
2) Existing exclude rules around asm are updated to reflect the new groupId
   of `org.ow2.asm`; the old rules no longer worked with newer Hadoop
   versions, which pull in newer asm versions.
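
A hedged sbt sketch of the updated exclusion pattern (the Hadoop coordinates and version are illustrative):

```scala
// Exclude both the old and the relocated asm coordinates from a Hadoop
// dependency, relying on the asm classes shaded into Kryo instead.
libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "2.2.0")
  .exclude("org.ow2.asm", "asm")
  .exclude("asm", "asm")
```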

Author: Patrick Wendell <pwendell@gmail.com>

Closes #100 from pwendell/asm and squashes the following commits:

9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.
2014-03-09 13:17:07 -07:00
Patrick Wendell faf4cad1de Fix markup errors introduced in #33 (SPARK-1189)
These were causing errors on the configuration page.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #111 from pwendell/master and squashes the following commits:

8467a86 [Patrick Wendell] Fix markup errors introduced in #33 (SPARK-1189)
2014-03-09 11:57:06 -07:00
Jiacheng Guo f6f9d02e85 Add timeout for fetch file
Currently, when fetching a file, the connection's connect timeout and read
timeout are based on the default JVM settings; in this change, I switch them
to use spark.worker.timeout. This can be useful when the connection between
workers is not perfect, and it prevents task sets from being removed
prematurely.
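
A hedged sketch of the change (the property name follows this commit; the real fetch code differs):

```scala
import java.net.URL

// Apply spark.worker.timeout (in seconds) to the fetch connection instead of
// relying on the JVM's default connect/read timeouts.
val timeoutMs = System.getProperty("spark.worker.timeout", "60").toInt * 1000
val connection = new URL("http://example.com/some-file.jar").openConnection()
connection.setConnectTimeout(timeoutMs)
connection.setReadTimeout(timeoutMs)
```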

Author: Jiacheng Guo <guojc03@gmail.com>

Closes #98 from guojc/master and squashes the following commits:

abfe698 [Jiacheng Guo] add space according request
2a37c34 [Jiacheng Guo] Add timeout for fetch file     Currently, when fetch a file, the connection's connect timeout     and read timeout is based on the default jvm setting, in this change, I change it to     use spark.worker.timeout. This can be usefull, when the     connection status between worker is not perfect. And prevent     prematurely remove task set.
2014-03-09 11:38:40 -07:00
Aaron Davidson 52834d761b SPARK-929: Fully deprecate usage of SPARK_MEM
(Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615)

This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables:
SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY

The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.

SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.

SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory run by jobs launched by spark-class, without possibly affecting executor memory.

Other memory considerations:
- The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
- run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class).

This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.
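
For reference, a hedged sketch of the property-based executor memory configuration that replaces SPARK_MEM (the values are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure executor memory through spark.executor.memory rather than the
// deprecated SPARK_MEM environment variable.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("memory-config-example")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
```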

Author: Aaron Davidson <aaron@databricks.com>

Closes #99 from aarondav/sparkmem and squashes the following commits:

9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
2014-03-09 11:08:39 -07:00
Patrick Wendell e59a3b6c41 SPARK-1190: Do not initialize log4j if slf4j log4j backend is not being used
Author: Patrick Wendell <pwendell@gmail.com>

Closes #107 from pwendell/logging and squashes the following commits:

be21c11 [Patrick Wendell] Logging fix
2014-03-08 16:02:42 -08:00
Reynold Xin c2834ec081 Update junitxml plugin to the latest version to avoid recompilation in every SBT command.
Author: Reynold Xin <rxin@apache.org>

Closes #104 from rxin/junitxml and squashes the following commits:

67ef7bf [Reynold Xin] Update junitxml plugin to the latest version to avoid recompilation in every SBT command.
2014-03-08 12:40:26 -08:00
Cheng Lian 0b7b7fd45c [SPARK-1194] Fix the same-RDD rule for cache replacement
SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194

In the current implementation, when selecting candidate blocks to be swapped out, once we find a block belonging to the same RDD as the block to be stored, cache eviction fails and aborts.

In this PR, we keep selecting blocks *not* from the RDD that the block to be stored belongs to until either enough free space can be ensured (cache eviction succeeds) or all such blocks are checked (cache eviction fails).
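
A hedged sketch of the selection rule described above (the types and names are illustrative, not the MemoryStore internals):

```scala
import scala.collection.mutable.ArrayBuffer

// Keep scanning candidate blocks, skipping any that belong to the same RDD as
// the block being stored, until enough space is freed (Some(selected)) or all
// candidates are exhausted (None) -- instead of aborting on the first match.
def selectBlocksToEvict(candidates: Seq[(String, Long)],   // (blockId, size)
                        rddIdOfNewBlock: Option[String],   // e.g. Some("rdd_5")
                        spaceNeeded: Long): Option[Seq[String]] = {
  var freed = 0L
  val selected = ArrayBuffer.empty[String]
  for ((blockId, size) <- candidates if freed < spaceNeeded) {
    val sameRdd = rddIdOfNewBlock.exists(id => blockId.startsWith(id + "_"))
    if (!sameRdd) {
      selected += blockId
      freed += size
    }
  }
  if (freed >= spaceNeeded) Some(selected.toSeq) else None
}
```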

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #96 from liancheng/fix-spark-1194 and squashes the following commits:

2524ab9 [Cheng Lian] Added regression test case for SPARK-1194
6e40c22 [Cheng Lian] Remove redundant comments
40cdcb2 [Cheng Lian] Bug fix, and addressed PR comments from @mridulm
62c92ac [Cheng Lian] Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194
2014-03-07 23:26:46 -08:00
Reynold Xin 8ad486add9 Allow sbt to use more than 1G of heap.
There was a mistake in the sbt build file (introduced by 012bd5fbc9) in which we set the default to 2048 and then immediately reset it to 1024.

Without this, building Spark can run out of permgen space on my machine.

Author: Reynold Xin <rxin@apache.org>

Closes #103 from rxin/sbt and squashes the following commits:

8829c34 [Reynold Xin] Allow sbt to use more than 1G of heap.
2014-03-07 23:23:59 -08:00
Sandy Ryza a99fb3747a SPARK-1193. Fix indentation in pom.xmls
Author: Sandy Ryza <sandy@cloudera.com>

Closes #91 from sryza/sandy-spark-1193 and squashes the following commits:

a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls
2014-03-07 23:10:35 -08:00
Prashant Sharma 6e730edcde Spark 1165 rdd.intersection in python and java
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:

9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
1fea813 [Prashant Sharma] correct the lines wrapping
d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
2014-03-07 18:48:07 -08:00
Thomas Graves b7cd9e992c SPARK-1195: set map_input_file environment variable in PipedRDD
Hadoop uses the config mapreduce.map.input.file to indicate the input filename to the map when the input split is of type FileSplit. Some of the hadoop input and output formats set or use this config. This config can also be used by user code.
PipedRDD runs an external process, and those configs aren't available to that process. Hadoop Streaming does something very similar; it makes configs available by exporting them into the environment, replacing '.' with '_'. Spark should also export this variable when launching the pipe command so the user code has access to that config.
Note that mapreduce.map.input.file is the new config; the old one, map.input.file, is deprecated but not yet removed, so we should handle both.

Perhaps it would be better to abstract this out somehow so it goes into the HadoopPartition code?
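
A hedged sketch of the export rule (a standalone helper, not the actual HadoopRDD/PipedRDD code):

```scala
// Turn Hadoop config keys into environment-variable names the way Hadoop
// Streaming does: replace '.' with '_'.
def toPipeEnvVars(confs: Map[String, String]): Map[String, String] =
  confs.map { case (key, value) => (key.replace('.', '_'), value) }

// toPipeEnvVars(Map("mapreduce.map.input.file" -> "hdfs:///data/part-00000"))
//   == Map("mapreduce_map_input_file" -> "hdfs:///data/part-00000")
```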

Author: Thomas Graves <tgraves@apache.org>

Closes #94 from tgravescs/map_input_file and squashes the following commits:

cc97a6a [Thomas Graves] Update test to check for existence of command, add a getPipeEnvVars function to HadoopRDD
e3401dc [Thomas Graves] Merge remote-tracking branch 'upstream/master' into map_input_file
2ba805e [Thomas Graves] set map_input_file environment variable in PipedRDD
2014-03-07 10:36:55 -08:00
Aaron Davidson dabeb6f160 SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1
This patch allows the FaultToleranceTest to work in newer versions of Docker.
See https://spark-project.atlassian.net/browse/SPARK-1136 for more details.

Besides changing the Docker and FaultToleranceTest internals, this patch also changes the behavior of Master to accept new Workers which share an address with a Worker that we are currently trying to recover. This can only happen when the Worker itself was restarted and got the same IP address/port at the same time as a Master recovery occurs.

Finally, this adds a good bit of ASCII art to the test to make failures, successes, and actions more apparent. This is very much needed.

Author: Aaron Davidson <aaron@databricks.com>

Closes #5 from aarondav/zookeeper and squashes the following commits:

5d7a72a [Aaron Davidson] SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1
2014-03-07 10:22:27 -08:00
Patrick Wendell 33baf14b04 Small clean-up to flatmap tests 2014-03-06 17:57:31 -08:00
anitatailor 9ae919c02f Example for cassandra CQL read/write from spark
Cassandra read/write using CqlPagingInputFormat/CqlOutputFormat

Author: anitatailor <tailor.anita@gmail.com>

Closes #87 from anitatailor/master and squashes the following commits:

3493f81 [anitatailor] Fixed scala style as per review
19480b7 [anitatailor] Example for cassandra CQL read/write from spark
2014-03-06 17:46:43 -08:00
Sandy Ryza 328c73d037 SPARK-1197. Change yarn-standalone to yarn-cluster and fix up running on YARN docs
This patch changes "yarn-standalone" to "yarn-cluster" (but still supports the former).  It also cleans up the Running on YARN docs and adds a section on how to view logs.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #95 from sryza/sandy-spark-1197 and squashes the following commits:

563ef3a [Sandy Ryza] Review feedback
6ad06d4 [Sandy Ryza] Change yarn-standalone to yarn-cluster and fix up running on YARN docs
2014-03-06 17:12:58 -08:00
Thomas Graves 7edbea41b4 SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets
Resubmitted pull request; was https://github.com/apache/incubator-spark/pull/332.

Author: Thomas Graves <tgraves@apache.org>

Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits:

dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser
05eebed [Thomas Graves] Fix dependency lost in upmerge
d1040ec [Thomas Graves] Fix up various imports
05ff5e0 [Thomas Graves] Fix up imports after upmerging to master
ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase
13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests.
4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets
2f77147 [Thomas Graves] Rework from comments
50dd9f2 [Thomas Graves] fix header in SecurityManager
ecbfb65 [Thomas Graves] Fix spacing and formatting
b514bec [Thomas Graves] Fix reference to config
ed3d1c1 [Thomas Graves] Add security.md
6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments
2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework
5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils
f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets
2014-03-06 18:27:50 -06:00