Commit graph

6569 commits

Author SHA1 Message Date
Prabin Banka e1e09e0ef6 SPARK-977 Added Python RDD.zip function
This was raised earlier as part of apache/incubator-spark#486.

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #76 from prabinb/python-api-zip and squashes the following commits:

b1a31a0 [Prabin Banka] Added Python RDD.zip function
2014-03-10 13:27:00 -07:00
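
A usage sketch of zip, shown with the Scala RDD API that the new Python method mirrors (the app name and data are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object ZipSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("zip-sketch").setMaster("local[2]"))
        // zip pairs elements by position; both RDDs must have the same number
        // of partitions and the same number of elements per partition.
        val ids   = sc.parallelize(Seq(1, 2, 3), 2)
        val names = sc.parallelize(Seq("a", "b", "c"), 2)
        ids.zip(names).collect().foreach(println)  // (1,a) (2,b) (3,c)
        sc.stop()
      }
    }
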
Chen Chao 5d98cfc1c8 maintain arbitrary state data for each key
Author: Chen Chao <crazyjvm@gmail.com>

Closes #114 from CrazyJvm/patch-1 and squashes the following commits:

dcb0df5 [Chen Chao] maintain arbitrary state data for each key
2014-03-09 22:42:12 -07:00
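
The doc line maintained here describes stateful streaming; a hedged sketch of updateStateByKey, the operation that keeps arbitrary per-key state across batches (host, port, and checkpoint path are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits in this era

    object StateSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("state-sketch").setMaster("local[2]"), Seconds(1))
        ssc.checkpoint("/tmp/state-checkpoint")  // stateful operations require checkpointing
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        // Fold each batch's new values into the running state for the key.
        val counts = words.map(w => (w, 1)).updateStateByKey[Int] {
          (newValues: Seq[Int], running: Option[Int]) =>
            Some(newValues.sum + running.getOrElse(0))
        }
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }
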
Patrick Wendell b9be160951 SPARK-782 Clean up for ASM dependency.
This makes two changes.

1) Spark uses the shaded version of asm that is (conveniently) published
   with Kryo.
2) Existing exclude rules around asm are updated to reflect the new groupId
   of `org.ow2.asm`. The groupId change had made all of the old rules ineffective
   with newer Hadoop versions, which pull in new asm versions.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #100 from pwendell/asm and squashes the following commits:

9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.
2014-03-09 13:17:07 -07:00
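
An illustrative build.sbt fragment for change (2), assuming sbt's usual exclusion mechanism (the Hadoop coordinates are an example, not taken from the patch):

    // Exclude rules must now target asm's new groupId, org.ow2.asm, which newer
    // Hadoop versions pull in; rules against the old "asm" groupId no longer match.
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" excludeAll(
      ExclusionRule(organization = "org.ow2.asm"),  // asm 4.x coordinates
      ExclusionRule(organization = "asm")           // legacy asm 3.x coordinates
    )
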
Patrick Wendell faf4cad1de Fix markup errors introduced in #33 (SPARK-1189)
These were causing errors on the configuration page.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #111 from pwendell/master and squashes the following commits:

8467a86 [Patrick Wendell] Fix markup errors introduced in #33 (SPARK-1189)
2014-03-09 11:57:06 -07:00
Jiacheng Guo f6f9d02e85 Add timeout for fetch file
Currently, when fetching a file, the connection's connect timeout and read timeout
are based on the default JVM settings. This change switches them to use
spark.worker.timeout, which can be useful when the connection between workers is
not perfect, and prevents task sets from being removed prematurely. (A sketch
follows this entry.)

Author: Jiacheng Guo <guojc03@gmail.com>

Closes #98 from guojc/master and squashes the following commits:

abfe698 [Jiacheng Guo] add space according to request
2a37c34 [Jiacheng Guo] Add timeout for fetch file
2014-03-09 11:38:40 -07:00
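
A REPL-style sketch of the change described above, assuming a plain URLConnection fetch and that spark.worker.timeout is in seconds (the URL and default are illustrative):

    import java.net.URL
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
    val timeoutMs = conf.getInt("spark.worker.timeout", 60) * 1000
    val connection = new URL("http://master:8080/files/app.jar").openConnection()
    connection.setConnectTimeout(timeoutMs)  // fail fast instead of relying on JVM defaults
    connection.setReadTimeout(timeoutMs)
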
Aaron Davidson 52834d761b SPARK-929: Fully deprecate usage of SPARK_MEM
(Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615)

This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables:
SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY

The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.

SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.

SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory used by jobs launched through spark-class, without possibly affecting executor memory.

Other memory considerations:
- The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
- run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class).

This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.

Author: Aaron Davidson <aaron@databricks.com>

Closes #99 from aarondav/sparkmem and squashes the following commits:

9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
2014-03-09 11:08:39 -07:00
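
The commit points to the "spark.executor.memory" property as the proper route; a minimal sketch of that configuration (values illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object MemoryConfigSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("memory-config")
          .setMaster("local[2]")
          .set("spark.executor.memory", "2g")  // preferred over the deprecated SPARK_MEM
        val sc = new SparkContext(conf)
        sc.stop()
      }
    }
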
Patrick Wendell e59a3b6c41 SPARK-1190: Do not initialize log4j if slf4j log4j backend is not being used
Author: Patrick Wendell <pwendell@gmail.com>

Closes #107 from pwendell/logging and squashes the following commits:

be21c11 [Patrick Wendell] Logging fix
2014-03-08 16:02:42 -08:00
Reynold Xin c2834ec081 Update junitxml plugin to the latest version to avoid recompilation in every SBT command.
Author: Reynold Xin <rxin@apache.org>

Closes #104 from rxin/junitxml and squashes the following commits:

67ef7bf [Reynold Xin] Update junitxml plugin to the latest version to avoid recompilation in every SBT command.
2014-03-08 12:40:26 -08:00
Cheng Lian 0b7b7fd45c [SPARK-1194] Fix the same-RDD rule for cache replacement
SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194

In the current implementation, when selecting candidate blocks to be swapped out, once we find a block from the same RDD that the block to be stored belongs to, cache eviction fails and aborts. (A simplified sketch of the fix follows this entry.)

In this PR, we keep selecting blocks *not* from the RDD that the block to be stored belongs to until either enough free space can be ensured (cache eviction succeeds) or all such blocks are checked (cache eviction fails).

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #96 from liancheng/fix-spark-1194 and squashes the following commits:

2524ab9 [Cheng Lian] Added regression test case for SPARK-1194
6e40c22 [Cheng Lian] Remove redundant comments
40cdcb2 [Cheng Lian] Bug fix, and addressed PR comments from @mridulm
62c92ac [Cheng Lian] Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194
2014-03-07 23:26:46 -08:00
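
A simplified sketch of the fixed selection rule, not the actual BlockManager code (the block-id shape and types are illustrative):

    def selectBlocksToEvict(
        candidates: Seq[(String, Long)],  // (blockId, size in bytes)
        incomingRddId: Int,
        spaceNeeded: Long): Option[Seq[String]] = {
      var freed = 0L
      val selected = Seq.newBuilder[String]
      for ((id, size) <- candidates if freed < spaceNeeded) {
        // Old behavior: abort on the first same-RDD block.
        // New behavior: skip same-RDD blocks and keep scanning.
        if (!id.startsWith(s"rdd_${incomingRddId}_")) {
          selected += id
          freed += size
        }
      }
      if (freed >= spaceNeeded) Some(selected.result()) else None  // None: eviction fails
    }
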
Reynold Xin 8ad486add9 Allow sbt to use more than 1G of heap.
There was a mistake in the sbt build file (introduced by 012bd5fbc9) in which we set the default to 2048 and then immediately reset it to 1024.

Without this, building Spark can run out of permgen space on my machine.

Author: Reynold Xin <rxin@apache.org>

Closes #103 from rxin/sbt and squashes the following commits:

8829c34 [Reynold Xin] Allow sbt to use more than 1G of heap.
2014-03-07 23:23:59 -08:00
Sandy Ryza a99fb3747a SPARK-1193. Fix indentation in pom.xmls
Author: Sandy Ryza <sandy@cloudera.com>

Closes #91 from sryza/sandy-spark-1193 and squashes the following commits:

a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls
2014-03-07 23:10:35 -08:00
Prashant Sharma 6e730edcde Spark 1165 rdd.intersection in python and java
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:

9b015e9 [Prashant Sharma] Added a note: a shuffle is required for intersection.
1fea813 [Prashant Sharma] correct the lines wrapping
d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
2014-03-07 18:48:07 -08:00
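
A spark-shell sketch of the operation (Scala shown; this PR adds the Python and Java counterparts):

    val a = sc.parallelize(Seq(1, 2, 3, 4))
    val b = sc.parallelize(Seq(3, 4, 5, 6))
    // Per the note above, intersection requires a shuffle; it also de-duplicates.
    a.intersection(b).collect()  // Array(3, 4), in some order
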
Thomas Graves b7cd9e992c SPARK-1195: set map_input_file environment variable in PipedRDD
Hadoop uses the config mapreduce.map.input.file to indicate the input filename to the map when the input split is of type FileSplit. Some of the Hadoop input and output formats set or use this config, and it can also be used by user code.
PipedRDD runs an external process, and the configs aren't available to that process. Hadoop Streaming does something very similar: the way it makes configs available is by exporting them into the environment with '.' replaced by '_'. Spark should also export this variable when launching the pipe command so the user code has access to that config.
Note that mapreduce.map.input.file is the new config name; the old one, which is deprecated but not yet removed, is map.input.file. So we should handle both.

Perhaps it would be better to abstract this out somehow so it goes into the HadoopPartition code?

Author: Thomas Graves <tgraves@apache.org>

Closes #94 from tgravescs/map_input_file and squashes the following commits:

cc97a6a [Thomas Graves] Update test to check for existence of command, add a getPipeEnvVars function to HadoopRDD
e3401dc [Thomas Graves] Merge remote-tracking branch 'upstream/master' into map_input_file
2ba805e [Thomas Graves] set map_input_file environment variable in PipedRDD
2014-03-07 10:36:55 -08:00
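
A spark-shell sketch of what the change enables (the path and script are illustrative; the variable naming follows the '.'-to-'_' rule described above):

    val lines = sc.textFile("hdfs:///data/logs")  // split type FileSplit
    // The external script can now read the split's file name from the environment,
    // e.g. $mapreduce_map_input_file or the deprecated $map_input_file.
    val processed = lines.pipe("./process.sh")
    processed.collect()
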
Aaron Davidson dabeb6f160 SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1
This patch allows the FaultToleranceTest to work in newer versions of Docker.
See https://spark-project.atlassian.net/browse/SPARK-1136 for more details.

Besides changing the Docker and FaultToleranceTest internals, this patch also changes the behavior of Master to accept new Workers which share an address with a Worker that we are currently trying to recover. This can only happen when the Worker itself was restarted and got the same IP address/port at the same time that a Master recovery occurred.

Finally, this adds a good bit of ASCII art to the test to make failures, successes, and actions more apparent. This is very much needed.

Author: Aaron Davidson <aaron@databricks.com>

Closes #5 from aarondav/zookeeper and squashes the following commits:

5d7a72a [Aaron Davidson] SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1
2014-03-07 10:22:27 -08:00
Patrick Wendell 33baf14b04 Small clean-up to flatmap tests
2014-03-06 17:57:31 -08:00
anitatailor 9ae919c02f Example for Cassandra CQL read/write from Spark
Cassandra read/write using CqlPagingInputFormat/CqlOutputFormat.

Author: anitatailor <tailor.anita@gmail.com>

Closes #87 from anitatailor/master and squashes the following commits:

3493f81 [anitatailor] Fixed scala style as per review
19480b7 [anitatailor] Example for Cassandra CQL read/write from Spark
2014-03-06 17:46:43 -08:00
Sandy Ryza 328c73d037 SPARK-1197. Change yarn-standalone to yarn-cluster and fix up running on YARN docs
This patch changes "yarn-standalone" to "yarn-cluster" (but still supports the former).  It also cleans up the Running on YARN docs and adds a section on how to view logs.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #95 from sryza/sandy-spark-1197 and squashes the following commits:

563ef3a [Sandy Ryza] Review feedback
6ad06d4 [Sandy Ryza] Change yarn-standalone to yarn-cluster and fix up running on YARN docs
2014-03-06 17:12:58 -08:00
Thomas Graves 7edbea41b4 SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets
Resubmitted pull request; was https://github.com/apache/incubator-spark/pull/332.

Author: Thomas Graves <tgraves@apache.org>

Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits:

dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser
05eebed [Thomas Graves] Fix dependency lost in upmerge
d1040ec [Thomas Graves] Fix up various imports
05ff5e0 [Thomas Graves] Fix up imports after upmerging to master
ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase
13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config wherever possible. Added ConnectionManagerSuite unit tests.
4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets
2f77147 [Thomas Graves] Rework from comments
50dd9f2 [Thomas Graves] fix header in SecurityManager
ecbfb65 [Thomas Graves] Fix spacing and formatting
b514bec [Thomas Graves] Fix reference to config
ed3d1c1 [Thomas Graves] Add security.md
6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments
2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework
5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils
f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets
2014-03-06 18:27:50 -06:00
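
A configuration sketch based on names in the squash log above; treat the exact keys as assumptions rather than a reference (spark.ui.acls.enable is named in the log; spark.authenticate is assumed from the patch's subject):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.authenticate", "true")    // assumed: enable authentication between Spark processes
      .set("spark.ui.acls.enable", "true")  // renamed from spark.authenticate.ui per the log
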
Kyle Ellrott 40566e10aa SPARK-942: Do not materialize partitions when DISK_ONLY storage level is used
This is a port of a pull request original targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180

Essentially, if a user returns a generative iterator (from a flatMap operation), when trying to persist the data, Spark would first unroll the iterator into an ArrayBuffer and then try to figure out if it could store the data. In cases where the user provided an iterator that generated more data than available memory, this would cause a crash. With this patch, if the user requests a persist with 'StorageLevel.DISK_ONLY', the iterator will be unrolled as it is fed into the serializer.

To do this, two changes were made:
1) The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was modified to connect correctly).
2) The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects. This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) in order to write more compact serialization. If reset is never called, memory eventually fills up; if it is called too often, the serialization streams become much larger because of redundant class descriptions. (An illustrative sketch follows this entry.)

Author: Kyle Ellrott <kellrott@gmail.com>

Closes #50 from kellrott/iterator-to-disk and squashes the following commits:

9ef7cb8 [Kyle Ellrott] Fixing formatting issues.
60e0c57 [Kyle Ellrott] Fixing issues (formatting, variable names, etc.) from review comments
8aa31cd [Kyle Ellrott] Merge ../incubator-spark into iterator-to-disk
33ac390 [Kyle Ellrott] Merge branch 'iterator-to-disk' of github.com:kellrott/incubator-spark into iterator-to-disk
2f684ea [Kyle Ellrott] Refactoring the BlockManager to replace the Either[Either[A,B]] usage. Now using trait 'Values'. Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues.
f70d069 [Kyle Ellrott] Adding docs for spark.serializer.objectStreamReset configuration
7ccc74b [Kyle Ellrott] Moving the 'LargeIteratorSuite' to simply test persistence of iterators. It doesn't try to invoke an OOM error any more
16a4cea [Kyle Ellrott] Streamlined the LargeIteratorSuite unit test. It should now run in ~25 seconds. Confirmed that it still crashes an unpatched copy of Spark.
c2fb430 [Kyle Ellrott] Removing more un-needed array-buffer to iterator conversions
627a8b7 [Kyle Ellrott] Wrapping a few long lines
0f28ec7 [Kyle Ellrott] Adding second putValues to BlockStore interface that accepts an ArrayBuffer (rather than an Iterator). This will allow BlockStores to have slightly different behaviors dependent on whether they get an Iterator or ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator into an ArrayBuffer, but if handed an ArrayBuffer, it can skip the duplication.
656c33e [Kyle Ellrott] Fixing the JavaSerializer to read from the SparkConf rather than the System property.
8644ee8 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
00c98e0 [Kyle Ellrott] Making the Java ObjectStreamSerializer reset rate configurable by the system variable 'spark.serializer.objectStreamReset', default is now 10000.
40fe1d7 [Kyle Ellrott] Removing rogue space
31fe08e [Kyle Ellrott] Removing un-needed semi-colons
9df0276 [Kyle Ellrott] Added check to make sure that streamed-to-disk RDD actually returns good data in the LargeIteratorSuite
a6424ba [Kyle Ellrott] Wrapping long line
2eeda75 [Kyle Ellrott] Fixing dumb mistake ("||" instead of "&&")
0e6f808 [Kyle Ellrott] Deleting temp output directory when done
95c7f67 [Kyle Ellrott] Simplifying StorageLevel checks
56f71cd [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
44ec35a [Kyle Ellrott] Adding some comments.
5eb2b7e [Kyle Ellrott] Changing the JavaSerializer reset to occur every 1000 objects.
f403826 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
81d670c [Kyle Ellrott] Adding unit test for straight to disk iterator methods.
d32992f [Kyle Ellrott] Merge remote-tracking branch 'origin/master' into iterator-to-disk
cac1fad [Kyle Ellrott] Fixing MemoryStore, so that it converts incoming iterators to ArrayBuffer objects. This was previously done higher up the stack.
efe1102 [Kyle Ellrott] Changing CacheManager and BlockManager to pass iterators directly to the serializer when a 'DISK_ONLY' persist is called. This is in response to SPARK-942.
2014-03-06 14:51:19 -08:00
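
An illustrative REPL-style sketch of the reset trick from change (2), outside Spark's serializer classes:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    val resetInterval = 10000  // cf. spark.serializer.objectStreamReset
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    var written = 0L
    def writeOne(obj: AnyRef): Unit = {
      out.writeObject(obj)
      written += 1
      // reset() drops the stream's back-reference cache: never resetting pins
      // objects in memory; resetting every object bloats the output with
      // redundant class descriptions.
      if (written % resetInterval == 0) out.reset()
    }
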
Prabin Banka 3d3acef047 SPARK-1187, Added missing Python APIs
The following Python APIs are added:
RDD.id()
SparkContext.setJobGroup()
SparkContext.setLocalProperty()
SparkContext.getLocalProperty()
SparkContext.sparkUser()

This was raised earlier as part of apache/incubator-spark#486.

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #75 from prabinb/python-api-backup and squashes the following commits:

cc3c6cd [Prabin Banka] Added missing Python APIs
2014-03-06 12:45:27 -08:00
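
A spark-shell sketch of the Scala counterparts of the newly exposed Python APIs (the group names and property key are illustrative):

    sc.setJobGroup("nightly-etl", "nightly load jobs")       // tag jobs started from this thread
    sc.setLocalProperty("spark.scheduler.pool", "low-prio")  // thread-local property
    sc.getLocalProperty("spark.scheduler.pool")              // -> "low-prio"
    sc.sparkUser                                             // user running the SparkContext
    sc.parallelize(1 to 10).id                               // unique id of this RDD
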
CodingCat 3eb009f362 SPARK-1156: allow user to login into a cluster without slaves
Reported in https://spark-project.atlassian.net/browse/SPARK-1156

The current spark-ec2 script doesn't allow the user to log in to a cluster without slaves. One of the issues caused by this behaviour is that when all the workers have died, the user cannot even log in to the cluster for debugging, etc.

Author: CodingCat <zhunansjtu@gmail.com>

Closes #58 from CodingCat/SPARK-1156 and squashes the following commits:

104af07 [CodingCat] output ERROR to stderr
9a71769 [CodingCat] do not allow user to start 0-slave cluster
24a7c79 [CodingCat] allow user to login into a cluster without slaves
2014-03-05 21:47:34 -08:00
Mark Grover cda381f88c SPARK-1184: Update the distribution tar.gz to include spark-assembly jar
See JIRA for details.

Author: Mark Grover <mark@apache.org>

Closes #78 from markgrover/SPARK-1184 and squashes the following commits:

12b78e6 [Mark Grover] SPARK-1184: Update the distribution tar.gz to include spark-assembly jar
2014-03-05 16:52:58 -08:00
liguoqiang 51ca7bd703 Improve building with maven docs
    mvn -Dhadoop.version=... -Dsuites=spark.repl.ReplSuite test

to

    mvn -Dhadoop.version=... -Dsuites=org.apache.spark.repl.ReplSuite test

Author: liguoqiang <liguoqiang@rd.tuan800.com>

Closes #70 from witgo/building_with_maven and squashes the following commits:

6ec8a54 [liguoqiang] spark.repl.ReplSuite to org.apache.spark.repl.ReplSuite
2014-03-05 16:38:43 -08:00
CodingCat a3da508819 SPARK-1171: when executor is removed, we should minus totalCores instead of just freeCores on that executor
https://spark-project.atlassian.net/browse/SPARK-1171

When an executor is removed, the current implementation subtracts only the freeCores of that executor. Actually, we should subtract the totalCores... (A toy sketch of the accounting follows this entry.)

Author: CodingCat <zhunansjtu@gmail.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>

Closes #63 from CodingCat/simplify_CoarseGrainedSchedulerBackend and squashes the following commits:

f6bf93f [Nan Zhu] code clean
19c2bb4 [CodingCat] use copy idiom to reconstruct the workerOffers
43c13e9 [CodingCat] keep WorkerOffer immutable
af470d3 [CodingCat] style fix
0c0e409 [CodingCat] simplify the implementation of CoarseGrainedSchedulerBackend
2014-03-05 14:00:28 -08:00
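
A toy sketch of the accounting fix (the maps are illustrative, not the scheduler's actual fields):

    import scala.collection.mutable

    var clusterCores = 8
    val totalCoresByExec = mutable.Map("exec-1" -> 8)
    val freeCoresByExec  = mutable.Map("exec-1" -> 3)  // 5 cores busy running tasks

    def removeExecutor(id: String): Unit = {
      // The buggy version subtracted only freeCoresByExec(id), permanently
      // leaking the busy cores from the cluster-wide total.
      clusterCores -= totalCoresByExec(id)
      totalCoresByExec -= id
      freeCoresByExec -= id
    }
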
Prashant Sharma 02836657cf SPARK-1109 wrong API docs for pyspark map function
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:

1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function
2014-03-04 15:32:43 -08:00
CodingCat 1865dd681b SPARK-1178: missing document of spark.scheduler.revive.interval
https://spark-project.atlassian.net/browse/SPARK-1178

The configuration spark.scheduler.revive.interval is undocumented but actually used:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L64

Author: CodingCat <zhunansjtu@gmail.com>

Closes #74 from CodingCat/SPARK-1178 and squashes the following commits:

783ec69 [CodingCat] missing document of spark.scheduler.revive.interval
2014-03-04 10:28:17 -08:00
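
Set explicitly, the knob looks like this; the value shown is an assumption (milliseconds, matching the default in the linked line):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.scheduler.revive.interval", "1000")  // interval between scheduler revive messages
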
Prashant Sharma 2d8e0a062c SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #72 from ScrapCodes/SPARK-1164/deprecate-reducebykeytodriver and squashes the following commits:

ee521cd [Prashant Sharma] SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally
2014-03-04 10:27:02 -08:00
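
A spark-shell sketch of the surviving API: it merges the values for each key and returns a Map to the driver rather than an RDD.

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val local: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
    // Map(a -> 4, b -> 2)
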
Prashant Sharma 181ec50307 [java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits:

95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch.
85a954e [Prashant Sharma] Nit. import orderings.
673f7ac [Prashant Sharma] Added support for -java-home as well
80a13e8 [Prashant Sharma] Used fake class tag syntax
26eb3f6 [Prashant Sharma] Patrick's comments on PR.
35d8d79 [Prashant Sharma] Specified java 8 building in the docs
31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag.
4ab87d3 [Prashant Sharma] Review feedback on the pr
c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.
2014-03-03 22:31:30 -08:00
Kay Ousterhout b14ede789a Remove broken/unused Connection.getChunkFIFO method.
This method appears to be broken -- since it never removes
anything from messages, and it adds new messages to it,
the while loop is an infinite loop.  The method also does not appear
to have ever been used since the code was added in 2012, so
this commit removes it.

cc @mateiz who originally added this method in case there's a reason it should be here! (63051dd2bc)

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #69 from kayousterhout/remove_get_fifo and squashes the following commits:

053bc59 [Kay Ousterhout] Remove broken/unused Connection.getChunkFIFO method.
2014-03-03 21:27:18 -08:00
Reynold Xin f5ae38af87 SPARK-1158: Fix flaky RateLimitedOutputStreamSuite.
There was actually a problem with the RateLimitedOutputStream implementation where the first second doesn't write anything because of integer rounding.

So RateLimitedOutputStream was overly aggressive in throttling.

Author: Reynold Xin <rxin@apache.org>

Closes #55 from rxin/ratelimitest and squashes the following commits:

52ce1b7 [Reynold Xin] SPARK-1158: Fix flaky RateLimitedOutputStreamSuite.
2014-03-03 21:24:19 -08:00
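
The arithmetic behind the fix, as a toy sketch (numbers illustrative): with integer division by 1000, the byte allowance stays 0 for the whole first second.

    val bytesPerSec   = 10000
    val elapsedMillis = 400L
    val truncated  = bytesPerSec * (elapsedMillis / 1000)  // 0: 400/1000 truncates to 0
    val multiplied = bytesPerSec * elapsedMillis / 1000    // 4000: multiply before dividing
    println(s"$truncated vs $multiplied")
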
Bryn Keller 923dba5096 Added a unit test for PairRDDFunctions.lookup
Lookup didn't have a unit test. Added two tests: one with a partitioner and one without.

Author: Bryn Keller <bryn.keller@intel.com>

Closes #36 from xoltar/lookup and squashes the following commits:

3bc0d44 [Bryn Keller] Added a unit test for PairRDDFunctions.lookup
2014-03-03 16:38:57 -08:00
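
The shape of such a test, as a spark-shell sketch (the partitioned case would first apply partitionBy):

    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
    assert(pairs.lookup(1).sorted == Seq("a", "c"))  // all values for key 1
    assert(pairs.lookup(3).isEmpty)
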
Kay Ousterhout b55cade853 Remove the remoteFetchTime metric.
This metric is confusing: it adds up all of the time to fetch
shuffle inputs, but fetches often happen in parallel, so
remoteFetchTime can be much longer than the task execution time.

@squito it looks like you added this metric -- do you have a use case for it?

cc @shivaram -- I know you've looked at the shuffle performance a lot so chime in here if this metric has turned out to be useful for you!

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #62 from kayousterhout/remove_fetch_variable and squashes the following commits:

43341eb [Kay Ousterhout] Remove the remoteFetchTime metric.
2014-03-03 16:12:00 -08:00
Chen Chao 9d225a9104 update proportion of memory
The default value of "spark.storage.memoryFraction" has been changed from 0.66 to 0.6, so 60% of the memory is used for caching while 40% is used for task execution.

Author: Chen Chao <crazyjvm@gmail.com>

Closes #66 from CrazyJvm/master and squashes the following commits:

0f84d86 [Chen Chao] update proportion of memory
2014-03-03 14:41:25 -08:00
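
Set explicitly, the documented split looks like this (a sketch):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.storage.memoryFraction", "0.6")  // 60% block cache, leaving 40% for execution
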
Kay Ousterhout 369aad6f9e Removed accidentally checked in comment
It looks like this comment was added a while ago by @mridulm as part of a merge and was accidentally checked in.  We should remove it.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #61 from kayousterhout/remove_comment and squashes the following commits:

0b2b3f2 [Kay Ousterhout] Removed accidentally checked in comment
2014-03-03 14:39:49 -08:00
Aaron Kimball f65c1f38eb SPARK-1173. (#2) Fix typo in Java streaming example.
Companion commit to pull request #64, fix the typo on the Java side of the docs.

Author: Aaron Kimball <aaron@magnify.io>

Closes #65 from kimballa/spark-1173-java-doc-update and squashes the following commits:

8ce11d3 [Aaron Kimball] SPARK-1173. (#2) Fix typo in Java streaming example.
2014-03-02 23:48:48 -08:00
Aaron Kimball 2b53447f32 SPARK-1173. Improve scala streaming docs.
Clarify imports to add implicit conversions to DStream and
fix other small typos in the streaming intro documentation.

Tested by inspecting output via a local Jekyll server and copy-and-pasting the Scala commands into a Spark shell.

Author: Aaron Kimball <aaron@magnify.io>

Closes #64 from kimballa/spark-1173-streaming-docs and squashes the following commits:

6fbff0e [Aaron Kimball] SPARK-1173. Improve scala streaming docs.
2014-03-02 23:26:47 -08:00
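
The imports the fix clarifies, per the streaming guide of this era (a sketch):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    // Brings in the implicit conversions that add pair operations
    // (e.g. reduceByKey, updateStateByKey) to DStreams of tuples.
    import org.apache.spark.streaming.StreamingContext._
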
Patrick Wendell 55a4f11b50 Add Jekyll tag to isolate "production-only" doc components.
Author: Patrick Wendell <pwendell@gmail.com>

Closes #56 from pwendell/jekyll-prod and squashes the following commits:

1bdc3a8 [Patrick Wendell] Add Jekyll tag to isolate "production-only" doc components.
2014-03-02 18:19:01 -08:00
Patrick Wendell c3f5e07533 SPARK-1121: Include avro for yarn-alpha builds
This lets us explicitly include Avro based on a profile for 0.23.X
builds. It makes me sad how convoluted it is to express this logic
in Maven. @tgraves and @sryza curious if this works for you.

I'm also considering just reverting to how it was before. The only
real problem was that Spark advertised a dependency on Avro
even though it only really depends transitively on Avro through
other deps.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #49 from pwendell/avro-build-fix and squashes the following commits:

8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile
2014-03-02 15:18:19 -08:00
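
Selecting the profile would look like this (the hadoop.version value is illustrative):

    mvn -Pyarn-alpha -Dhadoop.version=0.23.9 -DskipTests package
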
Sean Owen fd31adbf27 SPARK-1084.2 (resubmitted)
(Ported from https://github.com/apache/incubator-spark/pull/650 )

This adds one more change, though, to fix the Scala version warning recently introduced by json4s.

Author: Sean Owen <sowen@cloudera.com>

Closes #32 from srowen/SPARK-1084.2 and squashes the following commits:

9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency
1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway
2014-03-02 14:27:53 -08:00
Reynold Xin 353ac6b4fa Ignore RateLimitedOutputStreamSuite for now.
This test has been flaky. We can re-enable it after @tdas has a chance to look at it.

Author: Reynold Xin <rxin@apache.org>

Closes #54 from rxin/ratelimit and squashes the following commits:

1a12198 [Reynold Xin] Ignore RateLimitedOutputStreamSuite for now.
2014-03-02 14:27:19 -08:00
Aaron Davidson 46bcb9551e SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID
Previously, ZooKeeperPersistenceEngine would crash the whole Master process if
there was stored data from a prior Spark version. Now, we just delete these files.

Author: Aaron Davidson <aaron@databricks.com>

Closes #4 from aarondav/zookeeper2 and squashes the following commits:

fa8b40f [Aaron Davidson] SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID
2014-03-02 01:00:42 -08:00
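
A hedged sketch of the fail-soft pattern described (helper names are hypothetical, not the PR's code):

    import java.io.{ByteArrayInputStream, InvalidClassException, ObjectInputStream}

    def readOrDiscard[T](bytes: Array[Byte], deleteFile: () => Unit): Option[T] =
      try {
        val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
        Some(in.readObject().asInstanceOf[T])
      } catch {
        case _: InvalidClassException =>  // e.g. serialVersionUID mismatch from a prior Spark
          deleteFile()                    // drop the stale file instead of crashing the Master
          None
      }
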
Patrick Wendell 1fd2bfd3dd Remove remaining references to incubation
This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #51 from pwendell/tlp and squashes the following commits:

d553b1b [Patrick Wendell] Remove remaining references to incubation
2014-03-02 01:00:16 -08:00
Binh Nguyen b70823c91d Update io.netty from 4.0.13.Final to 4.0.17.Final
This update contains a lot of bug fixes and some new performance improvements.
It is also binary compatible with the current 4.0.13.Final.

For more information: http://netty.io/news/2014/02/25/4-0-17-Final.html

Author: Binh Nguyen <ngbinh@gmail.com>

Closes #41 from ngbinh/master and squashes the following commits:

a9498f4 [Binh Nguyen] update io.netty to 4.0.17.Final
2014-03-02 00:48:50 -08:00
Michael Armbrust 012bd5fbc9 Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
This allows developers to pass options (such as -D) to sbt. I also modified the SparkBuild to ensure Spark-specific properties are propagated to forked test JVMs.

Author: Michael Armbrust <michael@databricks.com>

Closes #14 from marmbrus/sbtScripts and squashes the following commits:

c008b18 [Michael Armbrust] Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
2014-03-02 00:35:23 -08:00
DB Tsai 6fc76e49c1 Initialized the regVal for first iteration in SGD optimizer
Ported from https://github.com/apache/incubator-spark/pull/633

In runMiniBatchSGD, the regVal (for the first iteration) should be initialized
as the sum of squares of the weights if it's an L2 update; for an L1 update, the same logic is followed.

It may not be important here for SGD since the updater doesn't take the loss
as a parameter to find the new weights. But it will give us the correct history of loss.
However, for the LBFGS optimizer we implemented, the correct loss with regVal is crucial to
find the new weights. (A sketch of the two regVal formulas follows this entry.)

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #40 from dbtsai/dbtsai-smallRegValFix and squashes the following commits:

77d47da [DB Tsai] In runMiniBatchSGD, the regVal (for the first iteration) should be initialized as the sum of squares of the weights if it's an L2 update; for an L1 update, the same logic is followed.
2014-03-02 00:31:59 -08:00
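
The two first-iteration regVal formulas, sketched under the standard regularized-objective convention (the λ symbol and the 1/2 factor are the usual textbook choices, assumed here):

    val lambda = 0.1  // regularization parameter (illustrative value)
    def l2RegVal(weights: Array[Double]): Double =
      0.5 * lambda * weights.map(w => w * w).sum  // (λ/2)·‖w‖²  for the L2 update
    def l1RegVal(weights: Array[Double]): Double =
      lambda * weights.map(math.abs).sum          // λ·‖w‖₁  for the L1 update
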
CodingCat 3a8b698e96 [SPARK-1100] prevent Spark from overwriting directory silently
Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100).

The current saveAsTextFile/saveAsSequenceFile will silently overwrite the output directory if it already exists. This behaviour is not desirable because:

overwriting the data silently is not user-friendly

if the partition number changes between two write operations, the output directory will contain the results generated by both runs

My fix includes:

new APIs with a flag for users to declare whether they want to overwrite the directory: if the flag is set to true, the output directory is deleted first and then the new data is written, so the output directory cannot contain results from multiple runs (a hypothetical sketch follows this entry);

if the flag is set to false, Spark will throw an exception if the output directory already exists

changed the Java API part

default behaviour is overwriting

Two questions:

should we deprecate the old APIs without such a flag?

I noticed that Spark Streaming also calls these APIs; I assume we don't need to change the related part in streaming? @tdas

Author: CodingCat <zhunansjtu@gmail.com>

Closes #11 from CodingCat/SPARK-1100 and squashes the following commits:

6a4e3a3 [CodingCat] code clean
ef2d43f [CodingCat] add new test cases and code clean
ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat
ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory
2014-03-01 17:27:54 -08:00
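
A hypothetical sketch of the flagged behavior (not the PR's actual signatures), using the Hadoop FileSystem API:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    def prepareOutputDir(path: String, overwrite: Boolean): Unit = {
      val fs = FileSystem.get(new Configuration())
      val out = new Path(path)
      if (fs.exists(out)) {
        if (overwrite) fs.delete(out, true)  // recursive delete, then write fresh results
        else throw new IllegalStateException(s"Output directory $path already exists")
      }
    }
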
CodingCat fe195ae113 [SPARK-1150] fix repo location in create script (re-open)
Reopened for https://spark-project.atlassian.net/browse/SPARK-1150

Author: CodingCat <zhunansjtu@gmail.com>

Closes #52 from CodingCat/script_fixes and squashes the following commits:

fc05a71 [CodingCat] fix repo location in create script
2014-03-01 17:24:53 -08:00
Patrick Wendell ec992e1822 Revert "[SPARK-1150] fix repo location in create script"
This reverts commit 9aa0957118.
2014-03-01 17:15:38 -08:00
Mark Grover 9aa0957118 [SPARK-1150] fix repo location in create script
https://spark-project.atlassian.net/browse/SPARK-1150

Fix the repo location in the create_release script.

Author: Mark Grover <mark@apache.org>

Closes #48 from CodingCat/script_fixes and squashes the following commits:

01f4bf7 [Mark Grover] Fixing some nitpicks
d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY
2014-03-01 16:21:22 -08:00
Kay Ousterhout 556c56689b [SPARK-979] Randomize order of offers.
This commit randomizes the order of resource offers to avoid scheduling
all tasks on the same small set of machines.

This is a much simpler solution to SPARK-979 than #7.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #27 from kayousterhout/randomize and squashes the following commits:

435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.
2014-03-01 11:24:22 -08:00
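
A toy sketch of the randomization (WorkerOffer is redefined here for illustration; compare the immutable WorkerOffer work in the SPARK-1171 entry above):

    import scala.util.Random

    case class WorkerOffer(executorId: String, host: String, cores: Int)

    val offers = Seq(
      WorkerOffer("exec-1", "host-a", 4),
      WorkerOffer("exec-2", "host-b", 4),
      WorkerOffer("exec-3", "host-c", 4))
    // Shuffle before first-fit assignment so tasks don't always land on
    // the same small set of machines.
    val shuffled = Random.shuffle(offers)
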