ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Thomas Graves	6112270c94	SPARK-1203 fix saving to hdfs from yarn Author: Thomas Graves <tgraves@apache.org> Closes #173 from tgravescs/SPARK-1203 and squashes the following commits: 4fd5ded [Thomas Graves] adding import 964e3f7 [Thomas Graves] SPARK-1203 fix saving to hdfs from yarn	2014-03-19 08:09:20 -05:00
shiyun.wxm	d55ec86de2	bugfix: Wrong "Duration" in "Active Stages" in stages page If a stage which has completed once loss parts of data, it will be resubmitted. At this time, it appears that stage.completionTime > stage.submissionTime. Author: shiyun.wxm <shiyun.wxm@taobao.com> Closes #170 from BlackNiuza/duration_problem and squashes the following commits: a86d261 [shiyun.wxm] tow space indent c0d7b24 [shiyun.wxm] change the style 3b072e1 [shiyun.wxm] fix scala style f20701e [shiyun.wxm] bugfix: "Duration" in "Active Stages" in stages page	2014-03-19 01:42:34 -07:00
witgo	cc2655a237	Fix SPARK-1256: Master web UI and Worker web UI returns a 404 error Author: witgo <witgo@qq.com> Closes #150 from witgo/SPARK-1256 and squashes the following commits: 08044a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1256 c99b030 [witgo] Fix SPARK-1256	2014-03-18 21:57:47 -07:00
CodingCat	2fa26ec02f	SPARK-1102: Create a saveAsNewAPIHadoopDataset method https://spark-project.atlassian.net/browse/SPARK-1102 Create a saveAsNewAPIHadoopDataset method By @mateiz: "Right now RDDs can only be saved as files using the new Hadoop API, not as "datasets" with no filename and just a JobConf. See http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ for an example of how you have to give a bogus filename. For the old Hadoop API, we have saveAsHadoopDataset." Author: CodingCat <zhunansjtu@gmail.com> Closes #12 from CodingCat/SPARK-1102 and squashes the following commits: 6ba0c83 [CodingCat] add test cases for saveAsHadoopDataSet (new&old API) a8d11ba [CodingCat] style fix......... 95a6929 [CodingCat] code clean 7643c88 [CodingCat] change the parameter type back to Configuration a8583ee [CodingCat] Create a saveAsNewAPIHadoopDataset method	2014-03-18 11:06:18 -07:00
Patrick Wendell	e7423d4040	Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225." This reverts commit `ca4bf8c572`. Jetty 9 requires JDK7 which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the mean time to ease some pain let's revert this. Sorry for not catching this during the initial review. cc/ @rxin Author: Patrick Wendell <pwendell@gmail.com> Closes #167 from pwendell/jetty-revert and squashes the following commits: 811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."	2014-03-18 00:46:03 -07:00
Dan McClary	e3681f26fa	Spark 1246 add min max to stat counter Here's the addition of min and max to statscounter.py and min and max methods to rdd.py. Author: Dan McClary <dan.mcclary@gmail.com> Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits: fd3fd4b [Dan McClary] fixed error, updated test 82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter 5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark 21dd366 [Dan McClary] added max and min to StatCounter output, updated doc 1a97558 [Dan McClary] added max and min to StatCounter output, updated doc a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py 1e7056d [Dan McClary] added underscore to getBucket 37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived 29981f2 [Dan McClary] fixed indentation on doctest comment eaf89d9 [Dan McClary] added correct doctest for histogram 4916016 [Dan McClary] added histogram method, added max and min to statscounter	2014-03-18 00:45:47 -07:00
Patrick Wendell	796977acdb	SPARK-1244: Throw exception if map output status exceeds frame size This is a very small change on top of @andrewor14's patch in #147. Author: Patrick Wendell <pwendell@gmail.com> Author: Andrew Or <andrewor14@gmail.com> Closes #152 from pwendell/akka-frame and squashes the following commits: e5fb3ff [Patrick Wendell] Reversing test order 393af4c [Patrick Wendell] Small improvement suggested by Andrew Or 8045103 [Patrick Wendell] Breaking out into two tests 2b4e085 [Patrick Wendell] Consolidate Executor use of akka frame size c9b6109 [Andrew Or] Simplify test + make access to akka frame size more modular 281d7c9 [Andrew Or] Throw exception on spark.akka.frameSize exceeded + Unit tests	2014-03-17 14:03:32 -07:00
CodingCat	dc9654638f	SPARK-1240: handle the case of empty RDD when takeSample https://spark-project.atlassian.net/browse/SPARK-1240 It seems that the current implementation does not handle the empty RDD case when run takeSample In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value In the test case, I also add several lines for this case Author: CodingCat <zhunansjtu@gmail.com> Closes #135 from CodingCat/SPARK-1240 and squashes the following commits: fef57d4 [CodingCat] fix the same problem in PySpark 36db06b [CodingCat] create new test cases for takeSample from an empty red 810948d [CodingCat] further fix a40e8fb [CodingCat] replace if with require ad483fd [CodingCat] handle the case with empty RDD when take sample	2014-03-16 22:14:59 -07:00
Reynold Xin	f5486e9f75	SPARK-1255: Allow user to pass Serializer object instead of class name for shuffle. This is more general than simply passing a string name and leaves more room for performance optimizations. Note that this is technically an API breaking change in the following two ways: 1. The shuffle serializer specification in ShuffleDependency now require an object instead of a String (of the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark. 2. Serializer's in Spark from now on are required to be serializable. Author: Reynold Xin <rxin@apache.org> Closes #149 from rxin/serializer and squashes the following commits: 5acaccd [Reynold Xin] Properly call serializer's constructors. 2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency. 7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.	2014-03-16 09:57:21 -07:00
Michael Armbrust	e19044cb10	Fix serialization of MutablePair. Also provide an interface for easy updating. Author: Michael Armbrust <michael@databricks.com> Closes #141 from marmbrus/mutablePair and squashes the following commits: f5c4783 [Michael Armbrust] Change function name to update 8bfd973 [Michael Armbrust] Fix serialization of MutablePair. Also provide an interface for easy updating.	2014-03-14 11:40:26 -07:00
Reynold Xin	ca4bf8c572	SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225. Author: Reynold Xin <rxin@apache.org> Closes #113 from rxin/jetty9 and squashes the following commits: 867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file. d7c97ca [Reynold Xin] Return the correctly bound port. d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.	2014-03-13 12:16:04 -07:00
Patrick Wendell	4ea23db0ef	SPARK-1019: pyspark RDD take() throws an NPE Author: Patrick Wendell <pwendell@gmail.com> Closes #112 from pwendell/pyspark-take and squashes the following commits: daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE	2014-03-12 23:16:59 -07:00
CodingCat	6bd2eaa4a5	hot fix for PR105 - change to Java annotation Author: CodingCat <zhunansjtu@gmail.com> Closes #133 from CodingCat/SPARK-1160-2 and squashes the following commits: 6607155 [CodingCat] hot fix for PR105 - change to Java annotation	2014-03-12 19:49:18 -07:00
CodingCat	9032f7c0d5	SPARK-1160: Deprecate toArray in RDD https://spark-project.atlassian.net/browse/SPARK-1160 reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python." In this patch, I deprecated the method and changed the source files using it by replacing toArray with collect() directly Author: CodingCat <zhunansjtu@gmail.com> Closes #105 from CodingCat/SPARK-1060 and squashes the following commits: 286f163 [CodingCat] deprecate in JavaRDDLike ee17b4e [CodingCat] add message and since 2ff7319 [CodingCat] deprecate toArray in RDD	2014-03-12 17:43:12 -07:00
liguoqiang	5d1ec64e79	Fix #SPARK-1149 Bad partitioners can cause Spark to hang Author: liguoqiang <liguoqiang@rd.tuan800.com> Closes #44 from witgo/SPARK-1149 and squashes the following commits: 3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149 8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149 3dad595 [liguoqiang] review comment e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149 b0d5c07 [liguoqiang] review comment d0a6005 [liguoqiang] review comment 3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149 ac006a3 [liguoqiang] code Formatting 3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149 adc443e [liguoqiang] partitions check bugfix 928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149 1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149 3348619 [liguoqiang] Optimize performance for partitions check 61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149 e68210a [liguoqiang] add partition index check to submitJob 3a65903 [liguoqiang] make the code more readable 6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang	2014-03-12 13:00:04 -07:00
Thomas Graves	c8c59b326e	[SPARK-1232] Fix the hadoop 0.23 yarn build Author: Thomas Graves <tgraves@apache.org> Closes #127 from tgravescs/SPARK-1232 and squashes the following commits: c05cfd4 [Thomas Graves] Fix the hadoop 0.23 yarn build	2014-03-12 10:32:01 -07:00
Patrick Wendell	16788a6542	SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues... This patch removes Ganglia integration from the default build. It allows users willing to link against LGPL code to use Ganglia by adding build flags or linking against a new Spark artifact called spark-ganglia-lgpl. This brings Spark in line with the Apache policy on LGPL code enumerated here: https://www.apache.org/legal/3party.html#options-optional Author: Patrick Wendell <pwendell@gmail.com> Closes #108 from pwendell/ganglia and squashes the following commits: 326712a [Patrick Wendell] Responding to review feedback 5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.	2014-03-11 11:16:59 -07:00
Patrick Wendell	2a5161708f	SPARK-1205: Clean up callSite/origin/generator. This patch removes the `generator` field and simplifies + documents the tracking of callsites. There are two places where we care about call sites, when a job is run and when an RDD is created. This patch retains both of those features but does a slight refactoring and renaming to make things less confusing. There was another feature of an rdd called the `generator` which was by default the user class that in which the RDD was created. This is used exclusively in the JobLogger. It been subsumed by the ability to name a job group. The job logger can later be refectored to read the job group directly (will require some work) but for now this just preserves the default logged value of the user class. I'm not sure any users ever used the ability to override this. Author: Patrick Wendell <pwendell@gmail.com> Closes #106 from pwendell/callsite and squashes the following commits: fc1d009 [Patrick Wendell] Compile fix e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite 62e77ef [Patrick Wendell] Review feedback 576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.	2014-03-10 16:28:41 -07:00
Patrick Wendell	b9be160951	SPARK-782 Clean up for ASM dependency. This makes two changes. 1) Spark uses the shaded version of asm that is (conveniently) published with Kryo. 2) Existing exclude rules around asm are updated to reflect the new groupId of `org.ow2.asm`. This made all of the old rules not work with newer Hadoop versions that pull in new asm versions. Author: Patrick Wendell <pwendell@gmail.com> Closes #100 from pwendell/asm and squashes the following commits: 9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.	2014-03-09 13:17:07 -07:00
Jiacheng Guo	f6f9d02e85	Add timeout for fetch file Currently, when fetch a file, the connection's connect timeout and read timeout is based on the default jvm setting, in this change, I change it to use spark.worker.timeout. This can be usefull, when the connection status between worker is not perfect. And prevent prematurely remove task set. Author: Jiacheng Guo <guojc03@gmail.com> Closes #98 from guojc/master and squashes the following commits: abfe698 [Jiacheng Guo] add space according request 2a37c34 [Jiacheng Guo] Add timeout for fetch file Currently, when fetch a file, the connection's connect timeout and read timeout is based on the default jvm setting, in this change, I change it to use spark.worker.timeout. This can be usefull, when the connection status between worker is not perfect. And prevent prematurely remove task set.	2014-03-09 11:38:40 -07:00
Aaron Davidson	52834d761b	SPARK-929: Fully deprecate usage of SPARK_MEM (Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615) This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables: SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public. SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property. SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory run by jobs launched by spark-class, without possibly affecting executor memory. Other memory considerations: - The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY. - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overriden in all cases by spark-class). This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however. Author: Aaron Davidson <aaron@databricks.com> Closes #99 from aarondav/sparkmem and squashes the following commits: 9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM	2014-03-09 11:08:39 -07:00
Patrick Wendell	e59a3b6c41	SPARK-1190: Do not initialize log4j if slf4j log4j backend is not being used Author: Patrick Wendell <pwendell@gmail.com> Closes #107 from pwendell/logging and squashes the following commits: be21c11 [Patrick Wendell] Logging fix	2014-03-08 16:02:42 -08:00
Cheng Lian	0b7b7fd45c	[SPARK-1194] Fix the same-RDD rule for cache replacement SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194 In the current implementation, when selecting candidate blocks to be swapped out, once we find a block from the same RDD that the block to be stored belongs to, cache eviction fails and aborts. In this PR, we keep selecting blocks not from the RDD that the block to be stored belongs to until either enough free space can be ensured (cache eviction succeeds) or all such blocks are checked (cache eviction fails). Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #96 from liancheng/fix-spark-1194 and squashes the following commits: 2524ab9 [Cheng Lian] Added regression test case for SPARK-1194 6e40c22 [Cheng Lian] Remove redundant comments 40cdcb2 [Cheng Lian] Bug fix, and addressed PR comments from @mridulm 62c92ac [Cheng Lian] Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194	2014-03-07 23:26:46 -08:00
Sandy Ryza	a99fb3747a	SPARK-1193. Fix indentation in pom.xmls Author: Sandy Ryza <sandy@cloudera.com> Closes #91 from sryza/sandy-spark-1193 and squashes the following commits: a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls	2014-03-07 23:10:35 -08:00
Prashant Sharma	6e730edcde	Spark 1165 rdd.intersection in python and java Author: Prashant Sharma <prashant.s@imaginea.com> Author: Prashant Sharma <scrapcodes@gmail.com> Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits: 9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection. 1fea813 [Prashant Sharma] correct the lines wrapping d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.	2014-03-07 18:48:07 -08:00
Thomas Graves	b7cd9e992c	SPARK-1195: set map_input_file environment variable in PipedRDD Hadoop uses the config mapreduce.map.input.file to indicate the input filename to the map when the input split is of type FileSplit. Some of the hadoop input and output formats set or use this config. This config can also be used by user code. PipedRDD runs an external process and the configs aren't available to that process. Hadoop Streaming does something very similar and the way they make configs available is exporting them into the environment replacing '.' with '_'. Spark should also export this variable when launching the pipe command so the user code has access to that config. Note that the config mapreduce.map.input.file is the new one, the old one which is deprecated but not yet removed is map.input.file. So we should handle both. Perhaps it would be better to abstract this out somehow so it goes into the HadoopParition code? Author: Thomas Graves <tgraves@apache.org> Closes #94 from tgravescs/map_input_file and squashes the following commits: cc97a6a [Thomas Graves] Update test to check for existence of command, add a getPipeEnvVars function to HadoopRDD e3401dc [Thomas Graves] Merge remote-tracking branch 'upstream/master' into map_input_file 2ba805e [Thomas Graves] set map_input_file environment variable in PipedRDD	2014-03-07 10:36:55 -08:00
Aaron Davidson	dabeb6f160	SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1 This patch allows the FaultToleranceTest to work in newer versions of Docker. See https://spark-project.atlassian.net/browse/SPARK-1136 for more details. Besides changing the Docker and FaultToleranceTest internals, this patch also changes the behavior of Master to accept new Workers which share an address with a Worker that we are currently trying to recover. This can only happen when the Worker itself was restarted and got the same IP address/port at the same time as a Master recovery occurs. Finally, this adds a good bit of ASCII art to the test to make failures, successes, and actions more apparent. This is very much needed. Author: Aaron Davidson <aaron@databricks.com> Closes #5 from aarondav/zookeeper and squashes the following commits: 5d7a72a [Aaron Davidson] SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1	2014-03-07 10:22:27 -08:00
Patrick Wendell	33baf14b04	Small clean-up to flatmap tests	2014-03-06 17:57:31 -08:00
Sandy Ryza	328c73d037	SPARK-1197. Change yarn-standalone to yarn-cluster and fix up running on YARN docs This patch changes "yarn-standalone" to "yarn-cluster" (but still supports the former). It also cleans up the Running on YARN docs and adds a section on how to view logs. Author: Sandy Ryza <sandy@cloudera.com> Closes #95 from sryza/sandy-spark-1197 and squashes the following commits: 563ef3a [Sandy Ryza] Review feedback 6ad06d4 [Sandy Ryza] Change yarn-standalone to yarn-cluster and fix up running on YARN docs	2014-03-06 17:12:58 -08:00
Thomas Graves	7edbea41b4	SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets resubmit pull request. was https://github.com/apache/incubator-spark/pull/332. Author: Thomas Graves <tgraves@apache.org> Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits: dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser 05eebed [Thomas Graves] Fix dependency lost in upmerge d1040ec [Thomas Graves] Fix up various imports 05ff5e0 [Thomas Graves] Fix up imports after upmerging to master ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase 13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests. 4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets 2f77147 [Thomas Graves] Rework from comments 50dd9f2 [Thomas Graves] fix header in SecurityManager ecbfb65 [Thomas Graves] Fix spacing and formatting b514bec [Thomas Graves] Fix reference to config ed3d1c1 [Thomas Graves] Add security.md 6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments 2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework 5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets	2014-03-06 18:27:50 -06:00
Kyle Ellrott	40566e10aa	SPARK-942: Do not materialize partitions when DISK_ONLY storage level is used This is a port of a pull request original targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180 Essentially if a user returns a generative iterator (from a flatMap operation), when trying to persist the data, Spark would first unroll the iterator into an ArrayBuffer, and then try to figure out if it could store the data. In cases where the user provided an iterator that generated more data then available memory, this would case a crash. With this patch, if the user requests a persist with a 'StorageLevel.DISK_ONLY', the iterator will be unrolled as it is inputed into the serializer. To do this, two changes where made: 1) The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was modified to connect correctly. 2) The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects. This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) to write more compact serialization. If reset is never called, eventually the memory fills up, if it is called too often then the serialization streams become much larger because of redundant class descriptions. Author: Kyle Ellrott <kellrott@gmail.com> Closes #50 from kellrott/iterator-to-disk and squashes the following commits: 9ef7cb8 [Kyle Ellrott] Fixing formatting issues. 60e0c57 [Kyle Ellrott] Fixing issues (formatting, variable names, etc.) from review comments 8aa31cd [Kyle Ellrott] Merge ../incubator-spark into iterator-to-disk 33ac390 [Kyle Ellrott] Merge branch 'iterator-to-disk' of github.com:kellrott/incubator-spark into iterator-to-disk 2f684ea [Kyle Ellrott] Refactoring the BlockManager to replace the Either[Either[A,B]] usage. Now using trait 'Values'. Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues. f70d069 [Kyle Ellrott] Adding docs for spark.serializer.objectStreamReset configuration 7ccc74b [Kyle Ellrott] Moving the 'LargeIteratorSuite' to simply test persistance of iterators. It doesn't try to invoke a OOM error any more 16a4cea [Kyle Ellrott] Streamlined the LargeIteratorSuite unit test. It should now run in ~25 seconds. Confirmed that it still crashes an unpatched copy of Spark. c2fb430 [Kyle Ellrott] Removing more un-needed array-buffer to iterator conversions 627a8b7 [Kyle Ellrott] Wrapping a few long lines 0f28ec7 [Kyle Ellrott] Adding second putValues to BlockStore interface that accepts an ArrayBuffer (rather then an Iterator). This will allow BlockStores to have slightly different behaviors dependent on whether they get an Iterator or ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator into an ArrayBuffer, but if handed a ArrayBuffer, it can skip the duplication. 656c33e [Kyle Ellrott] Fixing the JavaSerializer to read from the SparkConf rather then the System property. 8644ee8 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 00c98e0 [Kyle Ellrott] Making the Java ObjectStreamSerializer reset rate configurable by the system variable 'spark.serializer.objectStreamReset', default is not 10000. 40fe1d7 [Kyle Ellrott] Removing rouge space 31fe08e [Kyle Ellrott] Removing un-needed semi-colons 9df0276 [Kyle Ellrott] Added check to make sure that streamed-to-dist RDD actually returns good data in the LargeIteratorSuite a6424ba [Kyle Ellrott] Wrapping long line 2eeda75 [Kyle Ellrott] Fixing dumb mistake ("\|\|" instead of "&&") 0e6f808 [Kyle Ellrott] Deleting temp output directory when done 95c7f67 [Kyle Ellrott] Simplifying StorageLevel checks 56f71cd [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 44ec35a [Kyle Ellrott] Adding some comments. 5eb2b7e [Kyle Ellrott] Changing the JavaSerializer reset to occur every 1000 objects. f403826 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk 81d670c [Kyle Ellrott] Adding unit test for straight to disk iterator methods. d32992f [Kyle Ellrott] Merge remote-tracking branch 'origin/master' into iterator-to-disk cac1fad [Kyle Ellrott] Fixing MemoryStore, so that it converts incoming iterators to ArrayBuffer objects. This was previously done higher up the stack. efe1102 [Kyle Ellrott] Changing CacheManager and BlockManager to pass iterators directly to the serializer when a 'DISK_ONLY' persist is called. This is in response to SPARK-942.	2014-03-06 14:51:19 -08:00
CodingCat	a3da508819	SPARK-1171: when executor is removed, we should minus totalCores instead of just freeCores on that executor https://spark-project.atlassian.net/browse/SPARK-1171 When the executor is removed, the current implementation will only minus the freeCores of that executor. Actually we should minus the totalCores... Author: CodingCat <zhunansjtu@gmail.com> Author: Nan Zhu <CodingCat@users.noreply.github.com> Closes #63 from CodingCat/simplify_CoarseGrainedSchedulerBackend and squashes the following commits: f6bf93f [Nan Zhu] code clean 19c2bb4 [CodingCat] use copy idiom to reconstruct the workerOffers 43c13e9 [CodingCat] keep WorkerOffer immutable af470d3 [CodingCat] style fix 0c0e409 [CodingCat] simplify the implementation of CoarseGrainedSchedulerBackend	2014-03-05 14:00:28 -08:00
Prashant Sharma	2d8e0a062c	SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally Author: Prashant Sharma <prashant.s@imaginea.com> Closes #72 from ScrapCodes/SPARK-1164/deprecate-reducebykeytodriver and squashes the following commits: ee521cd [Prashant Sharma] SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally	2014-03-04 10:27:02 -08:00
Prashant Sharma	181ec50307	[java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits: 95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch. 85a954e [Prashant Sharma] Nit. import orderings. 673f7ac [Prashant Sharma] Added support for -java-home as well 80a13e8 [Prashant Sharma] Used fake class tag syntax 26eb3f6 [Prashant Sharma] Patrick's comments on PR. 35d8d79 [Prashant Sharma] Specified java 8 building in the docs 31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag. 4ab87d3 [Prashant Sharma] Review feedback on the pr c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.	2014-03-03 22:31:30 -08:00
Kay Ousterhout	b14ede789a	Remove broken/unused Connection.getChunkFIFO method. This method appears to be broken -- since it never removes anything from messages, and it adds new messages to it, the while loop is an infinite loop. The method also does not appear to have ever been used since the code was added in 2012, so this commit removes it. cc @mateiz who originally added this method in case there's a reason it should be here! (`63051dd2bc`) Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #69 from kayousterhout/remove_get_fifo and squashes the following commits: 053bc59 [Kay Ousterhout] Remove broken/unused Connection.getChunkFIFO method.	2014-03-03 21:27:18 -08:00
Bryn Keller	923dba5096	Added a unit test for PairRDDFunctions.lookup Lookup didn't have a unit test. Added two tests, one for with a partitioner, and one for without. Author: Bryn Keller <bryn.keller@intel.com> Closes #36 from xoltar/lookup and squashes the following commits: 3bc0d44 [Bryn Keller] Added a unit test for PairRDDFunctions.lookup	2014-03-03 16:38:57 -08:00
Kay Ousterhout	b55cade853	Remove the remoteFetchTime metric. This metric is confusing: it adds up all of the time to fetch shuffle inputs, but fetches often happen in parallel, so remoteFetchTime can be much longer than the task execution time. @squito it looks like you added this metric -- do you have a use case for it? cc @shivaram -- I know you've looked at the shuffle performance a lot so chime in here if this metric has turned out to be useful for you! Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #62 from kayousterhout/remove_fetch_variable and squashes the following commits: 43341eb [Kay Ousterhout] Remote the remoteFetchTime metric.	2014-03-03 16:12:00 -08:00
Kay Ousterhout	369aad6f9e	Removed accidentally checked in comment It looks like this comment was added a while ago by @mridulm as part of a merge and was accidentally checked in. We should remove it. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #61 from kayousterhout/remove_comment and squashes the following commits: 0b2b3f2 [Kay Ousterhout] Removed accidentally checked in comment	2014-03-03 14:39:49 -08:00
Patrick Wendell	c3f5e07533	SPARK-1121: Include avro for yarn-alpha builds This lets us explicitly include Avro based on a profile for 0.23.X builds. It makes me sad how convoluted it is to express this logic in Maven. @tgraves and @sryza curious if this works for you. I'm also considering just reverting to how it was before. The only real problem was that Spark advertised a dependency on Avro even though it only really depends transitively on Avro through other deps. Author: Patrick Wendell <pwendell@gmail.com> Closes #49 from pwendell/avro-build-fix and squashes the following commits: 8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile	2014-03-02 15:18:19 -08:00
Sean Owen	fd31adbf27	SPARK-1084.2 (resubmitted) (Ported from https://github.com/apache/incubator-spark/pull/650 ) This adds one more change though, to fix the scala version warning introduced by json4s recently. Author: Sean Owen <sowen@cloudera.com> Closes #32 from srowen/SPARK-1084.2 and squashes the following commits: 9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency 1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway	2014-03-02 14:27:53 -08:00
Aaron Davidson	46bcb9551e	SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID Previously, ZooKeeperPersistenceEngine would crash the whole Master process if there was stored data from a prior Spark version. Now, we just delete these files. Author: Aaron Davidson <aaron@databricks.com> Closes #4 from aarondav/zookeeper2 and squashes the following commits: fa8b40f [Aaron Davidson] SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID	2014-03-02 01:00:42 -08:00
Patrick Wendell	1fd2bfd3dd	Remove remaining references to incubation This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier. Author: Patrick Wendell <pwendell@gmail.com> Closes #51 from pwendell/tlp and squashes the following commits: d553b1b [Patrick Wendell] Remove remaining references to incubation	2014-03-02 01:00:16 -08:00
CodingCat	3a8b698e96	[SPARK-1100] prevent Spark from overwriting directory silently Thanks for Diana Carroll to report this issue (https://spark-project.atlassian.net/browse/SPARK-1100) the current saveAsTextFile/SequenceFile will overwrite the output directory silently if the directory already exists, this behaviour is not desirable because overwriting the data silently is not user-friendly if the partition number of two writing operation changed, then the output directory will contain the results generated by two runnings My fix includes: add some new APIs with a flag for users to define whether he/she wants to overwrite the directory: if the flag is set to true, then the output directory is deleted first and then written into the new data to prevent the output directory contains results from multiple rounds of running; if the flag is set to false, Spark will throw an exception if the output directory already exists changed JavaAPI part default behaviour is overwriting Two questions should we deprecate the old APIs without such a flag? I noticed that Spark Streaming also called these APIs, I thought we don't need to change the related part in streaming? @tdas Author: CodingCat <zhunansjtu@gmail.com> Closes #11 from CodingCat/SPARK-1100 and squashes the following commits: 6a4e3a3 [CodingCat] code clean ef2d43f [CodingCat] add new test cases and code clean ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory	2014-03-01 17:27:54 -08:00
Patrick Wendell	ec992e1822	Revert "[SPARK-1150] fix repo location in create script" This reverts commit `9aa0957118`.	2014-03-01 17:15:38 -08:00
Mark Grover	9aa0957118	[SPARK-1150] fix repo location in create script https://spark-project.atlassian.net/browse/SPARK-1150 fix the repo location in create_release script Author: Mark Grover <mark@apache.org> Closes #48 from CodingCat/script_fixes and squashes the following commits: 01f4bf7 [Mark Grover] Fixing some nitpicks d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY	2014-03-01 16:21:22 -08:00
Kay Ousterhout	556c56689b	[SPARK-979] Randomize order of offers. This commit randomizes the order of resource offers to avoid scheduling all tasks on the same small set of machines. This is a much simpler solution to SPARK-979 than #7. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #27 from kayousterhout/randomize and squashes the following commits: 435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.	2014-03-01 11:24:22 -08:00
Sandy Ryza	46dff34458	SPARK-1051. On YARN, executors don't doAs submitting user This reopens https://github.com/apache/incubator-spark/pull/538 against the new repo Author: Sandy Ryza <sandy@cloudera.com> Closes #29 from sryza/sandy-spark-1051 and squashes the following commits: 708ce49 [Sandy Ryza] SPARK-1051. doAs submitting user in YARN	2014-02-28 12:43:01 -06:00
Kay Ousterhout	edf8a56ab7	Remote BlockFetchTracker trait This trait seems to have been created a while ago when there were multiple implementations; now that there's just one, I think it makes sense to merge it into the BlockFetcherIterator trait. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #39 from kayousterhout/remove_tracker and squashes the following commits: 8173939 [Kay Ousterhout] Remote BlockFetchTracker.	2014-02-27 21:52:55 -08:00
Patrick Wendell	c42557be32	[HOTFIX] Patching maven build after #6 (SPARK-1121). That patch removed the Maven avro declaration but didn't remove the actual dependency in core. /cc @scrapcodes Author: Patrick Wendell <pwendell@gmail.com> Closes #37 from pwendell/master and squashes the following commits: 0ef3008 [Patrick Wendell] [HOTFIX] Patching maven build after #6 (SPARK-1121).	2014-02-27 15:06:20 -08:00
Sean Owen	12bbca2065	SPARK 1084.1 (resubmitted) (Ported from https://github.com/apache/incubator-spark/pull/637 ) Author: Sean Owen <sowen@cloudera.com> Closes #31 from srowen/SPARK-1084.1 and squashes the following commits: 6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it f35b833 [Sean Owen] Fix two misc javadoc problems 254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit 5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates 007762b [Sean Owen] Remove dead scaladoc links b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>	2014-02-27 11:12:21 -08:00
Raymond Liu	aace2c097e	Show Master status on UI page For standalone HA mode, A status is useful to identify the current master, already in json format too. Author: Raymond Liu <raymond.liu@intel.com> Closes #24 from colorant/status and squashes the following commits: df630b3 [Raymond Liu] Show Master status on UI page	2014-02-26 23:51:32 -08:00
Xiangrui Meng	5a3ad107c0	SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom If the seed is zero, XORShift generates all zeros, which would create unexpected result. JIRA: https://spark-project.atlassian.net/browse/SPARK-1129 Author: Xiangrui Meng <meng@databricks.com> Closes #645 from mengxr/xor and squashes the following commits: 1b086ab [Xiangrui Meng] use MurmurHash3 to set seed in XORShiftRandom 45c6f16 [Xiangrui Meng] minor style change 51f4050 [Xiangrui Meng] use a predefined seed when seed is zero in XORShiftRandom	2014-02-26 23:22:30 -08:00
Kay Ousterhout	71f69d66ce	Remove references to ClusterScheduler (SPARK-1140) ClusterScheduler was renamed to TaskSchedulerImpl; this commit updates comments and tests accordingly. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9 from kayousterhout/cluster_scheduler_death and squashes the following commits: d6fd119 [Kay Ousterhout] Remove references to ClusterScheduler.	2014-02-26 22:52:42 -08:00
Prashant Sharma	0e40e2b126	Deprecated and added a few java api methods for corresponding scala api. PR [402](https://github.com/apache/incubator-spark/pull/402) from incubator repo. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits: 11d0c2b [Prashant Sharma] Integer -> java.lang.Integer 737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs. 3ddc8bb [Prashant Sharma] Deprected *With functions in scala and added a few missing Java APIs	2014-02-26 21:17:44 -08:00
William Benton	fbedc8eff2	SPARK-1078: Replace lift-json with json4s-jackson. The aim of the Json4s project is to provide a common API for Scala JSON libraries. It is Apache-licensed, easier for downstream distributions to package, and mostly API-compatible with lift-json. Furthermore, the Jackson-backed implementation parses faster than lift-json on all but the smallest inputs. Author: William Benton <willb@redhat.com> Closes #582 from willb/json4s and squashes the following commits: 7ca62c4 [William Benton] Replace lift-json with json4s-jackson.	2014-02-26 10:09:50 -08:00
Raymond Liu	c852201ce9	For SPARK-1082, Use Curator for ZK interaction in standalone cluster Author: Raymond Liu <raymond.liu@intel.com> Closes #611 from colorant/curator and squashes the following commits: 7556aa1 [Raymond Liu] Address review comments af92e1f [Raymond Liu] Fix coding style 964f3c2 [Raymond Liu] Ignore NodeExists exception 6df2966 [Raymond Liu] Rewrite zookeeper client code with curator	2014-02-24 23:20:38 -08:00
Bryn Keller	4d88030486	For outputformats that are Configurable, call setConf before sending data to them. [SPARK-1108] This allows us to use, e.g. HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw NullPointerException because the output table name hasn't been configured. Note this bug also affects branch-0.9 Author: Bryn Keller <bryn.keller@intel.com> Closes #638 from xoltar/SPARK-1108 and squashes the following commits: 7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review 7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured	2014-02-24 17:35:22 -08:00
Matei Zaharia	0187cef0f2	Fix removal from shuffleToMapStage to search for a key-value pair with our stage instead of using our shuffleID.	2014-02-24 13:14:56 -08:00
Matei Zaharia	cd32d5e4de	SPARK-1124: Fix infinite retries of reduce stage when a map stage failed In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null".	2014-02-23 23:48:32 -08:00
Sean Owen	c0ef3afa82	SPARK-1071: Tidy logging strategy and use of log4j Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger. Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j. This leaves the same behavior in Spark. It means that downstream users who want to use something except log4j should: - Exclude dependencies on log4j, slf4j-log4j12 from Spark - Include dependency on log4j-over-slf4j - Include dependency on another logger X, and another slf4j-X - Recreate any log config that Spark does, that is needed, in the other logger's config That sounds about right. Here are the key changes: - Include the jcl-over-slf4j shim everywhere by depending on it in core. - Exclude dependencies on commons-logging from third-party libraries. - Include the jul-to-slf4j shim everywhere by depending on it in core. - Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests And minor/incidental changes: - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2 - (Remove a duplicate HBase dependency declaration in SparkBuild.scala) - (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me) Author: Sean Owen <sowen@cloudera.com> Closes #570 from srowen/SPARK-1071 and squashes the following commits: 52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core. 77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j	2014-02-23 11:40:55 -08:00
Punya Biswal	29ac7ea52f	Migrate Java code to Scala or move it to src/main/java These classes can't be migrated: StorageLevels: impossible to create static fields in Scala JavaSparkContextVarargsWorkaround: incompatible varargs JavaAPISuite: should test Java APIs in pure Java (for sanity) Author: Punya Biswal <pbiswal@palantir.com> Closes #605 from punya/move-java-sources and squashes the following commits: 25b00b2 [Punya Biswal] Remove redundant type param; reformat 853da46 [Punya Biswal] Use factory method rather than constructor e5d53d9 [Punya Biswal] Migrate Java code to Scala or move it to src/main/java	2014-02-22 17:53:48 -08:00
Xiangrui Meng	aaec7d4a80	SPARK-1117: update accumulator docs The current doc hints spark doesn't support accumulators of type `Long`, which is wrong. JIRA: https://spark-project.atlassian.net/browse/SPARK-1117 Author: Xiangrui Meng <meng@databricks.com> Closes #631 from mengxr/acc and squashes the following commits: 45ecd25 [Xiangrui Meng] update accumulator docs	2014-02-21 22:44:45 -08:00
Andrew Or	fefd22f4c3	[SPARK-1113] External spilling - fix Int.MaxValue hash code collision bug The original poster of this bug is @guojc, who opened a PR that preceded this one at https://github.com/apache/incubator-spark/pull/612. ExternalAppendOnlyMap uses key hash code to order the buffer streams from which spilled files are read back into memory. When a buffer stream is empty, the default hash code for that stream is equal to Int.MaxValue. This is, however, a perfectly legitimate candidate for a key hash code. When reading from a spilled map containing such a key, a hash collision may occur, in which case we attempt to read from an empty stream and throw NoSuchElementException. The fix is to maintain the invariant that empty buffer streams are never added back to the merge queue to be considered. This guarantees that we never read from an empty buffer stream, ever again. This PR also includes two new tests for hash collisions. Author: Andrew Or <andrewor14@gmail.com> Closes #624 from andrewor14/spilling-bug and squashes the following commits: 9e7263d [Andrew Or] Slightly optimize next() 2037ae2 [Andrew Or] Move a few comments around... cf95942 [Andrew Or] Remove default value of Int.MaxValue for minKeyHash c11f03b [Andrew Or] Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap 21c1a39 [Andrew Or] Add hash collision tests to ExternalAppendOnlyMapSuite	2014-02-21 20:05:39 -08:00
Patrick Wendell	45b15e27a8	SPARK-1111: URL Validation Throws Error for HDFS URL's Fixes an error where HDFS URL's cause an exception. Should be merged into master and 0.9. Author: Patrick Wendell <pwendell@gmail.com> Closes #625 from pwendell/url-validation and squashes the following commits: d14bfe3 [Patrick Wendell] SPARK-1111: URL Validation Throws Error for HDFS URL's	2014-02-21 11:11:55 -08:00
Aaron Davidson	3fede4831e	Super minor: Add require for mergeCombiners in combineByKey We changed the behavior in 0.9.0 from requiring that mergeCombiners be null when mapSideCombine was false to requiring that mergeCombiners never be null, for external sorting. This patch adds a require() to make this behavior change explicitly messaged rather than resulting in a NPE. Author: Aaron Davidson <aaron@databricks.com> Closes #623 from aarondav/master and squashes the following commits: 520b80c [Aaron Davidson] Super minor: Add require for mergeCombiners in combineByKey	2014-02-20 16:46:13 -08:00
NirmalReddy	ccb327a49a	Optimized imports Optimized imports and arranged according to scala style guide @ https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports Author: NirmalReddy <nirmal.reddy@imaginea.com> Author: NirmalReddy <nirmal_reddy2000@yahoo.com> Closes #613 from NirmalReddy/opt-imports and squashes the following commits: 578b4f5 [NirmalReddy] imported java.lang.Double as JDouble a2cbcc5 [NirmalReddy] addressed the comments 776d664 [NirmalReddy] Optimized imports in core	2014-02-18 14:44:36 -08:00
Aaron Davidson	f74ae0ebce	SPARK-1098: Minor cleanup of ClassTag usage in Java API Our usage of fake ClassTags in this manner is probably not healthy, but I'm not sure if there's a better solution available, so I just cleaned up and documented the current one. Author: Aaron Davidson <aaron@databricks.com> Closes #604 from aarondav/master and squashes the following commits: b398e89 [Aaron Davidson] SPARK-1098: Minor cleanup of ClassTag usage in Java API	2014-02-17 19:23:27 -08:00
Andrew Ash	c0795cf481	Worker registration logging fix Author: Andrew Ash <andrew@andrewash.com> Closes #608 from ash211/patch-7 and squashes the following commits: bd85f2a [Andrew Ash] Worker registration logging fix	2014-02-17 09:51:55 -08:00
Punya Biswal	5af4477c2b	Add subtractByKey to the JavaPairRDD wrapper Author: Punya Biswal <pbiswal@palantir.com> Closes #600 from punya/subtractByKey-java and squashes the following commits: e961913 [Punya Biswal] Hide implicit ClassTags from Java API c5d317b [Punya Biswal] Add subtractByKey to the JavaPairRDD wrapper	2014-02-16 18:55:59 -08:00
Bijay Bisht	73cfdcfe71	fix for https://spark-project.atlassian.net/browse/SPARK-1052 Author: Bijay Bisht <bijay.bisht@gmail.com> Closes #568 from bijaybisht/SPARK-1052 and squashes the following commits: da70395 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 - comments incorporated fdb1d94 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 (cherry picked from commit `e797c1abd9`) Signed-off-by: Aaron Davidson <aaron@databricks.com>	2014-02-16 16:54:03 -08:00
CodingCat	1cad381387	[SPARK-1092] print warning information if user use SPARK_MEM to regulate executor memory usage https://spark-project.atlassian.net/browse/SPARK-1092?jql=project%20%3D%20SPARK print warning information if user set SPARK_MEM to regulate memory usage of executors ---- OUTDATED: Currently, users will usually set SPARK_MEM to control the memory usage of driver programs, (in spark-class) 91 JAVA_OPTS="$OUR_JAVA_OPTS" 92 JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH" 93 JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM" if they didn't set spark.executor.memory, the value in this environment variable will also affect the memory usage of executors, because the following lines in SparkContext privatespark val executorMemory = conf.getOption("spark.executor.memory") .orElse(Option(System.getenv("SPARK_MEM"))) .map(Utils.memoryStringToMb) .getOrElse(512) also since SPARK_MEM has been (proposed to) deprecated in SPARK-929 (https://spark-project.atlassian.net/browse/SPARK-929) and the corresponding PR (https://github.com/apache/incubator-spark/pull/104) we should remove this line Author: CodingCat <zhunansjtu@gmail.com> Closes #602 from CodingCat/clean_spark_mem and squashes the following commits: 302bb28 [CodingCat] print warning information if user use SPARK_MEM to regulate executor memory usage	2014-02-16 12:25:38 -08:00
Xiangrui Meng	7e29e02791	Merge pull request #591 from mengxr/transient-new. SPARK-1076: [Fix #578] add @transient to some vals I'll try to be more careful next time. Author: Xiangrui Meng <meng@databricks.com> Closes #591 and squashes the following commits: 2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD add @transient to seed in PartitionwiseSampledRDD	2014-02-12 16:26:25 -08:00
Xiangrui Meng	2bea0709f9	Merge pull request #589 from mengxr/index. SPARK-1076: Convert Int to Long to avoid overflow Patch for PR #578. Author: Xiangrui Meng <meng@databricks.com> Closes #589 and squashes the following commits: 98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow	2014-02-12 10:47:52 -08:00
Xiangrui Meng	e733d655df	Merge pull request #578 from mengxr/rank. SPARK-1076: zipWithIndex and zipWithUniqueId to RDD Assign ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assign ranks in parallel. The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one spark job. https://spark-project.atlassian.net/browse/SPARK-1076 == update == Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field. Author: Xiangrui Meng <meng@databricks.com> Closes #578 and squashes the following commits: 52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex changed assignUniqueIds to zipWithUniqueId minor updates 756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately moved couting iterator size to Utils do not count items in the last partition and skip counting if there is only one partition 630868c [Xiangrui Meng] newline 21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD	2014-02-12 00:42:42 -08:00
Raymond Liu	68b2c0d02d	Merge pull request #583 from colorant/zookeeper. Minor fix for ZooKeeperPersistenceEngine to use configured working dir Author: Raymond Liu <raymond.liu@intel.com> Closes #583 and squashes the following commits: 91b0609 [Raymond Liu] Minor fix for ZooKeeperPersistenceEngine to use configured working dir	2014-02-11 22:39:48 -08:00
Holden Karau	b0dab1bb9f	Merge pull request #571 from holdenk/switchtobinarysearch. SPARK-1072 Use binary search when needed in RangePartioner Author: Holden Karau <holden@pigscanfly.ca> Closes #571 and squashes the following commits: f31a2e1 [Holden Karau] Swith to using CollectionsUtils in Partitioner 4c7a0c3 [Holden Karau] Add CollectionsUtil as suggested by aarondav 7099962 [Holden Karau] Add the binary search to only init once 1bef01d [Holden Karau] CR feedback a21e097 [Holden Karau] Use binary search if we have more than 1000 elements inside of RangePartitioner	2014-02-11 14:48:59 -08:00
Patrick Wendell	d6a9bdc097	Revert "Merge pull request #560 from pwendell/logging. Closes #560." This reverts commit `b6d40b7823`.	2014-02-09 23:35:06 -08:00
Prashant Sharma	919bd7f669	Merge pull request #567 from ScrapCodes/style2. SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2 Continuation of PR #557 With this all scala style errors are fixed across the code base !! The reason for creating a separate PR was to not interrupt an already reviewed and ready to merge PR. Hope this gets reviewed soon and merged too. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #567 and squashes the following commits: 3b1ec30 [Prashant Sharma] scala style fixes	2014-02-09 22:17:52 -08:00
qqsun8819	afc8f3cb9a	Merge pull request #551 from qqsun8819/json-protocol. [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself This is a PR for SPARK-1038. Two major changes: 1 add some fields to JsonProtocol which is new and important to standalone-related data structures 2 Use Diff in liftweb.json to verity the stringified Json output for detecting someone mod type T to Option[T] Author: qqsun8819 <jin.oyj@alibaba-inc.com> Closes #551 and squashes the following commits: fdf0b4e [qqsun8819] [SPARK-1038] 1. Change code style for more readable according to rxin review 2. change submitdate hard-coded string to a date object toString for more complexiblity 095a26f [qqsun8819] [SPARK-1038] mod according to review of pwendel, use hard-coded json string for json data validation. Each test use its own json string 0524e41 [qqsun8819] Merge remote-tracking branch 'upstream/master' into json-protocol d203d5c [qqsun8819] [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself	2014-02-09 13:57:29 -08:00
Patrick Wendell	b69f8b2a01	Merge pull request #557 from ScrapCodes/style. Closes #557 . SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Author: Patrick Wendell <pwendell@gmail.com> Author: Prashant Sharma <scrapcodes@gmail.com> == Merge branch commits == commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4 Author: Prashant Sharma <scrapcodes@gmail.com> Date: Sun Feb 9 17:39:07 2014 +0530 scala style fixes commit f91709887a8e0b608c5c2b282db19b8a44d53a43 Author: Patrick Wendell <pwendell@gmail.com> Date: Fri Jan 24 11:22:53 2014 -0800 Adding scalastyle snapshot	2014-02-09 10:09:19 -08:00
CodingCat	b6dba10ae5	Merge pull request #556 from CodingCat/JettyUtil. Closes #556 . [SPARK-1060] startJettyServer should explicitly use IP information https://spark-project.atlassian.net/browse/SPARK-1060 In the current implementation, the webserver in Master/Worker is started with val (srv, bPort) = JettyUtils.startJettyServer("0.0.0.0", port, handlers) inside startJettyServer: val server = new Server(currentPort) //here, the Server will take "0.0.0.0" as the hostname, i.e. will always bind to the IP address of the first NIC this can cause wrong IP binding, e.g. if the host has two NICs, N1 and N2, the user specify the SPARK_LOCAL_IP as the N2's IP address, however, when starting the web server, for the reason stated above, it will always bind to the N1's address Author: CodingCat <zhunansjtu@gmail.com> == Merge branch commits == commit 6c6d9a8ccc9ec4590678a3b34cb03df19092029d Author: CodingCat <zhunansjtu@gmail.com> Date: Thu Feb 6 14:53:34 2014 -0500 startJettyServer should explicitly use IP information	2014-02-08 23:39:17 -08:00
Patrick Wendell	b6d40b7823	Merge pull request #560 from pwendell/logging. Closes #560 . [WIP] SPARK-1067: Default log4j initialization causes errors for those not using log4j To fix this - we add a check when initializing log4j. Author: Patrick Wendell <pwendell@gmail.com> == Merge branch commits == commit ffdce513877f64b6eed6d36138c3e0003d392889 Author: Patrick Wendell <pwendell@gmail.com> Date: Fri Feb 7 15:22:29 2014 -0800 Logging fix	2014-02-08 23:35:31 -08:00
Mark Hamstra	c2341c92bb	Merge pull request #542 from markhamstra/versionBump. Closes #542 . Version number to 1.0.0-SNAPSHOT Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore. @pwendell Author: Mark Hamstra <markhamstra@gmail.com> == Merge branch commits == commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71 Author: Mark Hamstra <markhamstra@gmail.com> Date: Wed Feb 5 09:30:32 2014 -0800 Version number to 1.0.0-SNAPSHOT	2014-02-08 16:00:43 -08:00
Qiuzhuang Lian	f0ce736fad	Merge pull request #561 from Qiuzhuang/master. Closes #561 . Kill drivers in postStop() for Worker. JIRA SPARK-1068:https://spark-project.atlassian.net/browse/SPARK-1068 Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com> == Merge branch commits == commit 9c19ce63637eee9369edd235979288d3d9fc9105 Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com> Date: Sat Feb 8 16:07:39 2014 +0800 Kill drivers in postStop() for Worker. JIRA SPARK-1068:https://spark-project.atlassian.net/browse/SPARK-1068	2014-02-08 12:59:48 -08:00
Andrew Ash	3a9d82cc9e	Merge pull request #506 from ash211/intersection. Closes #506 . SPARK-1062 Add rdd.intersection(otherRdd) method Author: Andrew Ash <andrew@andrewash.com> == Merge branch commits == commit 5d9982b171b9572649e9828f37ef0b43f0242912 Author: Andrew Ash <andrew@andrewash.com> Date: Thu Feb 6 18:11:45 2014 -0800 Minor fixes - style: (v,null) => (v, null) - mention the shuffle in Javadoc commit b86d02f14e810902719cef893cf6bfa18ff9acb0 Author: Andrew Ash <andrew@andrewash.com> Date: Sun Feb 2 13:17:40 2014 -0800 Overload .intersection() for numPartitions and custom Partitioner commit bcaa34911fcc6bb5bc5e4f9fe46d1df73cb71c09 Author: Andrew Ash <andrew@andrewash.com> Date: Sun Feb 2 13:05:40 2014 -0800 Better naming of parameters in intersection's filter commit b10a6af2d793ec6e9a06c798007fac3f6b860d89 Author: Andrew Ash <andrew@andrewash.com> Date: Sat Jan 25 23:06:26 2014 -0800 Follow spark code format conventions of tab => 2 spaces commit 965256e4304cca514bb36a1a36087711dec535ec Author: Andrew Ash <andrew@andrewash.com> Date: Fri Jan 24 00:28:01 2014 -0800 Add rdd.intersection(otherRdd) method	2014-02-06 22:39:08 -08:00
Andrew Or	1896c6e7c9	Merge pull request #533 from andrewor14/master. Closes #533 . External spilling - generalize batching logic The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way. Author: Andrew Or <andrewor14@gmail.com> == Merge branch commits == commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82 Author: Andrew Or <andrewor14@gmail.com> Date: Wed Feb 5 12:09:32 2014 -0800 Also privatize fields commit 090544a87a0767effd0c835a53952f72fc8d24f0 Author: Andrew Or <andrewor14@gmail.com> Date: Wed Feb 5 10:58:23 2014 -0800 Privatize methods commit 13920c918efe22e66a1760b14beceb17a61fd8cc Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 16:34:15 2014 -0800 Update docs commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3 Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 13:44:24 2014 -0800 Typo: phyiscal -> physical commit 287ef44e593ad72f7434b759be3170d9ee2723d2 Author: Andrew Or <andrewor14@gmail.com> Date: Tue Feb 4 13:38:32 2014 -0800 Avoid reading the entire batch into memory; also simplify streaming logic Additionally, address formatting comments. commit 3df700509955f7074821e9aab1e74cb53c58b5a5 Merge: a531d2e 164489d Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:27:49 2014 -0800 Merge branch 'master' of github.com:andrewor14/incubator-spark commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8 Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF. commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af Author: Andrew Or <andrewor14@gmail.com> Date: Mon Feb 3 18:18:04 2014 -0800 Relax assumptions on compressors and serializers when batching This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.	2014-02-06 22:05:53 -08:00
Kay Ousterhout	0b448df6ac	Merge pull request #450 from kayousterhout/fetch_failures. Closes #450 . Only run ResubmitFailedStages event after a fetch fails Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs. Author: Kay Ousterhout <kayousterhout@gmail.com> == Merge branch commits == commit e603784b3a562980e6f1863845097effe2129d3b Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Feb 5 11:34:41 2014 -0800 Re-add check for empty set of failed stages commit d258f0ef50caff4bbb19fb95a6b82186db1935bf Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Jan 15 23:35:41 2014 -0800 Only run ResubmitFailedStages event after a fetch fails Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs.	2014-02-06 16:15:24 -08:00
Kay Ousterhout	18ad59e2c6	Merge pull request #321 from kayousterhout/ui_kill_fix. Closes #321 . Inform DAG scheduler about all started/finished tasks. Previously, the DAG scheduler was not always informed when tasks started and finished. The simplest example here is for speculated tasks: the DAGScheduler was only told about the first attempt of a task, meaning that SparkListeners were also not told about multiple task attempts, so users can't see what's going on with speculation in the UI. The DAGScheduler also wasn't always told about finished tasks, so in the UI, some tasks will never be shown as finished (this occurs, for example, if a task set gets killed). The other problem is that the fairness accounting was wrong -- the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished. Author: Kay Ousterhout <kayousterhout@gmail.com> == Merge branch commits == commit c8d547d0f7a17f5a193bef05f5872b9f475675c5 Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Jan 15 16:47:33 2014 -0800 Addressed Reynold's review comments. Always use a TaskEndReason (remove the option), and explicitly signal when we don't know the reason. Also, always tell DAGScheduler (and associated listeners) about started tasks, even when they're speculated. commit 3fee1e2e3c06b975ff7f95d595448f38cce97a04 Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Jan 8 22:58:13 2014 -0800 Fixed broken test and improved logging commit ff12fcaa2567c5d02b75a1d5db35687225bcd46f Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Sun Dec 29 21:08:20 2013 -0800 Inform DAG scheduler about all finished tasks. Previously, the DAG scheduler was not always informed when tasks finished. For example, when a task set was aborted, the DAG scheduler was never told when the tasks in that task set finished. The DAG scheduler was also never told about the completion of speculated tasks. This led to confusion with SparkListeners because information about the completion of those tasks was never passed on to the listeners (so in the UI, for example, some tasks will never be shown as finished). The other problem is that the fairness accounting was wrong -- the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished.	2014-02-06 16:10:48 -08:00
Sandy Ryza	446403b637	Merge pull request #554 from sryza/sandy-spark-1056. Closes #554 . SPARK-1056. Fix header comment in Executor to not imply that it's only u... ...sed for Mesos and Standalone. Author: Sandy Ryza <sandy@cloudera.com> == Merge branch commits == commit 1f2443d902a26365a5c23e4af9077e1539ed2eab Author: Sandy Ryza <sandy@cloudera.com> Date: Thu Feb 6 15:03:50 2014 -0800 SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone	2014-02-06 15:41:16 -08:00
Kay Ousterhout	79c95527a7	Merge pull request #545 from kayousterhout/fix_progress. Closes #545 . Fix off-by-one error with task progress info log. Author: Kay Ousterhout <kayousterhout@gmail.com> == Merge branch commits == commit 29798fc685c4e7e3eb3bf91c75df7fa8ec94a235 Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Feb 5 13:40:01 2014 -0800 Fix off-by-one error with task progress info log.	2014-02-05 23:38:12 -08:00
CodingCat	18c4ee71e2	Merge pull request #549 from CodingCat/deadcode_master. Closes #549 . remove actorToWorker in master.scala, which is actually not used actorToWorker is actually not used in the code....just remove it Author: CodingCat <zhunansjtu@gmail.com> == Merge branch commits == commit 52656c2d4bbf9abcd8bef65d454badb9cb14a32c Author: CodingCat <zhunansjtu@gmail.com> Date: Thu Feb 6 00:28:26 2014 -0500 remove actorToWorker in master.scala, which is actually not used	2014-02-05 22:08:47 -08:00
Kay Ousterhout	cc14ba974c	Merge pull request #544 from kayousterhout/fix_test_warnings. Closes #544 . Fixed warnings in test compilation. This commit fixes two problems: a redundant import, and a deprecated function. Author: Kay Ousterhout <kayousterhout@gmail.com> == Merge branch commits == commit da9d2e13ee4102bc58888df0559c65cb26232a82 Author: Kay Ousterhout <kayousterhout@gmail.com> Date: Wed Feb 5 11:41:51 2014 -0800 Fixed warnings in test compilation. This commit fixes two problems: a redundant import, and a deprecated function.	2014-02-05 12:44:24 -08:00
Stevo Slavić	0c05cd374d	Merge pull request #535 from sslavic/patch-2. Closes #535 . Fixed typo in scaladoc Author: Stevo Slavić <sslavic@gmail.com> == Merge branch commits == commit 0a77f789e281930f4168543cc0d3b3ffbf5b3764 Author: Stevo Slavić <sslavic@gmail.com> Date: Tue Feb 4 15:30:27 2014 +0100 Fixed typo in scaladoc	2014-02-04 09:45:46 -08:00
Xiangrui Meng	23af00f9e0	Merge pull request #528 from mengxr/sample. Closes #528 . Refactor RDD sampling and add randomSplit to RDD (update) Replace SampledRDD by PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sample with/without replacement can be easily integrated via BernoulliSampler and PoissonSampler. The benefits are: 1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513 2) Stratified sampling and importance sampling can be implemented in the same manner as well. Unit tests are included for samplers and RDD.randomSplit. This should performance better than my previous request where the BernoulliSampler creates many Iterator instances: https://github.com/apache/incubator-spark/pull/513 Author: Xiangrui Meng <meng@databricks.com> == Merge branch commits == commit e8ce957e5f0a600f2dec057924f4a2ca6adba373 Author: Xiangrui Meng <meng@databricks.com> Date: Mon Feb 3 12:21:08 2014 -0800 more docs to PartitionwiseSampledRDD commit fbb4586d0478ff638b24bce95f75ff06f713d43b Author: Xiangrui Meng <meng@databricks.com> Date: Mon Feb 3 00:44:23 2014 -0800 move XORShiftRandom to util.random and use it in BernoulliSampler commit 987456b0ee8612fd4f73cb8c40967112dc3c4c2d Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 11:06:59 2014 -0800 relax assertions in SortingSuite because the RangePartitioner has large variance in this case commit 3690aae416b2dc9b2f9ba32efa465ba7948477f4 Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 09:56:28 2014 -0800 test split ratio of RDD.randomSplit commit 8a410bc933a60c4d63852606f8bbc812e416d6ae Author: Xiangrui Meng <meng@databricks.com> Date: Sat Feb 1 09:25:22 2014 -0800 add a test to ensure seed distribution and minor style update commit ce7e866f674c30ab48a9ceb09da846d5362ab4b6 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 18:06:22 2014 -0800 minor style change commit 750912b4d77596ed807d361347bd2b7e3b9b7a74 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 18:04:54 2014 -0800 fix some long lines commit c446a25c38d81db02821f7f194b0ce5ab4ed7ff5 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 17:59:59 2014 -0800 add complement to BernoulliSampler and minor style changes commit dbe2bc2bd888a7bdccb127ee6595840274499403 Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 17:45:08 2014 -0800 switch to partition-wise sampling for better performance commit a1fca5232308feb369339eac67864c787455bb23 Merge: `ac712e4` cf6128f Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 31 16:33:09 2014 -0800 Merge branch 'sample' of github.com:mengxr/incubator-spark into sample commit cf6128fb672e8c589615adbd3eaa3cbdb72bd461 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 14:40:07 2014 -0800 set SampledRDD deprecated in 1.0 commit f430f847c3df91a3894687c513f23f823f77c255 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 14:38:59 2014 -0800 update code style commit a8b5e2021a9204e318c80a44d00c5c495f1befb6 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 12:56:27 2014 -0800 move package random to util.random commit ab0fa2c4965033737a9e3a9bf0a59cbb0df6a6f5 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 12:50:35 2014 -0800 add Apache headers and update code style commit 985609fe1a55655ad11966e05a93c18c138a403d Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 11:49:25 2014 -0800 add new lines commit b21bddf29850a2c006a868869b8f91960a029322 Author: Xiangrui Meng <meng@databricks.com> Date: Sun Jan 26 11:46:35 2014 -0800 move samplers to random.IndependentRandomSampler and add tests commit c02dacb4a941618e434cefc129c002915db08be6 Author: Xiangrui Meng <meng@databricks.com> Date: Sat Jan 25 15:20:24 2014 -0800 add RandomSampler commit 8ff7ba3c5cf1fc338c29ae8b5fa06c222640e89c Author: Xiangrui Meng <meng@databricks.com> Date: Fri Jan 24 13:23:22 2014 -0800 init impl of IndependentlySampledRDD	2014-02-03 13:02:09 -08:00
Aaron Davidson	1625d8c446	Merge pull request #530 from aarondav/cleanup. Closes #530 . Remove explicit conversion to PairRDDFunctions in cogroup() As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however. Author: Aaron Davidson <aaron@databricks.com> == Merge branch commits == commit aa4a63f1bfd5b5178fe67364dd7ce4d84c357996 Author: Aaron Davidson <aaron@databricks.com> Date: Sun Feb 2 23:48:04 2014 -0800 Remove explicit conversion to PairRDDFunctions in cogroup() As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the converion explicitly, however.	2014-02-03 11:25:39 -08:00
Erik Selin	0ff38c2220	Merge pull request #494 from tyro89/worker_registration_issue Issue with failed worker registrations I've been going through the spark source after having some odd issues with workers dying and not coming back. After some digging (I'm very new to scala and spark) I believe I've found a worker registration issue. It looks to me like a failed registration follows the same code path as a successful registration which end up with workers believing they are connected (since they received a `RegisteredWorker` event) even tho they are not registered on the Master. This is a quick fix that I hope addresses this issue (assuming I didn't completely miss-read the code and I'm about to look like a silly person :P) I'm opening this pr now to start a chat with you guys while I do some more testing on my side :) Author: Erik Selin <erik.selin@jadedpixel.com> == Merge branch commits == commit 973012f8a2dcf1ac1e68a69a2086a1b9a50f401b Author: Erik Selin <erik.selin@jadedpixel.com> Date: Tue Jan 28 23:36:12 2014 -0500 break logwarning into two lines to respect line character limit. commit e3754dc5b94730f37e9806974340e6dd93400f85 Author: Erik Selin <erik.selin@jadedpixel.com> Date: Tue Jan 28 21:16:21 2014 -0500 add log warning when worker registration fails due to attempt to re-register on same address. commit 14baca241fa7823e1213cfc12a3ff2a9b865b1ed Author: Erik Selin <erik.selin@jadedpixel.com> Date: Wed Jan 22 21:23:26 2014 -0500 address code style comment commit 71c0d7e6f59cd378d4e24994c21140ab893954ee Author: Erik Selin <erik.selin@jadedpixel.com> Date: Wed Jan 22 16:01:42 2014 -0500 Make a failed registration not persist, not send a `RegisteredWordker` event and not run `schedule` but rather send a `RegisterWorkerFailed` message to the worker attempting to register.	2014-01-29 12:44:54 -08:00
Josh Rosen	1381fc72f7	Switch from MUTF8 to UTF8 in PySpark serializers. This fixes SPARK-1043, a bug introduced in 0.9.0 where PySpark couldn't serialize strings > 64kB. This fix was written by @tyro89 and @bouk in #512. This commit squashes and rebases their pull request in order to fix some merge conflicts.	2014-01-28 20:20:08 -08:00
Reynold Xin	84670f2715	Merge pull request #466 from liyinan926/file-overwrite-new Allow files added through SparkContext.addFile() to be overwritten This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.	2014-01-27 17:08:35 -08:00
Reynold Xin	f16c21e22f	Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined Replace the check for None Option with isDefined and isEmpty in Scala code Propose to replace the Scala check for Option "!= None" with Option.isDefined and "=== None" with Option.isEmpty. I think this, using method call if possible then operator function plus argument, will make the Scala code easier to read and understand. Pass compile and tests.	2014-01-27 14:24:06 -08:00
Reynold Xin	c40619d487	Merge pull request #504 from JoshRosen/SPARK-1025 Fix PySpark hang when input files are deleted (SPARK-1025) This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.	2014-01-25 22:41:30 -08:00
Josh Rosen	740e865f40	Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040) This fixes an issue where collectAsMap() could fail when called on a JavaPairRDD that was derived by transforming a non-JavaPairRDD. The root problem was that we were creating the JavaPairRDD's ClassTag by casting a ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]]. To fix this, I cast a ClassTag[Tuple2[_, _]] instead, since this actually produces a ClassTag of the appropriate type because ClassTags don't capture type parameters: scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]] res8: Boolean = true scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]] res9: Boolean = false	2014-01-25 16:41:12 -08:00
Patrick Wendell	3d6e754193	Merge pull request #503 from pwendell/master Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it.	2014-01-23 19:47:00 -08:00
Patrick Wendell	ff44732171	Minor fix	2014-01-23 19:23:12 -08:00
Patrick Wendell	c3196171f3	Merge pull request #502 from pwendell/clone-1 Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.	2014-01-23 19:11:59 -08:00
Patrick Wendell	cad3002fea	Merge pull request #501 from JoshRosen/cartesian-rdd-fixes Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034 This pull request fixes two bugs in PySpark's `cartesian()` method: - [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception - [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result The JIRAs have more details describing the fixes.	2014-01-23 19:08:34 -08:00
Patrick Wendell	268ecbd231	Minor changes after auditing diff from earlier version	2014-01-23 18:30:11 -08:00
Josh Rosen	f83068497b	Fix for SPARK-1025: PySpark hang on missing files.	2014-01-23 18:24:51 -08:00
Patrick Wendell	c58d4ea3d4	Response to Matei's review	2014-01-23 18:12:40 -08:00
Patrick Wendell	0213b4032a	Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it.	2014-01-23 18:04:55 -08:00
Patrick Wendell	7101017803	Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in verious file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.	2014-01-23 17:39:23 -08:00
Josh Rosen	61569906cc	Fix SPARK-978: ClassCastException in PySpark cartesian.	2014-01-23 15:09:19 -08:00
Josh Rosen	0035dbbc81	Fix SPARK-1034: Py4JException on PySpark Cartesian Result	2014-01-23 13:05:59 -08:00
Josh Rosen	fad6aacfb0	Merge pull request #406 from eklavya/master Extending Java API coverage Hi, I have added three new methods to JavaRDD. Please review and merge.	2014-01-23 11:14:15 -08:00
eklavya	60e7457266	fixed ClassTag in mapPartitions	2014-01-23 17:40:36 +05:30
Patrick Wendell	a1cd185122	Merge pull request #496 from pwendell/master Fix bug in worker clean-up in UI Introduced in `d5a96fec` (/cc @aarondav). This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.	2014-01-22 19:37:29 -08:00
Patrick Wendell	034dce2a7e	Merge pull request #447 from CodingCat/SPARK-1027 fix for SPARK-1027 fix for SPARK-1027 (https://spark-project.atlassian.net/browse/SPARK-1027) FIXES 1. change sparkhome from String to Option(String) in ApplicationDesc 2. remove sparkhome parameter in LaunchExecutor message 3. adjust involved files	2014-01-22 18:58:02 -08:00
Patrick Wendell	6285513147	Fix bug in worker clean-up in UI Introduced in `d5a96fec`. This should be picked into 0.8 and 0.9 as well.	2014-01-22 18:19:52 -08:00
CodingCat	2b3c461451	refactor sparkHome to val clean code	2014-01-22 20:20:46 -05:00
Kay Ousterhout	19da82c50f	Fixed bug where task set managers are added to queue twice This bug leads to a small performance hit because task set managers will get offered each rejected resource offer twice, but doesn't lead to any incorrect functionality.	2014-01-22 09:52:12 -08:00
Henry Saputra	90ea9d5a8f	Replace the code to check for Option != None with Option.isDefined call in Scala code. This hopefully will make the code cleaner.	2014-01-21 23:22:10 -08:00
Patrick Wendell	a9bcc980b6	Style clean-up	2014-01-21 00:05:28 -08:00
Patrick Wendell	a917a87e02	Adding small code comment	2014-01-20 23:11:45 -08:00
Patrick Wendell	d46df96de3	Avoid matching attempt files in the checkpoint	2014-01-20 20:03:23 -08:00
Patrick Wendell	de526ad527	Remove shuffle files if they are still present on a machine.	2014-01-20 19:11:22 -08:00
Patrick Wendell	f84400e86c	Fixing speculation bug	2014-01-20 19:05:03 -08:00
Patrick Wendell	c324ac10ee	Force use of LZF when spilling data	2014-01-20 19:00:48 -08:00
Patrick Wendell	1b299142a8	Bug fix for reporting of spill output	2014-01-20 18:34:00 -08:00
Patrick Wendell	54867e9566	Minor fixes	2014-01-20 18:33:21 -08:00
Patrick Wendell	cdb003e376	Removing docs on akka options	2014-01-20 16:40:58 -08:00
CodingCat	29f4b6a2d9	fix for SPARK-1027 change TestClient & Worker to Some("xxx") kill manager if it is started remove unnecessary .get when fetch "SPARK_HOME" values	2014-01-20 02:50:30 -05:00
CodingCat	f9a95d6736	executor creation failed should not make the worker restart	2014-01-20 02:50:30 -05:00
Thomas Graves	dd56b2125e	update comment	2014-01-19 12:21:39 -06:00
Thomas Graves	ceb79a3931	Only log error on missing jar to allow spark examples to jar.	2014-01-19 12:16:58 -06:00
Yinan Li	584323c6b1	Addressed comments from Reynold Signed-off-by: Yinan Li <liyinan926@gmail.com>	2014-01-18 21:28:17 -08:00
Patrick Wendell	73dfd42fba	Merge pull request #437 from mridulm/master Minor api usability changes - Expose checkpoint directory - since it is autogenerated now - null check for jars - Expose SparkHadoopUtil : so that configuration creation is abstracted even from user code to avoid duplication of functionality already in spark.	2014-01-18 16:23:56 -08:00
Patrick Wendell	bf5699543b	Merge pull request #462 from mateiz/conf-file-fix Remove Typesafe Config usage and conf files to fix nested property names With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html This PR is for branch 0.9 but should be added into master too. (cherry picked from commit `34e911ce9a`) Signed-off-by: Patrick Wendell <pwendell@gmail.com>	2014-01-18 16:20:00 -08:00
Yinan Li	fd833e7ab1	Allow files added through SparkContext.addFile() to be overwritten This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. Signed-off-by: Yinan Li <liyinan926@gmail.com>	2014-01-18 15:26:59 -08:00
Patrick Wendell	5316bcac3c	Use renamed shuffle spill config in CoGroupedRDD.scala	2014-01-18 11:58:42 -08:00
Mridul Muralidharan	b690e11d9c	Address review comment	2014-01-17 18:28:55 +05:30
Patrick Wendell	d4fd89e3c8	Merge pull request #438 from ScrapCodes/clone-records-java-api Clone records java api	2014-01-16 23:17:30 -08:00
Prashant Sharma	fcb4fc653d	adding clone records field to equivaled java apis	2014-01-17 11:16:03 +05:30
Mridul Muralidharan	edd82c58a2	Use method, not variable	2014-01-16 17:26:42 +05:30
Mridul Muralidharan	1a0da89277	Address review comments	2014-01-16 17:23:25 +05:30
Reynold Xin	c06a307ca2	Merge pull request #445 from kayousterhout/exec_lost Fail rather than hanging if a task crashes the JVM. Prior to this commit, if a task crashes the JVM, the task (and all other tasks running on that executor) is marked at KILLED rather than FAILED. As a result, the TaskSetManager will retry the task indefinitely rather than failing the job after maxFailures. Eventually, this makes the job hang, because the Standalone Scheduler removes the application after 10 works have failed, and then the app is left in a state where it's disconnected from the master and waiting to reconnect. This commit fixes that problem by marking tasks as FAILED rather than killed when an executor is lost. The downside of this commit is that if task A fails because another task running on the same executor caused the VM to crash, the failure will incorrectly be counted as a failure of task A. This should not be an issue because we typically set maxFailures to 3, and it is unlikely that a task will be co-located with a JVM-crashing task multiple times.	2014-01-15 23:47:25 -08:00
Kay Ousterhout	718a13c179	Updated unit test comment	2014-01-15 23:46:14 -08:00
Kay Ousterhout	a268d63411	Fail rather than hanging if a task crashes the JVM. Prior to this commit, if a task crashes the JVM, the task (and all other tasks running on that executor) is marked at KILLED rather than FAILED. As a result, the TaskSetManager will retry the task indefiniteily rather than failing the job after maxFailures. This commit fixes that problem by marking tasks as FAILED rather than killed when an executor is lost. The downside of this commit is that if task A fails because another task running on the same executor caused the VM to crash, the failure will incorrectly be counted as a failure of task A. This should not be an issue because we typically set maxFailures to 3, and it is unlikely that a task will be co-located with a JVM-crashing task multiple times.	2014-01-15 16:03:40 -08:00
Patrick Wendell	59f475c79f	Merge pull request #442 from pwendell/standalone Workers should use working directory as spark home if it's not specified If users don't set SPARK_HOME in their environment file when launching an application, the standalone cluster should default to the spark home of the worker.	2014-01-15 13:55:14 -08:00
Patrick Wendell	00a3f7eec5	Workers should use working directory as spark home if it's not specified	2014-01-15 11:05:36 -08:00
Mridul Muralidharan	0aea33d39e	Expose method and class - so that we can use it from user code (particularly since checkpoint directory is autogenerated now	2014-01-15 12:44:44 +05:30
Tathagata Das	0e15bd7827	Merge remote-tracking branch 'apache/master' into filestream-fix	2014-01-14 22:21:20 -08:00

1 2 3 4 5 ...

3232 commits