ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yash Datta	9bc0df6804	SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions takeOrdered should skip reduce step in case mapped RDDs have no partitions. This prevents the mentioned exception : 4. run query SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100; Error trace java.lang.UnsupportedOperationException: empty collection at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.reduce(RDD.scala:863) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136) Author: Yash Datta <Yash.Datta@guavus.com> Closes #3830 from saucam/fix_takeorder and squashes the following commits: 5974d10 [Yash Datta] SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions	2014-12-29 13:49:45 -08:00
Burak Yavuz	02b55de3dc	[SPARK-4409][MLlib] Additional Linear Algebra Utils Addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - zeros: Generate a matrix consisting of zeros - ones: Generate a matrix consisting of ones - eye: Generate an identity matrix - rand: Generate a matrix consisting of i.i.d. uniform random numbers - randn: Generate a matrix consisting of i.i.d. gaussian random numbers - diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix. The names for these methods were selected from MATLAB Author: Burak Yavuz <brkyvz@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3319 from brkyvz/SPARK-4409 and squashes the following commits: b0354f6 [Burak Yavuz] [SPARK-4409] Incorporated mengxr's code 04c4829 [Burak Yavuz] Merge pull request #1 from mengxr/SPARK-4409 80cfa29 [Xiangrui Meng] minor changes ecc937a [Xiangrui Meng] update sprand 4e95e24 [Xiangrui Meng] simplify fromCOO implementation 10a63a6 [Burak Yavuz] [SPARK-4409] Fourth pass of code review f62d6c7 [Burak Yavuz] [SPARK-4409] Modified genRandMatrix 3971c93 [Burak Yavuz] [SPARK-4409] Third pass of code review 75239f8 [Burak Yavuz] [SPARK-4409] Second pass of code review e4bd0c0 [Burak Yavuz] [SPARK-4409] Modified horzcat and vertcat 65c562e [Burak Yavuz] [SPARK-4409] Hopefully fixed Java Test d8be7bc [Burak Yavuz] [SPARK-4409] Organized imports 065b531 [Burak Yavuz] [SPARK-4409] First pass after code review a8120d2 [Burak Yavuz] [SPARK-4409] Finished updates to API according to SPARK-4614 f798c82 [Burak Yavuz] [SPARK-4409] Updated API according to SPARK-4614 c75f3cd [Burak Yavuz] [SPARK-4409] Added JavaAPI Tests, and fixed a couple of bugs d662f9d [Burak Yavuz] [SPARK-4409] Modified according to remote repo 83dfe37 [Burak Yavuz] [SPARK-4409] Scalastyle error fixed a14c0da [Burak Yavuz] [SPARK-4409] Initial commit to add methods	2014-12-29 13:24:26 -08:00
Kousuke Saruta	8d72341ab7	[Minor] Fix a typo of type parameter in JavaUtils.scala In JavaUtils.scala, thare is a typo of type parameter. In addition, the type information is removed at the time of compile by erasure. This issue is really minor so I don't file in JIRA. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3789 from sarutak/fix-typo-in-javautils and squashes the following commits: e20193d [Kousuke Saruta] Fixed a typo of type parameter 82bc5d9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-typo-in-javautils 99f6f63 [Kousuke Saruta] Fixed a typo of type parameter in JavaUtils.scala	2014-12-29 12:05:08 -08:00
YanTangZhai	815de54002	[SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem Author: YanTangZhai <hakeemzhai@tencent.com> Author: yantangzhai <tyz0303@163.com> Closes #3785 from YanTangZhai/SPARK-4946 and squashes the following commits: 9ca6541 [yantangzhai] [SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem e4c2c0a [YanTangZhai] Merge pull request #15 from apache/master 718afeb [YanTangZhai] Merge pull request #12 from apache/master 6e643f8 [YanTangZhai] Merge pull request #11 from apache/master e249846 [YanTangZhai] Merge pull request #10 from apache/master d26d982 [YanTangZhai] Merge pull request #9 from apache/master 76d4027 [YanTangZhai] Merge pull request #8 from apache/master 03b62b0 [YanTangZhai] Merge pull request #7 from apache/master 8a00106 [YanTangZhai] Merge pull request #6 from apache/master cbcba66 [YanTangZhai] Merge pull request #3 from apache/master cdef539 [YanTangZhai] Merge pull request #1 from apache/master	2014-12-29 11:30:54 -08:00
Kousuke Saruta	4cef05e1c1	Adde LICENSE Header to build/mvn, build/sbt and sbt/sbt Recently, build/mvn and build/sbt are added, and sbt/sbt is changed but there are no license headers. Should we add license headers to the scripts right? If it's not right, please let me correct. This PR doesn't affect behavior of Spark, I don't file in JIRA. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3817 from sarutak/add-license-header and squashes the following commits: 1abc972 [Kousuke Saruta] Added LICENSE Header	2014-12-29 10:48:53 -08:00
wangxiaojing	6645e52580	[SPARK-4982][DOC] `spark.ui.retainedJobs` description is wrong in Spark UI configuration guide Author: wangxiaojing <u9jing@gmail.com> Closes #3818 from wangxiaojing/SPARK-4982 and squashes the following commits: fe2ad5f [wangxiaojing] change stages to jobs	2014-12-29 10:45:26 -08:00
meiyoula	14fa87bdf4	[SPARK-4966][YARN]The MemoryOverhead value is setted not correctly Author: meiyoula <1039320815@qq.com> Closes #3797 from XuTingjun/MemoryOverhead and squashes the following commits: 5a780fc [meiyoula] Update ClientArguments.scala	2014-12-29 08:20:30 -06:00
Brennon York	a3e51cc990	[SPARK-4501][Core] - Create build/mvn to automatically download maven/zinc/scalac Creates a top level directory script (as `build/mvn`) to automatically download zinc and the specific version of scala used to easily build spark. This will also download and install maven if the user doesn't already have it and all packages are hosted under the `build/` directory. Tested on both Linux and OSX OS's and both work. All commands pass through to the maven binary so it acts exactly as a traditional maven call would. Author: Brennon York <brennon.york@capitalone.com> Closes #3707 from brennonyork/SPARK-4501 and squashes the following commits: 0e5a0e4 [Brennon York] minor incorrect doc verbage (with -> this) 9b79e38 [Brennon York] fixed merge conflicts with dev/run-tests, properly quoted args in sbt/sbt, fixed bug where relative paths would fail if passed in from build/mvn d2d41b6 [Brennon York] added blurb about leverging zinc with build/mvn b979c58 [Brennon York] updated the merge conflict c5634de [Brennon York] updated documentation to overview build/mvn, updated all points where sbt/sbt was referenced with build/sbt b8437ba [Brennon York] set progress bars for curl and wget when not run on jenkins, no progress bar when run on jenkins, moved sbt script to build/sbt, wrote stub and warning under sbt/sbt which calls build/sbt, modified build/sbt to use the correct directory, fixed bug in build/sbt-launch-lib.bash to correctly pull the sbt version be11317 [Brennon York] added switch to silence download progress only if AMPLAB_JENKINS is set 28d0a99 [Brennon York] updated to remove the python dependency, uses grep instead 7e785a6 [Brennon York] added silent and quiet flags to curl and wget respectively, added single echo output to denote start of a download if download is needed 14a5da0 [Brennon York] removed unnecessary zinc output on startup 1af4a94 [Brennon York] fixed bug with uppercase vs lowercase variable 3e8b9b3 [Brennon York] updated to properly only restart zinc if it was freshly installed a680d12 [Brennon York] Added comments to functions and tested various mvn calls bb8cc9d [Brennon York] removed package files ef017e6 [Brennon York] removed OS complexities, setup generic install_app call, removed extra file complexities, removed help, removed forced install (defaults now), removed double-dash from cli 07bf018 [Brennon York] Updated to specifically handle pulling down the correct scala version f914dea [Brennon York] Beginning final portions of localized scala home 69c4e44 [Brennon York] working linux and osx installers for purely local mvn build 4a1609c [Brennon York] finalizing working linux install for maven to local ./build/apache-maven folder cbfcc68 [Brennon York] Changed the default sbt/sbt to build/sbt and added a build/mvn which will automatically download, install, and execute maven with zinc for easier build capability	2014-12-27 13:26:38 -08:00
GuoQiang Li	080ceb771a	[SPARK-4952][Core]Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails Author: GuoQiang Li <witgo@qq.com> Closes #3788 from witgo/SPARK-4952 and squashes the following commits: d903529 [GuoQiang Li] Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails	2014-12-26 23:31:29 -08:00
Zhang, Liye	786808abfd	[SPARK-4954][Core] add spark version infomation in log for standalone mode The master and worker spark version may be not the same with Driver spark version. That is because spark Jar file might be replaced for new application without restarting the spark cluster. So there shall log out the spark-version in both Mater and Worker log. Author: Zhang, Liye <liye.zhang@intel.com> Closes #3790 from liyezhang556520/version4Standalone and squashes the following commits: e05e1e3 [Zhang, Liye] add spark version infomation in log for standalone mode	2014-12-26 23:24:22 -08:00
Jongyoul Lee	2483c1efb6	[SPARK-3955] Different versions between jackson-mapper-asl and jackson-c... ...ore-asl - set the same version to jackson-mapper-asl and jackson-core-asl - It's related with #2818 - coded a same patch from a latest master Author: Jongyoul Lee <jongyoul@gmail.com> Closes #3716 from jongyoul/SPARK-3955 and squashes the following commits: efa29aa [Jongyoul Lee] [SPARK-3955] Different versions between jackson-mapper-asl and jackson-core-asl - set the same version to jackson-mapper-asl and jackson-core-asl	2014-12-26 22:59:34 -08:00
Patrick Wendell	82bf4bee15	HOTFIX: Slight tweak on previous commit. Meant to merge this in when committing SPARK-3787.	2014-12-26 22:55:04 -08:00
Kousuke Saruta	de95c57ac6	[SPARK-3787][BUILD] Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version This PR is another solution for When we build with sbt with profile for hadoop and without property for hadoop version like: sbt/sbt -Phadoop-2.2 assembly jar name is always used default version (1.0.4). When we build with maven with same condition for sbt, default version for each profile is used. For instance, if we build like: mvn -Phadoop-2.2 package jar name is used hadoop2.2.0 as a default version of hadoop-2.2. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3046 from sarutak/fix-assembly-jarname-2 and squashes the following commits: 41ef90e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname-2 50c8676 [Kousuke Saruta] Merge branch 'fix-assembly-jarname-2' of github.com:sarutak/spark into fix-assembly-jarname-2 52a1cd2 [Kousuke Saruta] Fixed comflicts dd30768 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname2 f1c90bb [Kousuke Saruta] Fixed SparkBuild.scala in order to read `hadoop.version` property from pom.xml af6b100 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname c81806b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname ad1f96e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname b2318eb [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname 5fc1259 [Kousuke Saruta] Fixed typo. eebbb7d [Kousuke Saruta] Fixed wrong jar name	2014-12-26 22:52:04 -08:00
Patrick Wendell	534f24b2d0	MAINTENANCE: Automated closing of pull requests. This commit exists to close the following pull requests on Github: Closes #3456 (close requested by 'pwendell') Closes #1602 (close requested by 'tdas') Closes #2633 (close requested by 'tdas') Closes #2059 (close requested by 'JoshRosen') Closes #2348 (close requested by 'tdas') Closes #3662 (close requested by 'tdas') Closes #2031 (close requested by 'andrewor14') Closes #265 (close requested by 'JoshRosen')	2014-12-26 22:39:56 -08:00
CodingCat	fda4331d58	SPARK-4971: Fix typo in BlockGenerator comment Author: CodingCat <zhunansjtu@gmail.com> Closes #3807 from CodingCat/new_branch and squashes the following commits: 5167f01 [CodingCat] fix typo in the comment	2014-12-26 12:04:46 -08:00
zsxwing	f9ed2b6641	[SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience There is only one implicit function `toPairDStreamFunctions` in `StreamingContext`. This PR did similar reorganization like [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397). Compiled the following codes with Spark Streaming 1.1.0 and ran it with this PR. Everything is fine. ```Scala import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ object StreamingApp { def main(args: Array[String]) { val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount") val ssc = new StreamingContext(conf, Seconds(10)) val lines = ssc.textFileStream("/some/path") val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() } } ``` Author: zsxwing <zsxwing@gmail.com> Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits: aa6d44a [zsxwing] Fix a copy-paste error f74c190 [zsxwing] Merge branch 'master' into SPARK-4608 e6f9cc9 [zsxwing] Update the docs 27833bb [zsxwing] Remove `import StreamingContext._` c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience	2014-12-25 19:46:05 -08:00
jerryshao	f205fe477c	[SPARK-4537][Streaming] Expand StreamingSource to add more metrics Add `processingDelay`, `schedulingDelay` and `totalDelay` for the last completed batch. Add `lastReceivedBatchRecords` and `totalReceivedBatchRecords` to the received records counting. Author: jerryshao <saisai.shao@intel.com> Closes #3466 from jerryshao/SPARK-4537 and squashes the following commits: 00f5f7f [jerryshao] Change the code style and add totalProcessedRecords 44721a6 [jerryshao] Further address the comments c097ddc [jerryshao] Address the comments 02dd44f [jerryshao] Fix the addressed comments c7a9376 [jerryshao] Expand StreamingSource to add more metrics	2014-12-25 19:39:49 -08:00
Nicholas Chammas	ac8278593e	[EC2] Update mesos/spark-ec2 branch to branch-1.3 Going forward, we'll use matching branch names across the mesos/spark-ec2 and apache/spark repositories, per [the discussion here](https://github.com/mesos/spark-ec2/pull/85#issuecomment-68069589). Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3804 from nchammas/patch-2 and squashes the following commits: cd2c0d4 [Nicholas Chammas] [EC2] Update mesos/spark-ec2 branch to branch-1.3	2014-12-25 14:16:50 -08:00
Nicholas Chammas	b6b6393b47	[EC2] Update default Spark version to 1.2.0 Now that 1.2.0 is out, let's update the default Spark version. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3793 from nchammas/patch-1 and squashes the following commits: 3255832 [Nicholas Chammas] add 1.2.0 version to Spark-Shark map ec0e904 [Nicholas Chammas] [EC2] Update default Spark version to 1.2.0	2014-12-25 14:13:53 -08:00
Denny Lee	08b18c7eb7	Fix "Building Spark With Maven" link in README.md Corrected link to the Building Spark with Maven page from its original (http://spark.apache.org/docs/latest/building-with-maven.html) to the current page (http://spark.apache.org/docs/latest/building-spark.html) Author: Denny Lee <denny.g.lee@gmail.com> Closes #3802 from dennyglee/patch-1 and squashes the following commits: 15f601a [Denny Lee] Update README.md	2014-12-25 14:06:01 -08:00
Kousuke Saruta	11dd99317b	[SPARK-4953][Doc] Fix the description of building Spark with YARN At the section "Specifying the Hadoop Version" In building-spark.md, there is description about building with YARN with Hadoop 0.23. Spark 1.3.0 will not support Hadoop 0.23 so we should fix the description. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3787 from sarutak/SPARK-4953 and squashes the following commits: ee9c355 [Kousuke Saruta] Removed description related to a specific vendor 9ab0c24 [Kousuke Saruta] Fix the description about building SPARK with YARN	2014-12-25 07:05:43 -08:00
zsxwing	b4d0db80a0	[SPARK-4873][Streaming] Use `Future.zip` instead of `Future.flatMap`(for-loop) in WriteAheadLogBasedBlockHandler Use `Future.zip` instead of `Future.flatMap`(for-loop). `zip` implies these two Futures will run concurrently, while `flatMap` usually means one Future depends on the other one. Author: zsxwing <zsxwing@gmail.com> Closes #3721 from zsxwing/SPARK-4873 and squashes the following commits: 46a2cd9 [zsxwing] Use Future.zip instead of Future.flatMap(for-loop)	2014-12-24 19:49:41 -08:00
Sean Owen	29fabb1b52	SPARK-4297 [BUILD] Build warning fixes omnibus There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now. Author: Sean Owen <sowen@cloudera.com> Closes #3157 from srowen/SPARK-4297 and squashes the following commits: 8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes	2014-12-24 13:32:51 -08:00
Kousuke Saruta	199e59aacd	[SPARK-4881][Minor] Use SparkConf#getBoolean instead of get().toBoolean It's really a minor issue. In ApplicationMaster, there is code like as follows. val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean I think, the code can be simplified like as follows. val preserveFiles = sparkConf.getBoolean("spark.yarn.preserve.staging.files", false) Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3733 from sarutak/SPARK-4881 and squashes the following commits: 1771430 [Kousuke Saruta] Modified the code like sparkConf.get(...).toBoolean to sparkConf.getBoolean(...) c63daa0 [Kousuke Saruta] Simplified code	2014-12-23 19:14:34 -08:00
jbencook	fd41eb9574	[SPARK-4860][pyspark][sql] speeding up `sample()` and `takeSample()` This PR modifies the python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower python implementations from `rdd.py`. This is worthwhile because the `Row`'s are already serialized as Java objects. In order to use the faster `takeSample()`, a `takeSampleToPython()` method was implemented in `SchemaRDD.scala` following the pattern of `collectToPython()`. Author: jbencook <jbenjamincook@gmail.com> Author: J. Benjamin Cook <jbenjamincook@gmail.com> Closes #3764 from jbencook/master and squashes the following commits: 6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments 5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD 020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`	2014-12-23 17:46:24 -08:00
Marcelo Vanzin	7e2deb71c4	[SPARK-4606] Send EOF to child JVM when there's no more data to read. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3460 from vanzin/SPARK-4606 and squashes the following commits: 031207d [Marcelo Vanzin] [SPARK-4606] Send EOF to child JVM when there's no more data to read.	2014-12-23 16:07:59 -08:00
jerryshao	3f5f4cc4e7	[SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled Currently streaming block will be replicated when specific storage level is set, since WAL is already fault tolerant, so replication is needless and will hurt the throughput of streaming application. Hi tdas , as per discussed about this issue, I fixed with this implementation, I'm not is this the way you want, would you mind taking a look at it? Thanks a lot. Author: jerryshao <saisai.shao@intel.com> Closes #3534 from jerryshao/SPARK-4671 and squashes the following commits: 500b456 [jerryshao] Do not replicate streaming block when WAL is enabled	2014-12-23 15:45:53 -08:00
Ilayaperumal Gopinathan	10d69e9cbf	[SPARK-4802] [streaming] Remove receiverInfo once receiver is de-registered Once the streaming receiver is de-registered at executor, the `ReceiverTrackerActor` needs to remove the corresponding reveiverInfo from the `receiverInfo` map at `ReceiverTracker`. Author: Ilayaperumal Gopinathan <igopinathan@pivotal.io> Closes #3647 from ilayaperumalg/receiverInfo-RTracker and squashes the following commits: 6eb97d5 [Ilayaperumal Gopinathan] Polishing based on the review 3640c86 [Ilayaperumal Gopinathan] Remove receiverInfo once receiver is de-registered	2014-12-23 15:14:54 -08:00
Liang-Chi Hsieh	96281cd0c3	[SPARK-4913] Fix incorrect event log path SPARK-2261 uses a single file to log events for an app. `eventLogDir` in `ApplicationDescription` is replaced with `eventLogFile`. However, `ApplicationDescription` in `SparkDeploySchedulerBackend` is initialized with `SparkContext`'s `eventLogDir`. It is just the log directory, not the actual log file path. `Master.rebuildSparkUI` can not correctly rebuild a new SparkUI for the app. Because the `ApplicationDescription` is remotely registered with `Master` and the app's id is then generated in `Master`, we can not get the app id in advance before registration. So the received description needs to be modified with correct `eventLogFile` value. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3755 from viirya/fix_app_logdir and squashes the following commits: 5e0ea35 [Liang-Chi Hsieh] Revision for comment. b5730a1 [Liang-Chi Hsieh] Fix incorrect event log path. Closes #3777 (a duplicate PR for the same JIRA)	2014-12-23 14:58:44 -08:00
Andrew Or	27c5399f4d	[SPARK-4730][YARN] Warn against deprecated YARN settings See https://issues.apache.org/jira/browse/SPARK-4730. Author: Andrew Or <andrew@databricks.com> Closes #3590 from andrewor14/yarn-settings and squashes the following commits: 36e0753 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-settings dcd1316 [Andrew Or] Warn against deprecated YARN settings	2014-12-23 14:28:36 -08:00
Cheng Lian	395b771fee	[SPARK-4914][Build] Cleans lib_managed before compiling with Hive 0.13.1 This PR tries to fix the Hive tests failure encountered in PR #3157 by cleaning `lib_managed` before building assembly jar against Hive 0.13.1 in `dev/run-tests`. Otherwise two sets of datanucleus jars would be left in `lib_managed` and may mess up class paths while executing Hive test suites. Please refer to [this thread] [1] for details. A clean build would be even safer, but we only clean `lib_managed` here to save build time. This PR also takes the chance to clean up some minor typos and formatting issues in the comments. [1]: https://github.com/apache/spark/pull/3157#issuecomment-67656488 <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3756) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3756 from liancheng/clean-lib-managed and squashes the following commits: e2bd21d [Cheng Lian] Adds lib_managed to clean set c9f2f3e [Cheng Lian] Cleans lib_managed before compiling with Hive 0.13.1	2014-12-23 12:54:20 -08:00
Takeshi Yamamuro	9c251c555f	[SPARK-4932] Add help comments in Analytics Trivial modifications for usability. Author: Takeshi Yamamuro <linguin.m.s@gmail.com> Closes #3775 from maropu/AddHelpCommentInAnalytics and squashes the following commits: fbea8f5 [Takeshi Yamamuro] Add help comments in Analytics	2014-12-23 12:39:41 -08:00
Marcelo Vanzin	dd155369a0	[SPARK-4834] [standalone] Clean up application files after app finishes. Commit `7aacb7bfa` added support for sharing downloaded files among multiple executors of the same app. That works great in Yarn, since the app's directory is cleaned up after the app is done. But Spark standalone mode didn't do that, so the lock/cache files created by that change were left around and could eventually fill up the disk hosting /tmp. To solve that, create app-specific directories under the local dirs when launching executors. Multiple executors launched by the same Worker will use the same app directories, so they should be able to share the downloaded files. When the application finishes, a new message is sent to all workers telling them the application has finished; once that message has been received, and all executors registered for the application shut down, then those directories will be cleaned up by the Worker. Note: Unit testing this is hard (if even possible), since local-cluster mode doesn't seem to leave the Master/Worker daemons running long enough after `sc.stop()` is called for the clean up protocol to take effect. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3705 from vanzin/SPARK-4834 and squashes the following commits: b430534 [Marcelo Vanzin] Remove seemingly unnecessary synchronization. 50eb4b9 [Marcelo Vanzin] Review feedback. c0e5ea5 [Marcelo Vanzin] [SPARK-4834] [standalone] Clean up application files after app finishes.	2014-12-23 12:02:08 -08:00
zsxwing	2d215aebaa	[SPARK-4931][Yarn][Docs] Fix the format of running-on-yarn.md Currently, the format about log4j in running-on-yarn.md is a bit messy. ![running-on-yarn](https://cloud.githubusercontent.com/assets/1000778/5535248/204c4b64-8ab4-11e4-83c3-b4722ea0ad9d.png) Author: zsxwing <zsxwing@gmail.com> Closes #3774 from zsxwing/SPARK-4931 and squashes the following commits: 4a5f853 [zsxwing] Fix the format of running-on-yarn.md	2014-12-23 11:18:06 -08:00
Nicholas Chammas	2823c7f021	[SPARK-4890] Ignore downloaded EC2 libs PR #3737 changed `spark-ec2` to automatically download boto from PyPI. This PR tell git to ignore those downloaded library files. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3770 from nchammas/ignore-ec2-lib and squashes the following commits: 5c440d3 [Nicholas Chammas] gitignore downloaded EC2 libs	2014-12-23 11:12:16 -08:00
Nicholas Chammas	0e532ccb2b	[Docs] Minor typo fixes Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3772 from nchammas/patch-1 and squashes the following commits: b7d9083 [Nicholas Chammas] [Docs] Minor typo fixes	2014-12-22 22:54:32 -08:00
DB Tsai	a96b72781a	[SPARK-4907][MLlib] Inconsistent loss and gradient in LeastSquaresGradient compared with R In most of the academic paper and algorithm implementations, people use L = 1/2n \|\|A weights-y\|\|^2 instead of L = 1/n \|\|A weights-y\|\|^2 for least-squared loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses different convention, this will result different residuals and all the stats properties will be different from GLMNET package in R. The model coefficients will be still the same under this change. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3746 from dbtsai/lir and squashes the following commits: 19c2e85 [DB Tsai] make stepsize twice to converge to the same solution 0b2c29c [DB Tsai] first commit	2014-12-22 16:42:55 -08:00
zsxwing	c233ab3d8d	[SPARK-4818][Core] Add 'iterator' to reduce memory consumed by join In Scala, `map` and `flatMap` of `Iterable` will copy the contents of `Iterable` to a new `Seq`. Such as, ```Scala val iterable = Seq(1, 2, 3).map(v => { println(v) v }) println("Iterable map done") val iterator = Seq(1, 2, 3).iterator.map(v => { println(v) v }) println("Iterator map done") ``` outputed ``` 1 2 3 Iterable map done Iterator map done ``` So we should use 'iterator' to reduce memory consumed by join. Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E Author: zsxwing <zsxwing@gmail.com> Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits: 48ee7b9 [zsxwing] Remove the explicit types 95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join	2014-12-22 14:26:28 -08:00
genmao.ygm	de9d7d2b5b	[SPARK-4920][UI]:current spark version in UI is not striking. It is not convenient to see the Spark version. We can keep the same style with Spark website. ![spark_version](https://cloud.githubusercontent.com/assets/7402327/5527025/1c8c721c-8a35-11e4-8d6a-2734f3c6bdf8.jpg) Author: genmao.ygm <genmao.ygm@alibaba-inc.com> Closes #3763 from uncleGen/master-clean-141222 and squashes the following commits: 0dcb9a9 [genmao.ygm] [SPARK-4920][UI]:current spark version in UI is not striking.	2014-12-22 14:14:39 -08:00
Liang-Chi Hsieh	a61aa669af	[Minor] Fix scala doc Minor fix for an obvious scala doc error. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3751 from viirya/fix_scaladoc and squashes the following commits: 03fddaa [Liang-Chi Hsieh] Fix scala doc.	2014-12-22 14:13:31 -08:00
Aaron Davidson	fbca6b6ce2	[SPARK-4864] Add documentation to Netty-based configs Author: Aaron Davidson <aaron@databricks.com> Closes #3713 from aarondav/netty-configs and squashes the following commits: 8a8b373 [Aaron Davidson] Address Patrick's comments 3b1f84e [Aaron Davidson] [SPARK-4864] Add documentation to Netty-based configs	2014-12-22 13:09:22 -08:00
Kostas Sakellis	7c0ed13d29	[SPARK-4079] [CORE] Consolidates Errors if a CompressionCodec is not available This commit consolidates some of the exceptions thrown if compression codecs are not available. If a bad configuration string was passed in, a ClassNotFoundException was through. Also, if Snappy was not available, it would throw an InvocationTargetException when the codec was being used (not when it was being initialized). Now, an IllegalArgumentException is thrown when a codec is not available at creation time - either because the class does not exist or the codec itself is not available in the system. This will allow us to have a better message and fail faster. Author: Kostas Sakellis <kostas@cloudera.com> Closes #3119 from ksakellis/kostas-spark-4079 and squashes the following commits: 9709c7c [Kostas Sakellis] Removed unnecessary Logging class 63bfdd0 [Kostas Sakellis] Removed isAvailable to preserve binary compatibility 1d0ef2f [Kostas Sakellis] [SPARK-4079] [CORE] Added more information to exception 64f3d27 [Kostas Sakellis] [SPARK-4079] [CORE] Code review feedback 52dfa8f [Kostas Sakellis] [SPARK-4079] [CORE] Default to LZF if Snappy not available	2014-12-22 13:07:01 -08:00
Sandy Ryza	d62da642ac	SPARK-4447. Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Author: Sandy Ryza <sandy@cloudera.com> Closes #3652 from sryza/sandy-spark-4447 and squashes the following commits: 2791158 [Sandy Ryza] Review feedback c23507b [Sandy Ryza] Strip margin from client arguments help string 18be7ba [Sandy Ryza] SPARK-4447	2014-12-22 12:23:43 -08:00
Takeshi Yamamuro	fb8e85e80e	[SPARK-4733] Add missing prameter comments in ShuffleDependency Add missing Javadoc comments in ShuffleDependency. Author: Takeshi Yamamuro <linguin.m.s@gmail.com> Closes #3594 from maropu/DependencyJavadocFix and squashes the following commits: 32129b4 [Takeshi Yamamuro] Fix comments in @aggregator and @mapSideCombine 303c75d [Takeshi Yamamuro] [SPARK-4733] Add missing prameter comments in ShuffleDependency	2014-12-22 12:19:23 -08:00
carlmartin	1d9788e42e	[Minor] Improve some code in BroadcastTest for short Using val arr1 = (0 until num).toArray instead of val arr1 = new Array[Int](num) for (i <- 0 until arr1.length) { arr1(i) = i } for short. Author: carlmartin <carlmartinmax@gmail.com> Closes #3750 from SaintBacchus/BroadcastTest and squashes the following commits: 43adb70 [carlmartin] Improve some code in BroadcastTest for short	2014-12-22 12:13:53 -08:00
zsxwing	8773705fd4	[SPARK-4883][Shuffle] Add a name to the directoryCleaner thread Author: zsxwing <zsxwing@gmail.com> Closes #3734 from zsxwing/SPARK-4883 and squashes the following commits: e6f2b61 [zsxwing] Fix the name cc74727 [zsxwing] Add a name to the directoryCleaner thread	2014-12-22 12:11:36 -08:00
Zhang, Liye	39272c8cdb	[SPARK-4870] Add spark version to driver log Author: Zhang, Liye <liye.zhang@intel.com> Closes #3717 from liyezhang556520/version2Log and squashes the following commits: ccd30d7 [Zhang, Liye] delete log in sparkConf 330f70c [Zhang, Liye] move the log from SaprkConf to SparkContext 96dc115 [Zhang, Liye] remove curly brace e833330 [Zhang, Liye] add spark version to driver log	2014-12-22 11:38:28 -08:00
Tsuyoshi Ozawa	96606f69b7	[SPARK-4915][YARN] Fix classname to be specified for external shuffle service. Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@lab.ntt.co.jp> Closes #3757 from oza/SPARK-4915 and squashes the following commits: 3b0d6d6 [Tsuyoshi Ozawa] Fix classname to be specified for external shuffle service.	2014-12-22 11:28:05 -08:00
zsxwing	93b2f3a882	[SPARK-4918][Core] Reuse Text in saveAsTextFile Reuse Text in saveAsTextFile to reduce GC. /cc rxin Author: zsxwing <zsxwing@gmail.com> Closes #3762 from zsxwing/SPARK-4918 and squashes the following commits: 59f03eb [zsxwing] Reuse Text in saveAsTextFile	2014-12-22 11:20:00 -08:00
zsxwing	6ee6aa70b7	[SPARK-2075][Core] Make the compiler generate same bytes code for Hadoop 1.+ and Hadoop 2.+ `NullWritable` is a `Comparable` rather than `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. It will generate different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, here we provide an Ordering for NullWritable so that the compiler will generate same codes. I used the following commands to confirm the generated byte codes are some. ``` mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt diff ~/hadoop1.txt ~/hadoop2.txt ``` However, the compiler will generate different codes for the classes which call methods of `JobContext/TaskAttemptContext`. `JobContext/TaskAttemptContext` is a class in Hadoop 1.+, and calling its method will use `invokevirtual`, while it's an interface in Hadoop 2.+, and will use `invokeinterface`. To fix it, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`. Author: zsxwing <zsxwing@gmail.com> Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits: 39d9df2 [zsxwing] Fix the code style e4ad8b5 [zsxwing] Use null for the implicit Ordering 734bac9 [zsxwing] Explicitly set the implicit parameters ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD	2014-12-21 22:10:19 -08:00

1 2 3 4 5 ...

9225 commits