ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Takuya UESHIN	7c160293d6	[SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits: e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.	2014-06-05 11:27:33 -07:00
Marcelo Vanzin	b77c19be05	Fix issue in ReplSuite with hadoop-provided profile. When building the assembly with the maven "hadoop-provided" profile, the executors were failing to come up because Hadoop classes were not found in the classpath anymore; so add them explicitly to the classpath using spark.executor.extraClassPath. This is only needed for the local-cluster mode, but doesn't affect other tests, so it's added for all of them to keep the code simpler. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #781 from vanzin/repl-test-fix and squashes the following commits: 4f0a3b0 [Marcelo Vanzin] Fix issue in ReplSuite with hadoop-provided profile.	2014-06-04 22:56:49 -07:00
Ankur Dave	abea2d4ff0	Minor: Fix documentation error from apache/spark#946 Author: Ankur Dave <ankurdave@gmail.com> Closes #970 from ankurdave/SPARK-1991_docfix and squashes the following commits: 6d07343 [Ankur Dave] Minor: Fix documentation error from apache/spark#946	2014-06-04 16:45:53 -07:00
Varakhedi Sujeet	11ded3f66f	SPARK-1790: Update EC2 scripts to support r3 instance types Author: Varakhedi Sujeet <svarakhedi@gopivotal.com> Closes #960 from sujeetv/ec2-r3 and squashes the following commits: 3cb9fd5 [Varakhedi Sujeet] SPARK-1790: Update EC2 scripts to support r3 instance	2014-06-04 16:02:23 -07:00
Colin McCabe	1765c8d0dd	SPARK-1518: FileLogger: Fix compile against Hadoop trunk In Hadoop trunk (currently Hadoop 3.0.0), the deprecated FSDataOutputStream#sync() method has been removed. Instead, we should call FSDataOutputStream#hflush, which does the same thing as the deprecated method used to do. Author: Colin McCabe <cmccabe@cloudera.com> Closes #898 from cmccabe/SPARK-1518 and squashes the following commits: 752b9d7 [Colin McCabe] FileLogger: Fix compile against Hadoop trunk	2014-06-04 15:56:29 -07:00
Xiangrui Meng	189df165bb	[SPARK-1752][MLLIB] Standardize text format for vectors and labeled points We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following: 1. dense vector: `[v0,v1,..]` 2. sparse vector: `(size,[i0,i1],[v0,v1])` 3. labeled point: `(label,vector)` where "(..)" indicates a tuple and "[...]" indicate an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically. `MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`. CC: @mateiz, @srowen Author: Xiangrui Meng <meng@databricks.com> Closes #685 from mengxr/labeled-io and squashes the following commits: 2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1 297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io 56746ea [Xiangrui Meng] replace # by . 623a5f0 [Xiangrui Meng] merge master f06d5ba [Xiangrui Meng] add docs and minor updates 640fe0c [Xiangrui Meng] throw SparkException 5bcfbc4 [Xiangrui Meng] update test to add scientific notations e86bf38 [Xiangrui Meng] remove NumericTokenizer 050fca4 [Xiangrui Meng] use StringTokenizer 6155b75 [Xiangrui Meng] merge master f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests aea4ae3 [Xiangrui Meng] minor updates 810d6df [Xiangrui Meng] update tokenizer/parser implementation 7aac03a [Xiangrui Meng] remove Scala parsers c1885c1 [Xiangrui Meng] add headers and minor changes b0c50cb [Xiangrui Meng] add customized parser d731817 [Xiangrui Meng] style update 63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors 5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors 7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__ e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData 9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints 19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint	2014-06-04 12:56:56 -07:00
Sean Owen	d341b17c2a	SPARK-1973. Add randomSplit to JavaRDD (with tests, and tidy Java tests) I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method? and that sometimes users should just use JavaRDD.wrapRDD()?) Along the way, I added tests for it, and also touched up the Java API test style and behavior. This is maybe the more useful part of this small change. Author: Sean Owen <sowen@cloudera.com> Author: Xiangrui Meng <meng@databricks.com> This patch had conflicts when merged, resolved by Committer: Xiangrui Meng <meng@databricks.com> Closes #919 from srowen/SPARK-1973 and squashes the following commits: 148cb7b [Sean Owen] Some final Java test polish, while we are at it 1fc3f3e [Xiangrui Meng] more cleaning on Java 8 tests 9ebc57f [Sean Owen] Use accumulator instead of temp files to test foreach 5efb0be [Sean Owen] Add Java randomSplit, and unit tests (including for sample) 5dcc158 [Sean Owen] Simplified Java 8 test with new language features, and fixed the name of MLB's greatest team 91a1769 [Sean Owen] Touch up minor style issues in existing Java API suite test	2014-06-04 11:27:08 -07:00
Neville Li	b8d2580039	[MLLIB] set RDD names in ALS This is very useful when debugging & fine tuning jobs with large data sets. Author: Neville Li <neville@spotify.com> Closes #966 from nevillelyh/master and squashes the following commits: 6747764 [Neville Li] [MLLIB] use string interpolation for RDD names 3b15d34 [Neville Li] [MLLIB] set RDD names in ALS	2014-06-04 01:51:34 -07:00
Kan Zhang	c402a4a685	[SPARK-1817] RDD.zip() should verify partition sizes for each partition RDD.zip() will throw an exception if it finds partition sizes are not the same. Author: Kan Zhang <kzhang@apache.org> Closes #944 from kanzhang/SPARK-1817 and squashes the following commits: c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates 524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition	2014-06-03 22:47:18 -07:00
Sean Owen	4ca0625669	SPARK-1806 (addendum) Use non-deprecated methods in Mesos 0.18 The update to Mesos 0.18 caused some deprecation warnings in the build. The change to the non-deprecated version is straightforward as it emulates what the Mesos driver does with the deprecated method anyway (`c5aa1dd221/src/sched/sched.cpp (L1354)`) Author: Sean Owen <sowen@cloudera.com> Closes #920 from srowen/SPARK-1806 and squashes the following commits: 8d76b6a [Sean Owen] Use non-deprecated methods in Mesos 0.18	2014-06-03 22:37:20 -07:00
Aaron Davidson	ab7c62d573	Update spark-ec2 scripts for 1.0.0 on master The change was previously committed only to branch-1.0 as part of `a34e6fda1d` Author: Aaron Davidson <aaron@databricks.com> This patch had conflicts when merged, resolved by Committer: Patrick Wendell <pwendell@gmail.com> Closes #938 from aarondav/sparkec2 and squashes the following commits: 067cc31 [Aaron Davidson] Update spark-ec2 scripts for 1.0.0 on master	2014-06-03 22:33:04 -07:00
Joseph E. Gonzalez	5284ca78d1	Enable repartitioning of graph over different number of partitions It is currently very difficult to repartition a graph over a different number of partitions. This PR adds an additional `partitionBy` function that takes the number of partitions. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #719 from jegonzal/graph_partitioning_options and squashes the following commits: 730b405 [Joseph E. Gonzalez] adding an additional number of partitions option to partitionBy	2014-06-03 20:49:14 -07:00
Xiangrui Meng	e8d93ee528	use env default python in merge_spark_pr.py A minor change to use env default python instead of fixed `/usr/bin/python`. Author: Xiangrui Meng <meng@databricks.com> Closes #965 from mengxr/merge-pr-python and squashes the following commits: 1ae0013 [Xiangrui Meng] use env default python in merge_spark_pr.py	2014-06-03 18:53:13 -07:00
Reynold Xin	1faef149f7	SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. I also corrected some errors made in the previous HLL count approximate API, including relativeSD wasn't really a measure for error (and we used it to test error bounds in test results). Author: Reynold Xin <rxin@apache.org> Closes #897 from rxin/hll and squashes the following commits: 4d83f41 [Reynold Xin] New error bound and non-randomness. f154ea0 [Reynold Xin] Added a comment on the value bound for testing. e367527 [Reynold Xin] One more round of code review. 41e649a [Reynold Xin] Update final mima list. 9e320c8 [Reynold Xin] Incorporate code review feedback. e110d70 [Reynold Xin] Merge branch 'master' into hll 354deb8 [Reynold Xin] Added comment on the Mima exclude rules. acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes. 6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes. 1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check. 9221b27 [Reynold Xin] Merge branch 'master' into hll 88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility. 1294be6 [Reynold Xin] Updated HLL+. e7786cb [Reynold Xin] Merge branch 'master' into hll c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.	2014-06-03 18:37:40 -07:00
Kan Zhang	21e40ed88b	[SPARK-1161] Add saveAsPickleFile and SparkContext.pickleFile in Python Author: Kan Zhang <kzhang@apache.org> Closes #755 from kanzhang/SPARK-1161 and squashes the following commits: 24ed8a2 [Kan Zhang] [SPARK-1161] Fixing doc tests 44e0615 [Kan Zhang] [SPARK-1161] Adding an optional batchSize with default value 10 d929429 [Kan Zhang] [SPARK-1161] Add saveAsObjectFile and SparkContext.objectFile in Python	2014-06-03 18:18:25 -07:00
DB Tsai	f4dd665c85	Fixed a typo in RowMatrix.scala Author: DB Tsai <dbtsai@dbtsai.com> Closes #959 from dbtsai/dbtsai-typo and squashes the following commits: fab0e0e [DB Tsai] Fixed typo	2014-06-03 18:10:58 -07:00
Ankur Dave	b1feb60209	[SPARK-1991] Support custom storage levels for vertices and edges This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed. The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels. In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods. I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed. Author: Ankur Dave <ankurdave@gmail.com> Closes #946 from ankurdave/SPARK-1991 and squashes the following commits: ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0 c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks" 34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks 6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges	2014-06-03 14:54:26 -07:00
Joseph E. Gonzalez	894ecde04f	Synthetic GraphX Benchmark This PR accomplishes two things: 1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile GraphX system on arbitrary clusters without access to large graph datasets 2. This PR improves the implementation of the log-normal graph generator. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Author: Ankur Dave <ankurdave@gmail.com> Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits: e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0 bccccad [Ankur Dave] Fix long lines 374678a [Ankur Dave] Bugfix and style changes 1bdf39a [Joseph E. Gonzalez] updating options d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder. f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.	2014-06-03 14:14:48 -07:00
baishuo(白硕)	aa41a522d8	fix java.lang.ClassCastException Running bin/run-example org.apache.spark.examples.sql.RDDRelation throws an exception. The exception's detail is: Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145) at org.apache.spark.examples.sql.RDDRelation$.main(RDDRelation.scala:49) at org.apache.spark.examples.sql.RDDRelation.main(RDDRelation.scala) Change sql("SELECT COUNT(*) FROM records").collect().head.getInt(0) to sql("SELECT COUNT(*) FROM records").collect().head.getLong(0), and the exception no longer occurs. Author: baishuo(白硕) <vc_java@hotmail.com> Closes #949 from baishuo/master and squashes the following commits: f4b319f [baishuo(白硕)] fix java.lang.ClassCastException	2014-06-03 13:39:47 -07:00
Erik Selin	8edc9d0330	[SPARK-1468] Modify the partition function used by partitionBy. Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes. Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468 Author: Erik Selin <erik.selin@jadedpixel.com> Closes #371 from tyro89/consistent_hashing and squashes the following commits: 201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.	2014-06-03 13:31:16 -07:00
tzolov	b1f285359a	Add support for Pivotal HD in the Maven build: SPARK-1992 Allow Spark to build against particular Pivotal HD distributions. For example to build Spark against Pivotal HD 2.0.1 one can run: ``` mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package ``` Author: tzolov <christian.tzolov@gmail.com> Closes #942 from tzolov/master and squashes the following commits: bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992]	2014-06-03 13:26:29 -07:00
Wenchen Fan(Cloud)	45e9bc85db	[SPARK-1912] fix compress memory issue during reduce When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block. Say a reducer task needs to read 1000 local shuffle blocks: it first prepares to read those 1000 blocks, which means creating 1000 compression stream instances to wrap them. But initializing a compression instance allocates some memory, and having many compression instances alive at the same time is a problem. The reducer actually reads the shuffle blocks one by one, so we can initialize the compression instances lazily. Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com> Closes #860 from cloud-fan/fix-compress and squashes the following commits: 0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator' 07f32c2 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class 2c8adb2 [Wenchen Fan(Cloud)] add inline comment 8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce	2014-06-03 13:18:20 -07:00
Henry Saputra	6c044ed100	SPARK-2001 : Remove docs/spark-debugger.md from master Per discussion in dev list: " Seemed like the spark-debugger.md is no longer accurate (see http://spark.apache.org/docs/latest/spark-debugger.html) and since it was originally written Spark has evolved that makes the doc obsolete. There are already work pending for new replay debugging (I could not find the PR links for it) so I With version control we could always reinstate the old doc if needed, but as of today the doc is no longer reflect the current state of Spark's RDD. " Author: Henry Saputra <henry.saputra@gmail.com> Closes #953 from hsaputra/SPARK-2001-hsaputra and squashes the following commits: dc324aa [Henry Saputra] SPARK-2001 : Remove docs/spark-debugger.md from master since it is obsolete	2014-06-03 13:03:51 -07:00
Syed Hashmi	7782a304ad	[SPARK-1942] Stop clearing spark.driver.port in unit tests stop resetting spark.driver.port in unit tests (scala, java and python). Author: Syed Hashmi <shashmi@cloudera.com> Author: CodingCat <zhunansjtu@gmail.com> Closes #943 from syedhashmi/master and squashes the following commits: 885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool) b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master' b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner" 57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner" 1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests 4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread" fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner 6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread 4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner	2014-06-03 12:04:47 -07:00
Cheng Lian	862283e9cc	Avoid dynamic dispatching when unwrapping Hive data. This is a follow up of PR #758. The `unwrapHiveData` function is now composed statically before actual rows are scanned according to the field object inspector to avoid dynamic dispatching cost. According to the same micro benchmark used in PR #758, this simple change brings slight performance boost: 2.5% for CSV table and 1% for RCFile table. ``` Optimized version: CSV: 6870 ms, RCFile: 5687 ms CSV: 6832 ms, RCFile: 5800 ms CSV: 6822 ms, RCFile: 5679 ms CSV: 6704 ms, RCFile: 5758 ms CSV: 6819 ms, RCFile: 5725 ms Original version: CSV: 7042 ms, RCFile: 5667 ms CSV: 6883 ms, RCFile: 5703 ms CSV: 7115 ms, RCFile: 5665 ms CSV: 7020 ms, RCFile: 5981 ms CSV: 6871 ms, RCFile: 5906 ms ``` Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #935 from liancheng/staticUnwrapping and squashes the following commits: c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.	2014-06-02 19:20:23 -07:00
egraldlo	ec8be274a7	[SPARK-1995][SQL] system function upper and lower can be supported I don't know whether it's time to implement system function about string operation in spark sql now. Author: egraldlo <egraldlo@gmail.com> Closes #936 from egraldlo/stringoperator and squashes the following commits: 3c6c60a [egraldlo] Add UPPER, LOWER, MAX and MIN into hive parser ea76d0a [egraldlo] modify the formatting issues b49f25e [egraldlo] modify the formatting issues 1f0bbb5 [egraldlo] system function upper and lower supported 13d3267 [egraldlo] system function upper and lower supported	2014-06-02 18:02:57 -07:00
Cheng Lian	d000ca98a8	[SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan. In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #939 from liancheng/spark-1958 and squashes the following commits: bdc4a14 [Cheng Lian] Copy rows to present immutable data to users 8250976 [Cheng Lian] Added return type explicitly for public API 192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.	2014-06-02 12:09:43 -07:00
Tor Myklebust	9a5d482e09	[SPARK-1553] Alternating nonnegative least-squares This pull request includes a nonnegative least-squares solver (NNLS) tailored to the kinds of small-scale problems that come up when training matrix factorisation models by alternating nonnegative least-squares (ANNLS). The method used for the NNLS subproblems is based on the classical method of projected gradients. There is a modification where, if the set of active constraints has not changed since the last iteration, a conjugate gradient step is considered and possibly rejected in favour of the gradient; this improves convergence once the optimal face has been located. The NNLS solver is in `org.apache.spark.mllib.optimization.NNLSbyPCG`. Author: Tor Myklebust <tmyklebu@gmail.com> Closes #460 from tmyklebu/annls and squashes the following commits: 79bc4b5 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into annls 199b0bc [Tor Myklebust] Make the ctor private again and use the builder pattern. 7fbabf1 [Tor Myklebust] Cleanup matrix math in NNLSSuite. 65ef7f2 [Tor Myklebust] Make ALS's ctor public and remove a couple of "convenience" wrappers. 2d4f3cb [Tor Myklebust] Cleanup. 0cb4481 [Tor Myklebust] Drop the iteration limit from 40k to max(400,20n). e2a01d1 [Tor Myklebust] Create a workspace object for NNLS to cut down on memory allocations. b285106 [Tor Myklebust] Clean up NNLS test cases. 9c820b6 [Tor Myklebust] Tweak variable names. 8a1a436 [Tor Myklebust] Describe the problem and add a reference to Polyak's paper. 5345402 [Tor Myklebust] Style fixes that got eaten. ac673bd [Tor Myklebust] More safeguards against numerical ridiculousness. c288b6a [Tor Myklebust] Finish moving the NNLS solver. 9a82fa6 [Tor Myklebust] Fix scalastyle moanings. 33bf4f2 [Tor Myklebust] Fix missing space. 89ea0a8 [Tor Myklebust] Hack ALSSuite to support NNLS testing. f5dbf4d [Tor Myklebust] Teach ALS how to use the NNLS solver. 6cb563c [Tor Myklebust] Tests for the nonnegative least squares solver. a68ac10 [Tor Myklebust] A nonnegative least-squares solver.	2014-06-02 11:48:09 -07:00
Ankur Dave	9535f4045d	Add landmark-based Shortest Path algorithm to graphx.lib This is a modified version of apache/spark#10. Author: Ankur Dave <ankurdave@gmail.com> Author: Andres Perez <andres@tresata.com> Closes #933 from ankurdave/shortestpaths and squashes the following commits: 03a103c [Ankur Dave] Style fixes 7a1ff48 [Ankur Dave] Improve ShortestPaths documentation d75c8fc [Ankur Dave] Remove unnecessary VD type param, and pass through ED d983fb4 [Ankur Dave] Fix style errors 60ed8e6 [Andres Perez] Add Shortest-path computations to graphx.lib with unit tests.	2014-06-02 00:00:24 -07:00
Patrick Wendell	d17d221487	Better explanation for how to use MIMA excludes. This patch does a few things: 1. We have a file MimaExcludes.scala exclusively for excludes. 2. The test runner tells users about that file if a test fails. 3. I've added back the excludes used from 0.9->1.0. We should keep these in the project as an official audit trail of times where we decided to make exceptions. Author: Patrick Wendell <pwendell@gmail.com> Closes #937 from pwendell/mima and squashes the following commits: 7ee0db2 [Patrick Wendell] Better explanation for how to use MIMA excludes.	2014-06-01 17:27:05 -07:00
Reynold Xin	eea3aab4f2	Made spark_ec2.py PEP8 compliant. The change set is actually pretty small -- mostly whitespace changes. Admittedly this is a scary change due to the lack of tests to cover the ec2 scripts, and also because indentation actually impacts control flow in Python ... Look at changes without whitespace diff here: https://github.com/apache/spark/pull/891/files?w=1 Author: Reynold Xin <rxin@apache.org> Closes #891 from rxin/spark-ec2-pep8 and squashes the following commits: ac1bf11 [Reynold Xin] Made spark_ec2.py PEP8 compliant.	2014-06-01 15:39:04 -07:00
Yadid Ayzenberg	366c0c4c30	updated java code blocks in spark SQL guide such that ctx will refer to ... ...a JavaSparkContext and sqlCtx will refer to a JavaSQLContext Author: Yadid Ayzenberg <yadid@media.mit.edu> Closes #932 from yadid/master and squashes the following commits: f92fb3a [Yadid Ayzenberg] updated java code blocks in spark SQL guide such that ctx will refer to a JavaSparkContext and sqlCtx will refer to a JavaSQLContext	2014-05-31 19:44:13 -07:00
Uri Laserson	5e98967b61	SPARK-1917: fix PySpark import of scipy.special functions https://issues.apache.org/jira/browse/SPARK-1917 Author: Uri Laserson <laserson@cloudera.com> Closes #866 from laserson/SPARK-1917 and squashes the following commits: d947e8c [Uri Laserson] Added test for scipy.special importing 1798bbd [Uri Laserson] SPARK-1917: fix PySpark import of scipy.special	2014-05-31 14:59:09 -07:00
witgo	d8c005d537	Improve maven plugin configuration Author: witgo <witgo@qq.com> Closes #786 from witgo/maven_plugin and squashes the following commits: 5de86a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into maven_plugin c35ef73 [witgo] Improve maven plugin configuration	2014-05-31 14:36:27 -07:00
Aaron Davidson	9909efc10a	SPARK-1839: PySpark RDD#take() shouldn't always read from driver This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.) Author: Aaron Davidson <aaron@databricks.com> Closes #922 from aarondav/take and squashes the following commits: fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver	2014-05-31 13:04:57 -07:00
Aaron Davidson	7d52777eff	Super minor: Close inputStream in SparkSubmitArguments `Properties#load()` doesn't close the InputStream, but it'd be closed after being GC'd anyway... Also changed file.getName to file, because getName only shows the filename. This will show the full (possibly relative) path, which is less confusing if it's not found. Author: Aaron Davidson <aaron@databricks.com> Closes #914 from aarondav/tiny and squashes the following commits: db9d072 [Aaron Davidson] Super minor: Close inputStream in SparkSubmitArguments	2014-05-31 12:36:58 -07:00
Michael Armbrust	1a0da0ec57	[SQL] SPARK-1964 Add timestamp to hive metastore type parser. Author: Michael Armbrust <michael@databricks.com> Closes #913 from marmbrus/timestampMetastore and squashes the following commits: 8e0154f [Michael Armbrust] Add timestamp to hive metastore type parser.	2014-05-31 12:34:22 -07:00
Michael Armbrust	7463cd248f	Optionally include Hive as a dependency of the REPL. Due to the way spark-shell launches from an assembly jar, I don't think this change will affect anyone who isn't trying to launch the shell directly from sbt. That said, it is kinda nice to be able to launch all things directly from SBT when developing. Author: Michael Armbrust <michael@databricks.com> Closes #801 from marmbrus/hiveRepl and squashes the following commits: 9570571 [Michael Armbrust] Optionally include Hive as a dependency of the REPL.	2014-05-31 12:24:35 -07:00
Takuya UESHIN	3ce81494c5	[SPARK-1947] [SQL] Child of SumDistinct or Average should be widened to prevent overflows the same as Sum. Child of `SumDistinct` or `Average` should be widened to prevent overflows the same as `Sum`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #902 from ueshin/issues/SPARK-1947 and squashes the following commits: 99c3dcb [Takuya UESHIN] Insert Cast for SumDistinct and Average.	2014-05-31 11:30:03 -07:00
Chen Chao	9ecc40d3ae	correct tiny comment error Author: Chen Chao <crazyjvm@gmail.com> Closes #928 from CrazyJvm/patch-8 and squashes the following commits: 144328b [Chen Chao] correct tiny comment error	2014-05-31 00:06:49 -07:00
Cheng Lian	cf989601d0	[SPARK-1959] String "NULL" shouldn't be interpreted as null value JIRA issue: [SPARK-1959](https://issues.apache.org/jira/browse/SPARK-1959) Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #909 from liancheng/spark-1959 and squashes the following commits: 306659c [Cheng Lian] [SPARK-1959] String "NULL" shouldn't be interpreted as null value	2014-05-30 22:13:11 -07:00
CodingCat	41bfdda3cc	SPARK-1976: fix the misleading part in streaming docs Spark streaming requires at least two working threads, but the document gives the example like import org.apache.spark.api.java.function._ import org.apache.spark.streaming._ import org.apache.spark.streaming.api._ // Create a StreamingContext with a local master val ssc = new StreamingContext("local", "NetworkWordCount", Seconds(1)) http://spark.apache.org/docs/latest/streaming-programming-guide.html Author: CodingCat <zhunansjtu@gmail.com> Closes #924 from CodingCat/master and squashes the following commits: bb89f20 [CodingCat] update streaming docs	2014-05-30 22:06:08 -07:00
nchammas	23ae36630a	updated link to mailing list Author: nchammas <nicholas.chammas@gmail.com> Closes #923 from nchammas/patch-1 and squashes the following commits: 65c4d18 [nchammas] updated link to mailing list	2014-05-30 22:04:57 -07:00
Andrew Ash	9c1f204d80	Typo: and -> an Author: Andrew Ash <andrew@andrewash.com> Closes #927 from ash211/patch-5 and squashes the following commits: 79b577d [Andrew Ash] Typo: and -> an	2014-05-30 22:02:04 -07:00
Zhen Peng	ff562b2396	[SPARK-1901] worker should make sure executor has exited before updating executor's info https://issues.apache.org/jira/browse/SPARK-1901 Author: Zhen Peng <zhenpeng01@baidu.com> Closes #854 from zhpengg/bugfix-worker-kills-executor and squashes the following commits: 21d380b [Zhen Peng] add some error messages 506cea6 [Zhen Peng] add some docs for killProcess() a0b9860 [Zhen Peng] [SPARK-1901] worker should make sure executor has exited before updating executor's info	2014-05-30 10:12:51 -07:00
Prashant Sharma	79fa8fd4b1	[SPARK-1971] Update MIMA to compare against Spark 1.0.0 Author: Prashant Sharma <prashant.s@imaginea.com> Closes #910 from ScrapCodes/enable-mima/spark-core and squashes the following commits: 79f3687 [Prashant Sharma] updated Mima to check against version 1.0 1e8969c [Prashant Sharma] Spark core missed out on Mima settings. So in effect we never tested spark core for mima related errors.	2014-05-30 01:13:51 -07:00
Matei Zaharia	c8bf4131bc	[SPARK-1566] consolidate programming guide, and general doc updates This is a fairly large PR to clean up and update the docs for 1.0. The major changes are: * A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs * New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark * Spark-submit guide moved to a separate page and expanded slightly * Various cleanups of the menu system, security docs, and others * Updated look of title bar to differentiate the docs from previous Spark versions You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html. Author: Matei Zaharia <matei@databricks.com> Closes #896 from mateiz/1.0-docs and squashes the following commits: 03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs 0779508 [Matei Zaharia] tweak ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks 1bf4112 [Matei Zaharia] Review comments 4414f88 [Matei Zaharia] tweaks d04e979 [Matei Zaharia] Fix some old links to Java guide a34ed33 [Matei Zaharia] tweak 541bb3b [Matei Zaharia] miscellaneous changes fcefdec [Matei Zaharia] Moved submitting apps to separate doc 61d72b4 [Matei Zaharia] stuff 181f217 [Matei Zaharia] migration guide, remove old language guides e11a0da [Matei Zaharia] Add more API functions 6a030a9 [Matei Zaharia] tweaks 8db0ae3 [Matei Zaharia] Added key-value pairs section 318d2c9 [Matei Zaharia] tweaks 1c81477 [Matei Zaharia] New section on basics and function syntax e38f559 [Matei Zaharia] Actually added programming guide to Git a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout 3b6a876 [Matei Zaharia] More CSS tweaks 01ec8bf [Matei Zaharia] More CSS tweaks e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0	2014-05-30 00:34:33 -07:00
Prashant Sharma	eeee978a34	[SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware. We add all the classes annotated as `DeveloperApi` to `~/.mima-excludes`. Author: Prashant Sharma <prashant.s@imaginea.com> Author: nikhil7sh <nikhilsharmalnmiit@gmail.ccom> Closes #904 from ScrapCodes/SPARK-1820/ignore-Developer-Api and squashes the following commits: de944f9 [Prashant Sharma] Code review. e3c5215 [Prashant Sharma] Incorporated patrick's suggestions and fixed the scalastyle build. 9983a42 [nikhil7sh] [SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware	2014-05-29 23:20:20 -07:00
Ankur Dave	b7e28fa451	initial version of LPA A straightforward implementation of LPA algorithm for detecting graph communities using the Pregel framework. Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach. Author: Ankur Dave <ankurdave@gmail.com> Author: haroldsultan <haroldsultan@gmail.com> Author: Harold Sultan <haroldsultan@gmail.com> Closes #905 from haroldsultan/master and squashes the following commits: 327aee0 [haroldsultan] Merge pull request #2 from ankurdave/label-propagation 227a4d0 [Ankur Dave] Untabify 0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation 0e24303 [Ankur Dave] Add LabelPropagationSuite 84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA 9830342 [Harold Sultan] initial version of LPA	2014-05-29 15:39:25 -07:00
Cheng Lian	8f7141fbc0	[SPARK-1368][SQL] Optimized HiveTableScan JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368) This PR introduces two major updates: - Replaced FP style code with `while` loop and reusable `GenericMutableRow` object in critical path of `HiveTableScan`. - Using `ColumnProjectionUtils` to help optimize RCFile and ORC column pruning. My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning CSV table and RCFile table respectively: ``` Original: [info] CSV: 27676 ms, RCFile: 26415 ms [info] CSV: 27703 ms, RCFile: 26029 ms [info] CSV: 27511 ms, RCFile: 25962 ms Optimized: [info] CSV: 13820 ms, RCFile: 10402 ms [info] CSV: 14158 ms, RCFile: 10691 ms [info] CSV: 13606 ms, RCFile: 10346 ms ``` The micro benchmark loads a 609MB CSV file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and a RCFile table, then scans these tables respectively. Preparation code: ```scala package org.apache.spark.examples.sql.hive import org.apache.spark.sql.hive.LocalHiveContext import org.apache.spark.{SparkConf, SparkContext} object HiveTableScanPrepare extends App { val sparkContext = new SparkContext( new SparkConf() .setMaster("local") .setAppName(getClass.getSimpleName.stripSuffix("$"))) val hiveContext = new LocalHiveContext(sparkContext) import hiveContext._ hql("drop table scan_csv") hql("drop table scan_rcfile") hql("""create table scan_csv (key int, value string) | row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | with serdeproperties ('field.delim'=',') """.stripMargin) hql(s"""load data local inpath "${args(0)}" into table scan_csv""") hql("""create table scan_rcfile (key int, value string) | row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' |stored as | inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' | outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat' """.stripMargin) hql( """ |from scan_csv |insert overwrite table scan_rcfile |select scan_csv.key, scan_csv.value """.stripMargin) } ``` Benchmark code: ```scala package org.apache.spark.examples.sql.hive import org.apache.spark.sql.hive.LocalHiveContext import org.apache.spark.{SparkConf, SparkContext} object HiveTableScanBenchmark extends App { val sparkContext = new SparkContext( new SparkConf() .setMaster("local") .setAppName(getClass.getSimpleName.stripSuffix("$"))) val hiveContext = new LocalHiveContext(sparkContext) import hiveContext._ val scanCsv = hql("select key from scan_csv") val scanRcfile = hql("select key from scan_rcfile") val csvDuration = benchmark(scanCsv.count()) val rcfileDuration = benchmark(scanRcfile.count()) println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms") def benchmark(f: => Unit) = { val begin = System.currentTimeMillis() f val end = System.currentTimeMillis() end - begin } } ``` @marmbrus Please help review, thanks! Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #758 from liancheng/fastHiveTableScan and squashes the following commits: 4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest cf640d8 [Cheng Lian] More HiveTableScan optimisations: bf0e7dc [Cheng Lian] Added SortedOperation pattern to match some definitely sorted operations and avoid some sorting cost in HiveComparisonTest. 6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan	2014-05-29 15:24:03 -07:00

7800 commits