ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kousuke Saruta	4feb46c5fe	[SPARK-3401][PySpark] Wrong usage of tee command in python/run-tests Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2272 from sarutak/SPARK-3401 and squashes the following commits: 2b35a59 [Kousuke Saruta] Modified wrong usage of tee command in python/run-tests	2014-09-04 10:29:11 -07:00
GuoQiang Li	905861906e	[Minor]Remove extra semicolon in FlumeStreamSuite.scala Author: GuoQiang Li <witgo@qq.com> Closes #2265 from witgo/FlumeStreamSuite and squashes the following commits: 6c99e6e [GuoQiang Li] Remove extra semicolon in FlumeStreamSuite.scala	2014-09-04 10:28:23 -07:00
Ankur Dave	00362dac97	[HOTFIX] [SPARK-3400] Revert `9b225ac` "fix GraphX EdgeRDD zipPartitions" `9b225ac307` has been causing GraphX tests to fail nondeterministically, which is blocking development for others. Author: Ankur Dave <ankurdave@gmail.com> Closes #2271 from ankurdave/SPARK-3400 and squashes the following commits: 10c2a97 [Ankur Dave] [HOTFIX] [SPARK-3400] Revert `9b225ac` "fix GraphX EdgeRDD zipPartitions"	2014-09-03 23:49:47 -07:00
Kousuke Saruta	1bed0a3869	[SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to multi-byte character contained in Gradient.scala Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2248 from sarutak/SPARK-3372 and squashes the following commits: 73a28b8 [Kousuke Saruta] Replaced UTF-8 hyphen with ascii hyphen	2014-09-03 20:47:00 -07:00
Matthew Farrellee	7c6e71f05f	[SPARK-2435] Add shutdown hook to pyspark Author: Matthew Farrellee <matt@redhat.com> Closes #2183 from mattf/SPARK-2435 and squashes the following commits: ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark	2014-09-03 19:37:37 -07:00
Davies Liu	c5cbc49233	[SPARK-3335] [SQL] [PySpark] support broadcast in Python UDF After this patch, broadcast can be used in Python UDF. Author: Davies Liu <davies.liu@gmail.com> Closes #2243 from davies/udf_broadcast and squashes the following commits: 7b88861 [Davies Liu] support broadcast in UDF	2014-09-03 19:08:39 -07:00
Cheng Lian	248067adbe	[SPARK-2961][SQL] Use statistics to prune batches within cached partitions This PR is based on #1883 authored by marmbrus. Key differences: 1. Batch pruning instead of partition pruning When #1883 was authored, batched column buffer building (#1880) hadn't been introduced. This PR combines these two and provide partition batch level pruning, which leads to smaller memory footprints and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (partition number multiplies batch number per partition). 1. More filters are supported Filter predicates consist of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2188 from liancheng/in-mem-batch-pruning and squashes the following commits: 68cf019 [Cheng Lian] Marked sqlContext as @transient 4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite 3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default 062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup 16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions 16195c5 [Cheng Lian] Enabled both disjunction and conjunction 89950d0 [Cheng Lian] Worked around Scala style check 9c167f6 [Cheng Lian] Minor code cleanup 3c4d5c7 [Cheng Lian] Minor code cleanup ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite fc517d0 [Cheng Lian] More test cases 1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes 385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning	2014-09-03 18:59:26 -07:00
Cheng Lian	f48420fde5	[SPARK-2973][SQL] Lightweight SQL commands without distributed jobs when calling .collect() By overriding `executeCollect()` in physical plan classes of all commands, we can avoid to kick off a distributed job when collecting result of a SQL command, e.g. `sql("SET").collect()`. Previously, `Command.sideEffectResult` returns a `Seq[Any]`, and the `execute()` method in sub-classes of `Command` typically convert that to a `Seq[Row]` then parallelize it to an RDD. Now with this PR, `sideEffectResult` is required to return a `Seq[Row]` directly, so that `executeCollect()` can directly leverage that and be factored to the `Command` parent class. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2215 from liancheng/lightweight-commands and squashes the following commits: 3fbef60 [Cheng Lian] Factored execute() method of physical commands to parent class Command 5a0e16c [Cheng Lian] Passes test suites e0e12e9 [Cheng Lian] Refactored Command.sideEffectResult and Command.executeCollect 995bdd8 [Cheng Lian] Cleaned up DescribeHiveTableCommand 542977c [Cheng Lian] Avoids confusion between logical and physical plan by adding package prefixes 55b2aa5 [Cheng Lian] Avoids distributed jobs when execution SQL commands	2014-09-03 18:57:20 -07:00
Kousuke Saruta	4bba10c41a	[SPARK-3233] Executor never stop its SparnEnv, BlockManager, ConnectionManager etc. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2138 from sarutak/SPARK-3233 and squashes the following commits: c0205b7 [Kousuke Saruta] Merge branch 'SPARK-3233' of github.com:sarutak/spark into SPARK-3233 064679d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233 d3005fd [Kousuke Saruta] Modified Class definition format of BlockManagerMaster 039b747 [Kousuke Saruta] Modified style 889e2d1 [Kousuke Saruta] Modified BlockManagerMaster to be able to be past isDriver flag `4da8535` [Kousuke Saruta] Modified BlockManagerMaster#stop to send StopBlockManagerMaster message when sender is Driver 6518c3a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233 d5ab19a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233 6bce25c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233 6058a58 [Kousuke Saruta] Modified Executor not to invoke SparkEnv#stop in local mode e5ad9d3 [Kousuke Saruta] Modified Executor to stop SparnEnv at the end of itself	2014-09-03 18:42:01 -07:00
scwf	e08ea7393d	[SPARK-3303][core] fix SparkContextSchedulerCreationSuite test error run test with the master branch with this command when mesos native lib is set sbt/sbt -Phive "test-only org.apache.spark.SparkContextSchedulerCreationSuite" get this error: [info] SparkContextSchedulerCreationSuite: [info] - bad-master [info] - local [info] - local-* [info] - local-n [info] - local--n-failures [info] - local-n-failures [info] - bad-local-n [info] - bad-local-n-failures [info] - local-default-parallelism [info] - simr [info] - local-cluster [info] - yarn-cluster [info] - yarn-standalone [info] - yarn-client [info] - mesos fine-grained [info] - mesos coarse-grained FAILED * [info] Executor Spark home `spark.mesos.executor.home` is not set! Since `executorSparkHome` only used in `createCommand`, move `val executorSparkHome...` to `createCommand` to fix this issue. Author: scwf <wangfei1@huawei.com> Author: wangfei <wangfei_hello@126.com> Closes #2199 from scwf/SparkContextSchedulerCreationSuite and squashes the following commits: ef1de22 [scwf] fix code fomate 19d26f3 [scwf] fix conflict d9a8a60 [wangfei] fix SparkContextSchedulerCreationSuite test error	2014-09-03 18:39:13 -07:00
Tathagata Das	a522407928	[SPARK-2419][Streaming][Docs] Updates to the streaming programming guide Updated the main streaming programming guide, and also added source-specific guides for Kafka, Flume, Kinesis. Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Jacek Laskowski <jacek@japila.pl> Closes #2254 from tdas/streaming-doc-fix and squashes the following commits: e45c6d7 [Jacek Laskowski] More fixes from an old PR 5125316 [Tathagata Das] Fixed links dc02f26 [Tathagata Das] Refactored streaming kinesis guide and made many other changes. acbc3e3 [Tathagata Das] Fixed links between streaming guides. cb7007f [Tathagata Das] Added Streaming + Flume integration guide. 9bd9407 [Tathagata Das] Updated streaming programming guide with additional information from SPARK-2419.	2014-09-03 17:38:01 -07:00
Liang-Chi Hsieh	996b7434ee	[SPARK-3345] Do correct parameters for ShuffleFileGroup In the method `newFileGroup` of class `FileShuffleBlockManager`, the parameters for creating new `ShuffleFileGroup` object is in wrong order. Because in current codes, the parameters `shuffleId` and `fileId` are not used. So it doesn't cause problem now. However it should be corrected for readability and avoid future problem. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #2235 from viirya/correct_shufflefilegroup_params and squashes the following commits: fe72567 [Liang-Chi Hsieh] Do correct parameters for ShuffleFileGroup.	2014-09-03 17:04:53 -07:00
Andrew Or	2784822e4c	[Minor] Fix outdated Spark version This is causing the event logs to include a file called SPARK_VERSION_1.0.0, which is not accurate. Author: Andrew Or <andrewor14@gmail.com> Author: andrewor14 <andrewor14@gmail.com> Closes #2255 from andrewor14/spark-version and squashes the following commits: 1fbdfe9 [andrewor14] Snapshot 805a1c8 [Andrew Or] JK. Update Spark version to 1.2.0 instead. bffbaab [Andrew Or] Update Spark version to 1.1.0	2014-09-03 16:58:19 -07:00
Marcelo Vanzin	f2b5b619a9	[SPARK-3388] Expose aplication ID in ApplicationStart event, use it in history server. This change exposes the application ID generated by the Spark Master, Mesos or Yarn via the SparkListenerApplicationStart event. It then uses that information to expose the application via its ID in the history server, instead of using the internal directory name generated by the event logger as an application id. This allows someone who knows the application ID to easily figure out the URL for the application's entry in the HS, aside from looking better. In Yarn mode, this is used to generate a direct link from the RM application list to the Spark history server entry (thus providing a fix for SPARK-2150). Note this sort of assumes that the different managers will generate app ids that are sufficiently different from each other that clashes will not occur. Author: Marcelo Vanzin <vanzin@cloudera.com> This patch had conflicts when merged, resolved by Committer: Andrew Or <andrewor14@gmail.com> Closes #1218 from vanzin/yarn-hs-link-2 and squashes the following commits: 2d19f3c [Marcelo Vanzin] Review feedback. 6706d3a [Marcelo Vanzin] Implement applicationId() in base classes. 56fe42e [Marcelo Vanzin] Fix cluster mode history address, plus a cleanup. 44112a8 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2 8278316 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2 a86bbcf [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2 a0056e6 [Marcelo Vanzin] Unbreak test. 4b10cfd [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2 cb0cab2 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2 25f2826 [Marcelo Vanzin] Add MIMA excludes. f0ba90f [Marcelo Vanzin] Use BufferedIterator. c90a08d [Marcelo Vanzin] Remove unused code. 3f8ec66 [Marcelo Vanzin] Review feedback. 21aa71b [Marcelo Vanzin] Fix JSON test. b022bae [Marcelo Vanzin] Undo SparkContext cleanup. c6d7478 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2 4e3483f [Marcelo Vanzin] Fix test. 57517b8 [Marcelo Vanzin] Review feedback. Mostly, more consistent use of Scala's Option. 311e49d [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2 d35d86f [Marcelo Vanzin] Fix yarn backend after rebase. 36dc362 [Marcelo Vanzin] Don't use Iterator::takeWhile(). 0afd696 [Marcelo Vanzin] Wait until master responds before returning from start(). abc4697 [Marcelo Vanzin] Make FsHistoryProvider keep a map of applications by id. 26b266e [Marcelo Vanzin] Use Mesos framework ID as Spark application ID. b3f3664 [Marcelo Vanzin] [yarn] Make the RM link point to the app direcly in the HS. 2fb7de4 [Marcelo Vanzin] Expose the application ID in the ApplicationStart event. ed10348 [Marcelo Vanzin] Expose application id to spark context.	2014-09-03 14:57:38 -07:00
Marcelo Vanzin	ccc69e26ec	[SPARK-2845] Add timestamps to block manager events. These are not used by the UI but are useful when analysing the logs from a spark job. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #654 from vanzin/bm-event-tstamp and squashes the following commits: d5d6e66 [Marcelo Vanzin] Fix tests. ec06218 [Marcelo Vanzin] Review feedback. f134dbc [Marcelo Vanzin] Merge branch 'master' into bm-event-tstamp b495b7c [Marcelo Vanzin] Merge branch 'master' into bm-event-tstamp 7d2fe9e [Marcelo Vanzin] Review feedback. d6f381c [Marcelo Vanzin] Update tests added after patch was created. 45e3bf8 [Marcelo Vanzin] Fix unit test after merge. b37a10f [Marcelo Vanzin] Use === in test assertions. ef72824 [Marcelo Vanzin] Handle backwards compatibility with 1.0.0. aca1151 [Marcelo Vanzin] Fix unit test to check new fields. efdda8e [Marcelo Vanzin] Add timestamps to block manager events.	2014-09-03 14:47:11 -07:00
RJ Nowling	e5d376801d	[SPARK-3263][GraphX] Fix changes made to GraphGenerator.logNormalGraph in PR #720 PR #720 made multiple changes to GraphGenerator.logNormalGraph including: * Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect. * Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated. * Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter. * Failed to update scala docs and programming guide for API changes * Added a Synthetic Benchmark in the examples. This PR: * Removes the in-line calls and calls original vertex / edge generation functions again * Adds an optional seed parameter for deterministic behavior (when desired) * Keeps the number of partitions parameter that was added. * Keeps compatibility with the synthetic benchmark example * Maintains backwards-compatible API Author: RJ Nowling <rnowling@gmail.com> Author: Ankur Dave <ankurdave@gmail.com> Closes #2168 from rnowling/graphgenrand and squashes the following commits: f1cd79f [Ankur Dave] Style fixes e11918e [RJ Nowling] Fix bad comparisons in unit tests 785ac70 [RJ Nowling] Fix style error c70868d [RJ Nowling] Fix logNormalGraph scala doc for seed 41fd1f8 [RJ Nowling] Fix logNormalGraph scala doc for seed 799f002 [RJ Nowling] Added test for different seeds for sampleLogNormal 43949ad [RJ Nowling] Added test for different seeds for generateRandomEdges 2faf75f [RJ Nowling] Added unit test for logNormalGraph 82f22397 [RJ Nowling] Add unit test for sampleLogNormal b99cba9 [RJ Nowling] Make sampleLogNormal private to Spark (vs private) for unit testing 6803da1 [RJ Nowling] Add GraphGeneratorsSuite with test for generateRandomEdges 1c8fc44 [RJ Nowling] Connected components part of SynthBenchmark was failing to call count on RDD before printing dfbb6dd [RJ Nowling] Fix parameter name in SynthBenchmark docs b5eeb80 [RJ Nowling] Add optional seed parameter to SynthBenchmark and set default to randomly generate a seed 1ff8d30 [RJ Nowling] Fix bug in generateRandomEdges where numVertices instead of numEdges was used to control number of edges to generate 98bb73c [RJ Nowling] Add documentation for logNormalGraph parameters d40141a [RJ Nowling] Fix style error 684804d [RJ Nowling] revert PR #720 which introduce errors in logNormalGraph and messed up seeding of RNGs. Add user-defined optional seed for deterministic behavior c183136 [RJ Nowling] Fix to deterministic GraphGenerators.logNormalGraph that allows generating graphs randomly using optional seed. 015010c [RJ Nowling] Fixed GraphGenerator logNormalGraph API to make backward-incompatible change in commit `894ecde04`	2014-09-03 14:16:06 -07:00
Davies Liu	6481d27425	[SPARK-3309] [PySpark] Put all public API in __all__ Put all public API in __all__, also put them all in pyspark.__init__.py, then we can got all the documents for public API by `pydoc pyspark`. It also can be used by other programs (such as Sphinx or Epydoc) to generate only documents for public APIs. Author: Davies Liu <davies.liu@gmail.com> Closes #2205 from davies/public and squashes the following commits: c6c5567 [Davies Liu] fix message f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module 7e3016a [Davies Liu] add __all__ in mllib 6281b48 [Davies Liu] fix doc for SchemaRDD 6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py	2014-09-03 11:49:45 -07:00
Marcelo Vanzin	6a72a36940	[SPARK-3187] [yarn] Cleanup allocator code. Move all shared logic to the base YarnAllocator class, and leave the version-specific logic in the version-specific module. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2169 from vanzin/SPARK-3187 and squashes the following commits: 46c2826 [Marcelo Vanzin] Hide the privates. 4dc9c83 [Marcelo Vanzin] Actually release containers. 8b1a077 [Marcelo Vanzin] Changes to the Yarn alpha allocator. f3f5f1d [Marcelo Vanzin] [SPARK-3187] [yarn] Cleanup allocator code.	2014-09-03 08:22:50 -05:00
Patrick Wendell	c64cc435e2	SPARK-3358: [EC2] Switch back to HVM instances for m3.X. During regression tests of Spark 1.1 we discovered perf issues with PVM instances when running PySpark. This reverts a change added in #1156 which changed the default type for m3 instances to PVM. Author: Patrick Wendell <pwendell@gmail.com> Closes #2244 from pwendell/ec2-hvm and squashes the following commits: 1342d7e [Patrick Wendell] SPARK-3358: [EC2] Switch back to HVM instances for m3.X.	2014-09-02 21:30:09 -07:00
Liang-Chi Hsieh	24ab384018	[SPARK-3300][SQL] No need to call clear() and shorten build() The function `ensureFreeSpace` in object `ColumnBuilder` clears old buffer before copying its content to new buffer. This PR fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #2195 from viirya/fix_buffer_clear and squashes the following commits: 792f009 [Liang-Chi Hsieh] no need to call clear(). use flip() instead of calling limit(), position() and rewind(). df2169f [Liang-Chi Hsieh] should clean old buffer after copying its content.	2014-09-02 20:51:25 -07:00
Cheng Lian	19d3e1e8e9	[SQL] Renamed ColumnStat to ColumnMetrics to avoid confusion between ColumnStats Class names of these two are just too similar. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2189 from liancheng/column-metrics and squashes the following commits: 8bb3b21 [Cheng Lian] Renamed ColumnStat to ColumnMetrics to avoid confusion between ColumnStats	2014-09-02 20:49:36 -07:00
Takuya UESHIN	0cd91f666d	[SPARK-3341][SQL] The dataType of Sqrt expression should be DoubleType. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2233 from ueshin/issues/SPARK-3341 and squashes the following commits: e497320 [Takuya UESHIN] Fix data type of Sqrt expression.	2014-09-02 20:31:15 -07:00
luluorta	9b225ac307	[SPARK-2823][GraphX]fix GraphX EdgeRDD zipPartitions If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition number, GraphX jobs will throw: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions Author: luluorta <luluorta@gmail.com> Closes #1763 from luluorta/fix-graph-zip and squashes the following commits: 8338961 [luluorta] fix GraphX EdgeRDD zipPartitions	2014-09-02 19:26:27 -07:00
Tathagata Das	e9bb12bea9	[SPARK-1981][Streaming][Hotfix] Fixed docs related to kinesis - Include kinesis in the unidocs - Hide non-public classes from docs Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #2239 from tdas/kinesis-doc-fix and squashes the following commits: 156e20c [Tathagata Das] More fixes, based on PR comments. e9a6c01 [Tathagata Das] Fixed docs related to kinesis	2014-09-02 19:02:48 -07:00
Larry Xiao	aa7de128c5	[SPARK-2981][GraphX] EdgePartition1D Int overflow minor fix detail is here: https://issues.apache.org/jira/browse/SPARK-2981 Author: Larry Xiao <xiaodi@sjtu.edu.cn> Closes #1902 from larryxiao/2981 and squashes the following commits: 88059a2 [Larry Xiao] [SPARK-2981][GraphX] EdgePartition1D Int overflow	2014-09-02 18:50:52 -07:00
uncleGen	7c9bbf1725	[SPARK-3123][GraphX]: override the "setName" function to set EdgeRDD's name manually just as VertexRDD does. Author: uncleGen <hustyugm@gmail.com> Closes #2033 from uncleGen/master_origin and squashes the following commits: 801994b [uncleGen] Update EdgeRDD.scala	2014-09-02 18:44:58 -07:00
Larry Xiao	7c92b49d6b	[SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples to support ~/spark/bin/run-example GraphXAnalytics triangles /soc-LiveJournal1.txt --numEPart=256 Author: Larry Xiao <xiaodi@sjtu.edu.cn> Closes #1766 from larryxiao/1986 and squashes the following commits: bb77cd9 [Larry Xiao] [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples	2014-09-02 18:29:08 -07:00
Prudhvi Krishna	644e31524a	SPARK-3328 fixed make-distribution script --with-tachyon option. Directory path for dependencies jar and resources in Tachyon 0.5.0 has been changed. Author: Prudhvi Krishna <prudhvi953@gmail.com> Closes #2228 from prudhvije/SPARK-3328/make-dist-fix and squashes the following commits: d1d2c22 [Prudhvi Krishna] SPARK-3328 fixed make-distribution script --with-tachyon option.	2014-09-02 17:36:53 -07:00
Davies Liu	e2c901b4c7	[SPARK-2871] [PySpark] add countApproxDistinct() API RDD.countApproxDistinct(relativeSD=0.05): :: Experimental :: Return approximate number of distinct elements in the RDD. The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>. This support all the types of objects, which is supported by Pyrolite, nearly all builtin types. param relativeSD Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017. >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct() >>> 950 < n < 1050 True >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct() >>> 18 < n < 22 True Author: Davies Liu <davies.liu@gmail.com> Closes #2142 from davies/countApproxDistinct and squashes the following commits: e20da47 [Davies Liu] remove the correction in Python c38c4e4 [Davies Liu] fix doc tests 2ab157c [Davies Liu] fix doc tests 9d2565f [Davies Liu] add commments and link for hash collision correction d306492 [Davies Liu] change range of hash of tuple to [0, maxint] ded624f [Davies Liu] calculate hash in Python 4cba98f [Davies Liu] add more tests a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct e97e342 [Davies Liu] add countApproxDistinct()	2014-09-02 15:47:47 -07:00
Sandy Ryza	81b9d5b628	SPARK-3052. Misleading and spurious FileSystem closed errors whenever a ... ...job fails while reading from Hadoop Author: Sandy Ryza <sandy@cloudera.com> Closes #1956 from sryza/sandy-spark-3052 and squashes the following commits: 815813a [Sandy Ryza] SPARK-3052. Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop	2014-09-02 11:34:55 -07:00
Marcelo Vanzin	066f31a6b2	[SPARK-3347] [yarn] Fix yarn-alpha compilation. Missing import. Oops. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2236 from vanzin/SPARK-3347 and squashes the following commits: 594fc39 [Marcelo Vanzin] [SPARK-3347] [yarn] Fix yarn-alpha compilation.	2014-09-02 13:33:23 -05:00
Andrew Or	8f1f9aaf40	[SPARK-1919] Fix Windows spark-shell --jars We were trying to add `file:/C:/path/to/my.jar` to the class path. We should add `C:/path/to/my.jar` instead. Tested on Windows 8.1. Author: Andrew Or <andrewor14@gmail.com> Closes #2211 from andrewor14/windows-shell-jars and squashes the following commits: 262c6a2 [Andrew Or] Oops... Add the new code to the correct place 0d5a0c1 [Andrew Or] Format jar path only for adding to shell classpath 42bd626 [Andrew Or] Remove unnecessary code 0049f1b [Andrew Or] Remove embarrassing log messages b1755a0 [Andrew Or] Format jar paths properly before adding them to the classpath	2014-09-02 10:47:05 -07:00
Josh Rosen	378b2315b4	[SPARK-3061] Fix Maven build under Windows The Maven build was failing on Windows because it tried to call the unix `unzip` utility to extract the Py4J files into core's build directory. I've fixed this issue by using the `maven-antrun-plugin` to perform the unzipping. I also fixed an issue that prevented tests from running under Windows: In the Maven ScalaTest plugin, the filename listed in <filereports> is placed under the <reportsDirectory>; the current code places it in a subdirectory of reportsDirectory, e.g. ``` ${project.build.directory}/surefire-reports/${project.build.directory}/SparkTestSuite.txt ``` This caused problems under Windows because it would try to create a subdirectory named "c:\\". Note that the tests still fail under Windows (for other reasons); this PR just allows them to run and fail rather than crash when trying to create the test reports directory. Author: Josh Rosen <joshrosen@apache.org> Author: Josh Rosen <rosenville@gmail.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #2165 from JoshRosen/windows-support and squashes the following commits: 651d210 [Josh Rosen] Unzip to python/build instead of core/build fbf3e61 [Josh Rosen] 4 spaces -> 2 spaces e347668 [Josh Rosen] Fix Maven scalatest filereports path: 4994af1 [Josh Rosen] [SPARK-3061] Use maven-antrun-plugin to unzip Py4J.	2014-09-02 10:45:14 -07:00
Sean Owen	32ec0a8cd4	SPARK-3331 [BUILD] PEP8 tests fail because they check unzipped py4j code PEP8 tests run on files under "./python", but unzipped py4j code is found at "./python/build/py4j". Py4J code fails style checks and can fail ./dev/run-tests if this code is present locally. Author: Sean Owen <sowen@cloudera.com> Closes #2222 from srowen/SPARK-3331 and squashes the following commits: 34711ec [Sean Owen] Restrict lint check to pyspark/, since the local directory can contain unzipped py4j code in build/py4j	2014-09-02 10:30:26 -07:00
Reza Zadeh	0f16b23cd1	[MLlib] Squash bug in IndexedRowMatrix Kill this bug fast before it does damage. Author: Reza Zadeh <rizlar@gmail.com> Closes #2224 from rezazadeh/indexrmbug and squashes the following commits: 53386d6 [Reza Zadeh] Squash bug in IndexedRowMatrix	2014-09-02 09:48:05 -07:00
lirui	fbf2678c16	SPARK-2636: Expose job ID in JobWaiter API This PR adds the async actions to the Java API. User can call these async actions to get the FutureAction and use JobWaiter (for SimpleFutureAction) to retrieve job Id. Author: lirui <rui.li@intel.com> Closes #2176 from lirui-intel/SPARK-2636 and squashes the following commits: ccaafb7 [lirui] SPARK-2636: fix java doc 5536d55 [lirui] SPARK-2636: mark the async API as experimental e2e01d5 [lirui] SPARK-2636: add mima exclude 0ca320d [lirui] SPARK-2636: fix method name & javadoc 3fa39f7 [lirui] SPARK-2636: refine the patch af4f5d9 [lirui] SPARK-2636: remove unused imports 843276c [lirui] SPARK-2636: only keep foreachAsync in the java API fbf5744 [lirui] SPARK-2636: add more async actions for java api 1b25abc [lirui] SPARK-2636: expose some fields in JobWaiter d09f732 [lirui] SPARK-2636: fix build eb1ee79 [lirui] SPARK-2636: change some parameters in SimpleFutureAction to member field 6e2b87b [lirui] SPARK-2636: add java API for async actions	2014-09-01 23:28:19 -07:00
Daniel Darabos	44d3a6a752	[SPARK-3342] Add SSDs to block device mapping On `m3.2xlarge` instances the 2x80GB SSDs are inaccessible if not added to the block device mapping when the instance is created. They work when added with this patch. I have not tested this with other instance types, and I do not know much about this script and EC2 deployment in general. Maybe this code needs to depend on the instance type. The requirement for this mapping is described in the AWS docs at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#InstanceStore_UsageScenarios "For M3 instances, you must specify instance store volumes in the block device mapping for the instance. When you launch an M3 instance, we ignore any instance store volumes specified in the block device mapping for the AMI." Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #2081 from darabos/patch-1 and squashes the following commits: 1ceb2c8 [Daniel Darabos] Use %d string interpolation instead of {}. a1854d7 [Daniel Darabos] Only specify ephemeral device mapping for M3. e0d9e37 [Daniel Darabos] Create ephemeral device mapping based on get_num_disks(). 6b116a6 [Daniel Darabos] Add SSDs to block device mapping	2014-09-01 22:16:42 -07:00
Reynold Xin	db160676c5	[SPARK-3135] Avoid extra mem copy in TorrentBroadcast via ByteArrayChunkOutputStream This also enables supporting broadcast variables larger than 2G. Author: Reynold Xin <rxin@apache.org> Closes #2054 from rxin/ByteArrayChunkOutputStream and squashes the following commits: 618d9c8 [Reynold Xin] Code review. 93f5a51 [Reynold Xin] Added comments. ee88e73 [Reynold Xin] to -> until bbd1cb1 [Reynold Xin] Renamed a variable. 36f4d01 [Reynold Xin] Sort imports. 8f1a8eb [Reynold Xin] [SPARK-3135] Created ByteArrayChunkOutputStream and used it to avoid memory copy in TorrentBroadcast.	2014-09-01 20:32:31 -07:00
Patrick Wendell	1f98add926	MAINTENANCE: Automated closing of pull requests. This commit exists to close the following pull requests on Github: Closes #1696 (close requested by 'pwendell') Closes #1384 (close requested by 'pwendell') Closes #845 (close requested by 'pwendell') Closes #81 (close requested by 'pwendell') Closes #1528 (close requested by 'pwendell') Closes #1018 (close requested by 'pwendell')	2014-09-01 19:25:54 -07:00
scwf	725715cbf3	[SPARK-3010] fix redundant conditional https://issues.apache.org/jira/browse/SPARK-3010 this pr is to fix redundant conditional in spark, such as 1. private[spark] def codegenEnabled: Boolean = if (getConf(CODEGEN_ENABLED, "false") == "true") true else false 2. x => if (x == 2) true else false ... Author: scwf <wangfei1@huawei.com> Author: wangfei <wangfei_hello@126.com> Closes #1992 from scwf/condition and squashes the following commits: b2a044a [scwf] merge SecurityManager e16239c [scwf] fix confilct 6811401 [scwf] fix merge confilct 0824df4 [scwf] Merge branch 'master' of https://github.com/apache/spark into patch-4 e274515 [scwf] fix redundant conditions d032bf9 [wangfei] [SQL]Excess judgment	2014-08-31 14:02:11 -07:00
Nicholas Chammas	c567a68a59	[Spark QA] only check code files for new classes Look only at code files (`.py`, `.java`, and `.scala`) for new classes. Should get rid of false alarms like [the one reported here](https://github.com/apache/spark/pull/2014#issuecomment-52912040). Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2184 from nchammas/jenkins-ignore-noncode and squashes the following commits: 33786ac [Nicholas Chammas] break up long line 3f91a14 [Nicholas Chammas] rename array of source files 8b82a26 [Nicholas Chammas] [Spark QA] only check code files for new classes	2014-08-30 21:11:48 -07:00
Patrick Wendell	9b8c2287bb	MAINTENANCE: Automated closing of pull requests. This commit exists to close the following pull requests on Github: Closes #1922 (close requested by 'JoshRosen') Closes #1356 (close requested by 'pwendell') Closes #1698 (close requested by 'mengxr') Closes #254 (close requested by 'mateiz') Closes #2135 (close requested by 'andrewor14')	2014-08-30 20:58:48 -07:00
Holden Karau	ba78383bac	SPARK-3318: Documentation update in addFile on how to use SparkFiles.get Rather than specifying the path to SparkFiles we need to use the filename. Author: Holden Karau <holden@pigscanfly.ca> Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits: a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well 0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path	2014-08-30 16:58:17 -07:00
Marcelo Vanzin	b6cf134817	[SPARK-2889] Create Hadoop config objects consistently. Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config. This change modifies most places to use SparkHadoopUtil to initialize configs, and make that method do the translation that previously was only done inside SparkContext. The places that were not changed fall in one of the following categories: - Test code where this doesn't really matter - Places deep in the code where plumbing SparkConf would be too difficult for very little gain - Default values for arguments - since the caller can provide their own config in that case Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #1843 from vanzin/SPARK-2889 and squashes the following commits: 52daf35 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 f179013 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 51e71cf [Marcelo Vanzin] Add test to ensure that overriding Yarn configs works. 53f9506 [Marcelo Vanzin] Add DeveloperApi annotation. 3d345cb [Marcelo Vanzin] Restore old method for backwards compat. fc45067 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 0ac3fdf [Marcelo Vanzin] Merge branch 'master' into SPARK-2889 3f26760 [Marcelo Vanzin] Compilation fix. f16cadd [Marcelo Vanzin] Initialize config in SparkHadoopUtil. b8ab173 [Marcelo Vanzin] Update Utils API to take a Configuration argument. 1e7003f [Marcelo Vanzin] Replace explicit Configuration instantiation with SparkHadoopUtil.	2014-08-30 14:48:07 -07:00
Reynold Xin	d90434c035	Manually close old pull requests Closes #1824	2014-08-29 23:21:38 -07:00
Raymond Liu	acea92806c	[SPARK-2288] Hide ShuffleBlockManager behind ShuffleManager By Hiding the shuffleblockmanager behind Shufflemanager, we decouple the shuffle data's block mapping management work from Diskblockmananger. This give a more clear interface and more easy for other shuffle manager to implement their own block management logic. the jira ticket have more details. Author: Raymond Liu <raymond.liu@intel.com> Closes #1241 from colorant/shuffle and squashes the following commits: 0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager	2014-08-29 23:05:18 -07:00
Kousuke Saruta	7e662af332	[SPARK-3305] Remove unused import from UI classes. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2200 from sarutak/SPARK-3305 and squashes the following commits: 3cbd6ee [Kousuke Saruta] Removed unused import from classes related to UI	2014-08-29 22:52:32 -07:00
Patrick Wendell	a004a8d879	BUILD: Adding back CDH4 as per user requests	2014-08-29 22:24:35 -07:00
Cheng Lian	32b18dd52c	[SPARK-3320][SQL] Made batched in-memory column buffer building work for SchemaRDDs with empty partitions Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2213 from liancheng/spark-3320 and squashes the following commits: 45a0139 [Cheng Lian] Fixed typo in InMemoryColumnarQuerySuite f67067d [Cheng Lian] Fixed SPARK-3320	2014-08-29 18:16:47 -07:00
wangfei	13901764f4	[SPARK-3296][mllib] spark-example should be run-example in head notation of DenseKMeans and SparseNaiveBayes `./bin/spark-example` should be `./bin/run-example` in DenseKMeans and SparseNaiveBayes Author: wangfei <wangfei_hello@126.com> Closes #2193 from scwf/run-example and squashes the following commits: 207eb3a [wangfei] spark-example should be run-example 27a8999 [wangfei] ./bin/spark-example should be ./bin/run-example	2014-08-29 17:37:15 -07:00

... 2 3 4 5 6 ...

8184 commits