ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Davies Liu	abf588f47a	[SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD 1. broadcast is triggle unexpected 2. fd is leaked in JVM (also leak in parallelize()) 3. broadcast is not unpersisted in JVM after RDD is not be used any more. cc JoshRosen , sorry for these stupid bugs. Author: Davies Liu <davies.liu@gmail.com> Closes #2603 from davies/fix_broadcast and squashes the following commits: 080a743 [Davies Liu] fix bugs in broadcast large closure of RDD	2014-10-01 11:21:34 -07:00
Masayoshi TSUZUKI	0bfd3afb00	[SPARK-3757] mvn clean doesn't delete some files Added directory to be deleted into maven-clean-plugin in pom.xml. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #2613 from tsudukim/feature/SPARK-3757 and squashes the following commits: 8804bfc [Masayoshi TSUZUKI] Modified indent. 67c7171 [Masayoshi TSUZUKI] [SPARK-3757] mvn clean doesn't delete some files	2014-10-01 08:55:04 -07:00
Reynold Xin	3888ee2f38	[SPARK-3748] Log thread name in unit test logs Thread names are useful for correlating failures. Author: Reynold Xin <rxin@apache.org> Closes #2600 from rxin/log4j and squashes the following commits: 83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs	2014-10-01 01:03:49 -07:00
Joseph K. Bradley	7bf6cc9701	[SPARK-3751] [mllib] DecisionTree: example update + print options DecisionTreeRunner functionality additions: * Allow user to pass in a test dataset * Do not print full model if the model is too large. As part of this, modify DecisionTreeModel and RandomForestModel to allow printing less info. Proposed updates: * toString: prints model summary * toDebugString: prints full model (named after RDD.toDebugString) Similar update to Python API: * __repr__() now prints a model summary * toDebugString() now prints the full model CC: mengxr chouqin manishamde codedeft Small update (whomever can take a look). Thanks! Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #2604 from jkbradley/dtrunner-update and squashes the following commits: b2b3c60 [Joseph K. Bradley] re-added python sql doc test, temporarily removed before 07b1fae [Joseph K. Bradley] repr() now prints a model summary toDebugString() now prints the full model 1d0d93d [Joseph K. Bradley] Updated DT and RF to print less when toString is called. Added toDebugString for verbose printing. 22eac8c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update e007a95 [Joseph K. Bradley] Updated DecisionTreeRunner to accept a test dataset.	2014-10-01 01:03:24 -07:00
Reynold Xin	eb43043f41	[SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get result for a specific task Author: Reynold Xin <rxin@apache.org> Closes #2599 from rxin/SPARK-3747 and squashes the following commits: a74c04d [Reynold Xin] Added a line of comment explaining NonFatal 0e8d44c [Reynold Xin] [SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get result for a specific task	2014-10-01 00:29:14 -07:00
Davies Liu	c5414b6818	[SPARK-3478] [PySpark] Profile the Python tasks This patch add profiling support for PySpark, it will show the profiling results before the driver exits, here is one example: ``` ============================================================ Profile of RDD<id=3> ============================================================ 5146507 function calls (5146487 primitive calls) in 71.094 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 5144576 68.331 0.000 68.331 0.000 statcounter.py:44(merge) 20 2.735 0.137 71.071 3.554 statcounter.py:33(__init__) 20 0.017 0.001 0.017 0.001 {cPickle.dumps} 1024 0.003 0.000 0.003 0.000 t.py:16(<lambda>) 20 0.001 0.000 0.001 0.000 {reduce} 21 0.001 0.000 0.001 0.000 {cPickle.loads} 20 0.001 0.000 0.001 0.000 copy_reg.py:95(_slotnames) 41 0.001 0.000 0.001 0.000 serializers.py:461(read_int) 40 0.001 0.000 0.002 0.000 serializers.py:179(_batched) 62 0.000 0.000 0.000 0.000 {method 'read' of 'file' objects} 20 0.000 0.000 71.072 3.554 rdd.py:863(<lambda>) 20 0.000 0.000 0.001 0.000 serializers.py:198(load_stream) 40/20 0.000 0.000 71.072 3.554 rdd.py:2093(pipeline_func) 41 0.000 0.000 0.002 0.000 serializers.py:130(load_stream) 40 0.000 0.000 71.072 1.777 rdd.py:304(func) 20 0.000 0.000 71.094 3.555 worker.py:82(process) ``` Also, use can show profile result manually by `sc.show_profiles()` or dump it into disk by `sc.dump_profiles(path)`, such as ```python >>> sc._conf.set("spark.python.profile", "true") >>> rdd = sc.parallelize(range(100)).map(str) >>> rdd.count() 100 >>> sc.show_profiles() ============================================================ Profile of RDD<id=1> ============================================================ 284 function calls (276 primitive calls) in 0.001 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream) 4 0.000 0.000 0.000 0.000 {reduce} 12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func) 4 0.000 0.000 0.000 0.000 {cPickle.loads} 4 0.000 0.000 0.000 0.000 {cPickle.dumps} 104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>) 8 0.000 0.000 0.000 0.000 serializers.py:461(read_int) 12 0.000 0.000 0.000 0.000 rdd.py:303(func) ``` The profiling is disabled by default, can be enabled by "spark.python.profile=true". Also, users can dump the results into disks automatically for future analysis, by "spark.python.profile.dump=path_to_dump" This is bugfix of #2351 cc JoshRosen Author: Davies Liu <davies.liu@gmail.com> Closes #2556 from davies/profiler and squashes the following commits: e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler 858e74c [Davies Liu] compatitable with python 2.6 7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles() 2b0daf2 [Davies Liu] fix docs 7a56c24 [Davies Liu] bugfix cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler 116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler 09d02c3 [Davies Liu] Merge branch 'master' into profiler c23865c [Davies Liu] Merge branch 'master' into profiler 15d6f18 [Davies Liu] add docs for two configs dadee1a [Davies Liu] add docs string and clear profiles after show or dump 4f8309d [Davies Liu] address comment, add tests 0a5b6eb [Davies Liu] fix Python UDF 4b20494 [Davies Liu] add profile for python	2014-09-30 18:24:57 -07:00
Xiangrui Meng	d75496b189	[SPARK-3701][MLLIB] update python linalg api and small fixes 1. doc updates 2. simple checks on vector dimensions 3. use column major for matrices davies jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #2548 from mengxr/mllib-py-clean and squashes the following commits: 6dce2df [Xiangrui Meng] address comments 116b5db [Xiangrui Meng] use np.dot instead of array.dot 75f2fcc [Xiangrui Meng] fix python style fefce00 [Xiangrui Meng] better check of vector size with more tests 067ef71 [Xiangrui Meng] majored -> major ef853f9 [Xiangrui Meng] update python linalg api and small fixes	2014-09-30 17:10:36 -07:00
Reynold Xin	6c696d7da6	Remove compiler warning from TaskContext change. Author: Reynold Xin <rxin@apache.org> Closes #2602 from rxin/warning and squashes the following commits: 130186b [Reynold Xin] Remove compiler warning from TaskContext change.	2014-09-30 15:55:04 -07:00
Sean Owen	8764fe368b	SPARK-3744 [STREAMING] FlumeStreamSuite will fail during port contention Since it looked quite easy, I took the liberty of making a quick PR that just uses `Utils.startServiceOnPort` to fix this. It works locally for me. Author: Sean Owen <sowen@cloudera.com> Closes #2601 from srowen/SPARK-3744 and squashes the following commits: ddc9319 [Sean Owen] Avoid port contention in tests by retrying several ports for Flume stream	2014-09-30 15:18:51 -07:00
Nicholas Chammas	d3a3840e07	[Build] Post commit hash with timeout messages [By request](https://github.com/apache/spark/pull/2588#issuecomment-57266871), and because it also makes sense. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2597 from nchammas/timeout-commit-hash and squashes the following commits: 3d90714 [Nicholas Chammas] Revert "testing: making timeout 1 minute" 2353c95 [Nicholas Chammas] testing: making timeout 1 minute e3a477e [Nicholas Chammas] post commit hash with timeout	2014-09-30 13:28:41 -07:00
shane knapp	a01a30927d	SPARK-3745 - fix check-license to properly download and check jar for details, see: https://issues.apache.org/jira/browse/SPARK-3745 Author: shane knapp <incomplete@gmail.com> Closes #2596 from shaneknapp/SPARK-3745 and squashes the following commits: c95eea9 [shane knapp] SPARK-3745 - fix check-license to properly download and check jar	2014-09-30 13:11:25 -07:00
Sean Owen	ab6dd80ba0	[SPARK-3356] [DOCS] Document when RDD elements' ordering within partitions is nondeterministic As suggested by mateiz , and because it came up on the mailing list again last week, this attempts to document that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods. Suggestions welcome about the wording, or other methods that need a note. Author: Sean Owen <sowen@cloudera.com> Closes #2508 from srowen/SPARK-3356 and squashes the following commits: b7c96fd [Sean Owen] Undo change to programming guide ad4aeec [Sean Owen] Don't mention ordering in partition-wise methods, reword description of ordering for zip methods per review, and add similar note to programming guide, which mentions groupByKey (but not zip methods) fce943b [Sean Owen] Note that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods	2014-09-30 11:15:38 -07:00
Patrick Wendell	157e7d0f62	HOTFIX: Ignore flaky tests in YARN	2014-09-30 09:46:58 -07:00
Patrick Wendell	b64fcbd2dc	Revert "[SPARK-3007][SQL]Add Dynamic Partition support to Spark Sql hive" This reverts commit `0bbe7faeff`.	2014-09-30 09:43:46 -07:00
Josh Rosen	b167a8c7e7	[SPARK-3734] DriverRunner should not read SPARK_HOME from submitter's environment When using spark-submit in `cluster` mode to submit a job to a Spark Standalone cluster, if the JAVA_HOME environment variable was set on the submitting machine then DriverRunner would attempt to use the submitter's JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), causing the driver to fail unless the submitter and worker had the same Java location. This commit fixes this by reading JAVA_HOME from sys.env instead of command.environment. Author: Josh Rosen <joshrosen@apache.org> Closes #2586 from JoshRosen/SPARK-3734 and squashes the following commits: e9513d9 [Josh Rosen] [SPARK-3734] DriverRunner should not read SPARK_HOME from submitter's environment.	2014-09-29 23:36:10 -07:00
Reynold Xin	de700d3177	[SPARK-3709] Executors don't always report broadcast block removal properly back to the driver The problem was that the 2nd argument in RemoveBroadcast is not tellMaster! It is "removeFromDriver". Basically when removeFromDriver is not true, we don't report broadcast block removal back to the driver, and then other executors mistakenly think that the executor would still have the block, and try to fetch from it. cc @tdas Author: Reynold Xin <rxin@apache.org> Closes #2588 from rxin/debug and squashes the following commits: 6dab2e3 [Reynold Xin] Don't log random messages. f430686 [Reynold Xin] Always report broadcast removal back to master. 2a13f70 [Reynold Xin] iii	2014-09-29 23:17:53 -07:00
Reynold Xin	6b79bfb425	[SPARK-3613] Record only average block size in MapStatus for large stages This changes the way we send MapStatus from executors back to driver for large stages (>2000 tasks). For large stages, we no longer send one byte per block. Instead, we just send the average block size. This makes large jobs (tens of thousands of tasks) much more reliable since the driver no longer sends huge amount of data. Author: Reynold Xin <rxin@apache.org> Closes #2470 from rxin/mapstatus and squashes the following commits: 822ff54 [Reynold Xin] Code review feedback. 3b86f56 [Reynold Xin] Added MimaExclude. f89d182 [Reynold Xin] Fixed a bug in MapStatus 6a0401c [Reynold Xin] [SPARK-3613] Record only average block size in MapStatus for large stages.	2014-09-29 22:56:22 -07:00
Reynold Xin	210404a561	Minor cleanup of code. Author: Reynold Xin <rxin@apache.org> Closes #2581 from rxin/minor-cleanup and squashes the following commits: 736a91b [Reynold Xin] Minor cleanup of code.	2014-09-29 21:53:21 -07:00
oded	dc30e4504a	Fixed the condition in StronglyConnectedComponents Issue: SPARK-3635 Author: oded <oded@HP-DV6.c4internal.c4-security.com> Closes #2486 from odedz/master and squashes the following commits: dd7890a [oded] Fixed the condition in StronglyConnectedComponents Issue: SPARK-3635	2014-09-29 18:05:53 -07:00
yingjieMiao	51229ff7f4	[graphX] GraphOps: random pick vertex bug When `numVertices > 50`, probability is set to 0. This would cause infinite loop. Author: yingjieMiao <yingjie@42go.com> Closes #2553 from yingjieMiao/graphx and squashes the following commits: 6adf3c8 [yingjieMiao] [graphX] GraphOps: random pick vertex bug	2014-09-29 18:01:27 -07:00
baishuo(白硕)	0bbe7faeff	[SPARK-3007][SQL]Add Dynamic Partition support to Spark Sql hive a new PR base on new master. changes are the same as https://github.com/apache/spark/pull/1919 Author: baishuo(白硕) <vc_java@hotmail.com> Author: baishuo <vc_java@hotmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2226 from baishuo/patch-3007 and squashes the following commits: e69ce88 [Cheng Lian] Adds tests to verify dynamic partitioning folder layout b20a3dc [Cheng Lian] Addresses @yhuai's comments 096bbbc [baishuo(白硕)] Merge pull request #1 from liancheng/refactor-dp 1093c20 [Cheng Lian] Adds more tests 5004542 [Cheng Lian] Minor refactoring fae9eff [Cheng Lian] Refactors InsertIntoHiveTable to a Command 528e84c [Cheng Lian] Fixes typo in test name, regenerated golden answer files c464b26 [Cheng Lian] Refactors dynamic partitioning support 5033928 [baishuo] pass check style 2201c75 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name b47c9bf [baishuo] modify according micheal's advice c3ab36d [baishuo] modify for some bad indentation 7ce2d9f [baishuo] modify code to pass scala style checks 37c1c43 [baishuo] delete a empty else branch 66e33fc [baishuo] do a little modify 88d0110 [baishuo] update file after test a3961d9 [baishuo(白硕)] Update Cast.scala f7467d0 [baishuo(白硕)] Update InsertIntoHiveTable.scala c1a59dd [baishuo(白硕)] Update Cast.scala 0e18496 [baishuo(白硕)] Update HiveQuerySuite.scala 60f70aa [baishuo(白硕)] Update InsertIntoHiveTable.scala 0a50db9 [baishuo(白硕)] Update HiveCompatibilitySuite.scala 491c7d0 [baishuo(白硕)] Update InsertIntoHiveTable.scala a2374a8 [baishuo(白硕)] Update InsertIntoHiveTable.scala 701a814 [baishuo(白硕)] Update SparkHadoopWriter.scala dc24c41 [baishuo(白硕)] Update HiveQl.scala	2014-09-29 15:51:55 -07:00
Reynold Xin	e43c72fe04	Add more debug message for ManagedBuffer This is to help debug the error reported at http://apache-spark-user-list.1001560.n3.nabble.com/SQL-queries-fail-in-1-2-0-SNAPSHOT-td15327.html Author: Reynold Xin <rxin@apache.org> Closes #2580 from rxin/buffer-debug and squashes the following commits: 5814292 [Reynold Xin] Logging close() in case close() fails. 323dfec [Reynold Xin] Add more debug message.	2014-09-29 12:38:24 -07:00
jerryshao	dab1b0ae29	[SPARK-3032][Shuffle] Fix key comparison integer overflow introduced sorting exception Previous key comparison in `ExternalSorter` will get wrong sorting result or exception when key comparison overflows, details can be seen in [SPARK-3032](https://issues.apache.org/jira/browse/SPARK-3032). Here fix this and add a unit test to prove it. Author: jerryshao <saisai.shao@intel.com> Closes #2514 from jerryshao/SPARK-3032 and squashes the following commits: 6f3c302 [jerryshao] Improve the unit test according to comments 01911e6 [jerryshao] Change the test to show the contract violate exception 83acb38 [jerryshao] Minor changes according to comments fa2a08f [jerryshao] Fix key comparison integer overflow introduced sorting exception	2014-09-29 11:25:32 -07:00
Reza Zadeh	587a0cd7ed	[MLlib] [SPARK-2885] DIMSUM: All-pairs similarity # All-pairs similarity via DIMSUM Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach. Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold. The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities. ![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png) [1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467 [2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082 # Testing Tests for all invocations included. Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them. Author: Reza Zadeh <rizlar@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits: 404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2 ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero. 3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity 9fe17c0 [Xiangrui Meng] organize imports 2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2 254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 f2947e4 [Xiangrui Meng] some optimization 3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2 0e4eda4 [Reza Zadeh] Use partition index for RNG 251bb9c [Reza Zadeh] Documentation 25e9d0d [Reza Zadeh] Line length for style fb296f6 [Reza Zadeh] renamed to normL1 and normL2 3764983 [Reza Zadeh] Documentation e9c6791 [Reza Zadeh] New interface and documentation 613f261 [Reza Zadeh] Column magnitude summary 75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle 0f12ade [Reza Zadeh] Style changes eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix 41e8ece [Reza Zadeh] style changes 139c8e1 [Reza Zadeh] Syntax changes 029aa9c [Reza Zadeh] javadoc and new test 75edb25 [Reza Zadeh] All tests passing! 05e59b8 [Reza Zadeh] Add test 502ce52 [Reza Zadeh] new interface 654c4fb [Reza Zadeh] default methods 3726ca9 [Reza Zadeh] Remove MatrixAlgebra 6bebabb [Reza Zadeh] remove changes to MatrixSuite 5b8cd7d [Reza Zadeh] Initial files	2014-09-29 11:15:09 -07:00
Nicholas Chammas	aedd251c54	[EC2] Sort long, manually-inputted dictionaries Similar to the work done in #2571, this PR just sorts the remaining manually-inputted dicts in the EC2 script so they are easier to maintain. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2578 from nchammas/ec2-dict-sort and squashes the following commits: f55c692 [Nicholas Chammas] sort long dictionaries	2014-09-29 10:45:08 -07:00
Zhang, Liye	657bdff41a	[CORE] Bugfix: LogErr format in DAGScheduler.scala Author: Zhang, Liye <liye.zhang@intel.com> Closes #2572 from liyezhang556520/DAGLogErr and squashes the following commits: 5be2491 [Zhang, Liye] Bugfix: LogErr format in DAGScheduler.scala	2014-09-29 01:13:15 -07:00
Nicholas Chammas	1651cc117d	[EC2] Cleanup Python parens and disk dict Minor fixes: * Remove unnecessary parens (Python style) * Sort `disks_by_instance` dict and remove duplicate `t1.micro` key Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2571 from nchammas/ec2-polish and squashes the following commits: 9d203d5 [Nicholas Chammas] paren and dict cleanup	2014-09-28 21:55:09 -07:00
Joseph K. Bradley	0dc2b6361d	[SPARK-1545] [mllib] Add Random Forests This PR adds RandomForest to MLlib. The implementation is basic, and future performance optimizations will be important. (Note: RFs = Random Forests.) # Overview ## RandomForest * trains multiple trees at once to reduce the number of passes over the data * allows feature subsets at each node * uses a queue of nodes instead of fixed groups for each level This implementation is based an implementation by manishamde and the [Alpine Labs Sequoia Forest](https://github.com/AlpineNow/SparkML2) by codedeft (in particular, the TreePoint, BaggedPoint, and node queue implementations). Thank you for your inputs! ## Testing Correctness: This has been tested for correctness with the test suites and with DecisionTreeRunner on example datasets. Performance: This has been performance tested using [this branch of spark-perf](https://github.com/jkbradley/spark-perf/tree/rfs). Results below. ### Regression tests for DecisionTree Summary: For training 1 tree, there are small regressions, especially from feature subsampling. In the table below, each row is a single (random) dataset. The 2 different sets of result columns are for 2 different RF implementations: * (numTrees): This is from an earlier commit, after implementing RandomForest to train multiple trees at once. It does not include any code for feature subsampling. * (feature subsets): This is from this current PR's code, after implementing feature subsampling. These tests were to identify regressions in DecisionTree, so they are training 1 tree with all of the features (i.e., no feature subsampling). These were run on an EC2 cluster with 15 workers, training 1 tree with maxDepth = 5 (= 6 levels). Speedup values < 1 indicate slowdowns from the old DecisionTree implementation. numInstances \| numFeatures \| runtime (sec) \| speedup \| runtime (sec) \| speedup ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| \| (numTrees) \| (numTrees) \| (feature subsets) \| (feature subsets) 20000 \| 100 \| 4.051 \| 1.044433473 \| 4.478 \| 0.9448414471 20000 \| 500 \| 8.472 \| 1.104461756 \| 9.315 \| 1.004508857 20000 \| 1500 \| 19.354 \| 1.05854087 \| 20.863 \| 0.9819776638 20000 \| 3500 \| 43.674 \| 1.072033704 \| 45.887 \| 1.020332556 200000 \| 100 \| 4.196 \| 1.171830315 \| 4.848 \| 1.014232673 200000 \| 500 \| 8.926 \| 1.082791844 \| 9.771 \| 0.989151571 200000 \| 1500 \| 20.58 \| 1.068415938 \| 22.134 \| 0.9934038131 200000 \| 3500 \| 48.043 \| 1.075203464 \| 52.249 \| 0.9886505005 2000000 \| 100 \| 4.944 \| 1.01355178 \| 5.796 \| 0.8645617667 2000000 \| 500 \| 11.11 \| 1.016831683 \| 12.482 \| 0.9050632911 2000000 \| 1500 \| 31.144 \| 1.017852556 \| 35.274 \| 0.8986789136 2000000 \| 3500 \| 79.981 \| 1.085382778 \| 101.105 \| 0.8586123337 20000000 \| 100 \| 8.304 \| 0.9270231214 \| 9.073 \| 0.8484514494 20000000 \| 500 \| 28.174 \| 1.083268262 \| 34.236 \| 0.8914592826 20000000 \| 1500 \| 143.97 \| 0.9579634646 \| 159.275 \| 0.8659111599 ### Tests for forests I have run other tests with numTrees=10 and with sqrt(numFeatures), and those indicate that multi-model training and feature subsets can speed up training for forests, especially when training deeper trees. # Details on specific classes ## Changes to DecisionTree * Main train() method is now in RandomForest. * findBestSplits() is no longer needed. (It split levels into groups, but we now use a queue of nodes.) * Many small changes to support RFs. (Note: These methods should be moved to RandomForest.scala in a later PR, but are in DecisionTree.scala to make code comparison easier.) ## RandomForest * Main train() method is from old DecisionTree. * selectNodesToSplit: Note that it selects nodes and feature subsets jointly to track memory usage. ## RandomForestModel * Stores an Array[DecisionTreeModel] * Prediction: * For classification, most common label. For regression, mean. * We could support other methods later. ## examples/.../DecisionTreeRunner * This now takes numTrees and featureSubsetStrategy, to support RFs. ## DTStatsAggregator * 2 types of functionality (w/ and w/o subsampling features): These require different indexing methods. (We could treat both as subsampling, but this is less efficient DTStatsAggregator is now abstract, and 2 child classes implement these 2 types of functionality. ## impurities * These now take instance weights. ## Node * Some vals changed to vars. * This is unfortunately a public API change (DeveloperApi). This could be avoided by creating a LearningNode struct, but would be awkward. ## RandomForestSuite Please let me know if there are missing tests! ## BaggedPoint This wraps TreePoint and holds bootstrap weights/counts. # Design decisions * BaggedPoint: BaggedPoint is separate from TreePoint since it may be useful for other bagging algorithms later on. * RandomForest public API: What options should be easily supported by the train* methods? Should ALL options be in the Java-friendly constructors? Should there be a constructor taking Strategy? * Feature subsampling options: What options should be supported? scikit-learn supports the same options, except for "onethird." One option would be to allow users to specific fractions ("0.1"): the current options could be supported, and any unrecognized values would be parsed as Doubles in [0,1]. * Splits and bins are computed before bootstrapping, so all trees use the same discretization. * One queue, instead of one queue per tree. CC: mengxr manishamde codedeft chouqin Please let me know if you have suggestions---thanks! Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Author: qiping.lqp <qiping.lqp@alibaba-inc.com> Author: chouqin <liqiping1991@gmail.com> Closes #2435 from jkbradley/rfs-new and squashes the following commits: c694174 [Joseph K. Bradley] Fixed typo cc59d78 [Joseph K. Bradley] fixed imports e25909f [Joseph K. Bradley] Simplified node group maps. Specifically, created NodeIndexInfo to store node index in agg and feature subsets, and no longer create extra maps in findBestSplits fbe9a1e [Joseph K. Bradley] Changed default featureSubsetStrategy to be sqrt for classification, onethird for regression. Updated docs with references. ef7c293 [Joseph K. Bradley] Updates based on code review. Most substantial changes: * Simplified DTStatsAggregator * Made RandomForestModel.trees public * Added test for regression to RandomForestSuite 593b13c [Joseph K. Bradley] Fixed bug in metadata for computing log2(num features). Now it checks >= 1. a1a08df [Joseph K. Bradley] Removed old comments 866e766 [Joseph K. Bradley] Changed RandomForestSuite randomized tests to use multiple fixed random seeds. ff8bb96 [Joseph K. Bradley] removed usage of null from RandomForest and replaced with Option bf1a4c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new 6b79c07 [Joseph K. Bradley] Added RandomForestSuite, and fixed small bugs, style issues. d7753d4 [Joseph K. Bradley] Added numTrees and featureSubsetStrategy to DecisionTreeRunner (to support RandomForest). Fixed bugs so that RandomForest now runs. 746d43c [Joseph K. Bradley] Implemented feature subsampling. Tested DecisionTree but not RandomForest. 6309d1d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new. Added RandomForestModel.toString b7ae594 [Joseph K. Bradley] Updated docs. Small fix for bug which does not cause errors: No longer allocate unused child nodes for leaf nodes. 121c74e [Joseph K. Bradley] Basic random forests are implemented. Random features per node not yet implemented. Test suite not implemented. 325d18a [Joseph K. Bradley] Merge branch 'chouqin-dt-preprune' into rfs-new 4ef9bf1 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new 61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy. a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune 6da8571 [Joseph K. Bradley] RFs partly implemented, not done yet eddd1eb [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs-new 5c4ac33 [Joseph K. Bradley] Added check in Strategy to make sure minInstancesPerNode >= 1 0dd4d87 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160 95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune 19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune f1d11d1 [chouqin] fix typo c7ebaf1 [chouqin] fix typo 39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py 306120f [Joseph K. Bradley] Fixed typo in DecisionTreeModel.scala doc eaa1dcf [Joseph K. Bradley] Added topNode doc in DecisionTree and scalastyle fix d4d7864 [Joseph K. Bradley] Marked Node.build as deprecated d4dbb99 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-spark-3160 1a8f0ad [Joseph K. Bradley] Eliminated pre-allocated nodes array in main train() method. * Nodes are constructed and added to the tree structure as needed during training. 0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1 2ab763b [Joseph K. Bradley] Simplifications to DecisionTree code: efcc736 [qiping.lqp] fix bug 10b8012 [qiping.lqp] fix style 6728fad [qiping.lqp] minor fix: remove empty lines bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune cadd569 [qiping.lqp] add api docs 46b891f [qiping.lqp] fix bug e72c7e4 [qiping.lqp] add comments 845c6fa [qiping.lqp] fix style f195e83 [qiping.lqp] fix style 987cbf4 [qiping.lqp] fix bug ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree	2014-09-28 21:44:50 -07:00
Reynold Xin	f350cd3070	[SPARK-3543] TaskContext remaining cleanup work. Author: Reynold Xin <rxin@apache.org> Closes #2560 from rxin/TaskContext and squashes the following commits: 9eff95a [Reynold Xin] [SPARK-3543] remaining cleanup work.	2014-09-28 20:32:54 -07:00
Jim Lim	25164a89dd	SPARK-2761 refactor #maybeSpill into Spillable Moved `#maybeSpill` in ExternalSorter and EAOM into `Spillable`. Author: Jim Lim <jim@quixey.com> Closes #2416 from jimjh/SPARK-2761 and squashes the following commits: cf8be9a [Jim Lim] SPARK-2761 fix documentation, reorder code f94d522 [Jim Lim] SPARK-2761 refactor Spillable to simplify sig e75a24e [Jim Lim] SPARK-2761 use protected over protected[this] 7270e0d [Jim Lim] SPARK-2761 refactor #maybeSpill into Spillable	2014-09-28 19:04:24 -07:00
Reynold Xin	8e874185ed	Revert "[SPARK-1021] Defer the data-driven computation of partition bounds in so..." This reverts commit `2d972fd84a`. The commit was hanging correlationoptimizer14.	2014-09-28 18:33:11 -07:00
WangTaoTheTonic	1f13a40ccd	[SPARK-3715][Docs]minor typo https://issues.apache.org/jira/browse/SPARK-3715 Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #2567 from WangTaoTheTonic/minortypo and squashes the following commits: 9cc3f7a [WangTaoTheTonic] minor typo	2014-09-28 18:30:13 -07:00
William Benton	6918012d0f	SPARK-3699: SQL and Hive console tasks now clean up appropriately The sbt tasks sql/console and hive/console will now `stop()` the `SparkContext` upon exit. Previously, they left an ugly stack trace when quitting. Author: William Benton <willb@redhat.com> Closes #2547 from willb/consoleCleanup and squashes the following commits: d5e431f [William Benton] SQL and Hive console tasks now clean up.	2014-09-28 01:01:27 -07:00
Reynold Xin	66e1c40c67	Minor fix for the previous commit.	2014-09-27 22:18:02 -07:00
Dale	9966d1a8aa	SPARK-CORE [SPARK-3651] Group common CoarseGrainedSchedulerBackend variables together from [SPARK-3651] In CoarseGrainedSchedulerBackend, we have: private val executorActor = new HashMap[String, ActorRef] private val executorAddress = new HashMap[String, Address] private val executorHost = new HashMap[String, String] private val freeCores = new HashMap[String, Int] private val totalCores = new HashMap[String, Int] We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299. Author: Dale <tigerquoll@outlook.com> Closes #2533 from tigerquoll/SPARK-3651 and squashes the following commits: d1be0a9 [Dale] [SPARK-3651] implemented suggested changes. Changed a reference from executorInfo to executorData to be consistent with other usages 6890663 [Dale] [SPARK-3651] implemented suggested changes 7d671cf [Dale] [SPARK-3651] Grouped variables under a ExecutorDataObject, and reference them via a map entry as they are all retrieved under the same key	2014-09-27 22:08:10 -07:00
Uri Laserson	248232936e	[SPARK-3389] Add Converter for ease of Parquet reading in PySpark https://issues.apache.org/jira/browse/SPARK-3389 Author: Uri Laserson <laserson@cloudera.com> Closes #2256 from laserson/SPARK-3389 and squashes the following commits: 0ed363e [Uri Laserson] PEP8'd the python file 0b4b380 [Uri Laserson] Moved converter to examples and added python example eecf4dc [Uri Laserson] [SPARK-3389] Add Converter for ease of Parquet reading in PySpark	2014-09-27 21:48:05 -07:00
Reynold Xin	5b922bb458	[SPARK-3543] Clean up Java TaskContext implementation. This addresses some minor issues in https://github.com/apache/spark/pull/2425 Author: Reynold Xin <rxin@apache.org> Closes #2557 from rxin/TaskContext and squashes the following commits: a51e5f6 [Reynold Xin] [SPARK-3543] Clean up Java TaskContext implementation.	2014-09-27 14:46:00 -07:00
Davies Liu	0d8cdf0ede	[SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD Currently, the schema of object in ArrayType or MapType is attached lazily, it will have better performance but introduce issues while serialization or accessing nested objects. This patch will apply schema to the objects of ArrayType or MapType immediately when accessing them, will be a little bit slower, but much robust. Author: Davies Liu <davies.liu@gmail.com> Closes #2526 from davies/nested and squashes the following commits: 2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD	2014-09-27 12:21:37 -07:00
Michael Armbrust	f0c7e19550	[SPARK-3680][SQL] Fix bug caused by eager typing of HiveGenericUDFs Typing of UDFs should be lazy as it is often not valid to call `dataType` on an expression until after all of its children are `resolved`. Author: Michael Armbrust <michael@databricks.com> Closes #2525 from marmbrus/concatBug and squashes the following commits: 5b8efe7 [Michael Armbrust] fix bug with eager typing of udfs	2014-09-27 12:10:16 -07:00
w00228970	0800881051	[SPARK-3676][SQL] Fix hive test suite failure due to diffs in JDK 1.6/1.7 This is a bug in JDK6: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022 this is because jdk get different result to operate ```double```, ```System.out.println(1/500d)``` in different jdk get different result jdk 1.6.0(_31) ---- 0.0020 jdk 1.7.0(_05) ---- 0.002 this leads to HiveQuerySuite failed when generate golden answer in jdk 1.7 and run tests in jdk 1.6, result did not match Author: w00228970 <wangfei1@huawei.com> Closes #2517 from scwf/HiveQuerySuite and squashes the following commits: 0cb5e8d [w00228970] delete golden answer of division-0 and timestamp cast #1 1df3964 [w00228970] Jdk version leads to different query output for Double, this make HiveQuerySuite failed	2014-09-27 12:06:16 -07:00
CrazyJvm	66107f46f3	Docs : use "--total-executor-cores" rather than "--cores" after spark-shell Author: CrazyJvm <crazyjvm@gmail.com> Closes #2540 from CrazyJvm/standalone-core and squashes the following commits: 66d9fc6 [CrazyJvm] use "--total-executor-cores" rather than "--cores" after spark-shell	2014-09-27 09:42:01 -07:00
Reynold Xin	436a7730b6	Minor cleanup to tighten visibility and remove compilation warning. Author: Reynold Xin <rxin@apache.org> Closes #2555 from rxin/cleanup and squashes the following commits: 6add199 [Reynold Xin] Minor cleanup to tighten visibility and remove compilation warning.	2014-09-27 00:57:26 -07:00
Erik Erlandson	2d972fd84a	[SPARK-1021] Defer the data-driven computation of partition bounds in so... ...rtByKey() until evaluation. Author: Erik Erlandson <eerlands@redhat.com> Closes #1689 from erikerlandson/spark-1021-pr and squashes the following commits: 50b6da6 [Erik Erlandson] use standard getIteratorSize in countAsync 4e334a9 [Erik Erlandson] exception mystery fixed by fixing bug in ComplexFutureAction b88b5d4 [Erik Erlandson] tweak async actions to use ComplexFutureAction[T] so they handle RangePartitioner sampling job properly b2b20e8 [Erik Erlandson] Fix bug in exception passing with ComplexFutureAction[T] ca8913e [Erik Erlandson] RangePartition sampling job -> FutureAction 7143f97 [Erik Erlandson] [SPARK-1021] modify range bounds variable to be thread safe ac67195 [Erik Erlandson] [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.	2014-09-26 23:15:10 -07:00
Jeff Steinmetz	9e8ced7847	stop, start and destroy require the EC2_REGION i.e ./spark-ec2 --region=us-west-1 stop yourclustername Author: Jeff Steinmetz <jeffrey.steinmetz@gmail.com> Closes #2473 from jeffsteinmetz/master and squashes the following commits: 7491f2c [Jeff Steinmetz] fix case in EC2 cluster setup documentation bd3d777 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args 2bf4a57 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args 68d8372 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args d2ab6e2 [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args 520e6dc [Jeff Steinmetz] standardized ec2 documenation to use <lower-case> sample args 37fc876 [Jeff Steinmetz] stop, start and destroy require the EC2_REGION	2014-09-26 23:00:40 -07:00
Michael Armbrust	d8a9d1d442	[SPARK-3675][SQL] Allow starting a JDBC server on an existing context Author: Michael Armbrust <michael@databricks.com> Closes #2515 from marmbrus/jdbcExistingContext and squashes the following commits: 7866fad [Michael Armbrust] Allows starting a JDBC server on an existing context.	2014-09-26 22:30:12 -07:00
Michael Armbrust	f0eea76d94	[SQL][DOCS] Clarify that the server is for JDBC and ODBC Author: Michael Armbrust <michael@databricks.com> Closes #2527 from marmbrus/patch-1 and squashes the following commits: a0f9f1c [Michael Armbrust] [SQL][DOCS] Clarify that the server is for JDBC and ODBC	2014-09-26 22:24:34 -07:00
wangfei	0cdcdd2c9d	[Build]remove spark-staging-1030 Since 1.1.0 has published, remove spark-staging-1030. Author: wangfei <wangfei1@huawei.com> Closes #2532 from scwf/patch-2 and squashes the following commits: bc9e00b [wangfei] remove spark-staging-1030	2014-09-26 22:23:49 -07:00
Sarah Gerweck	e976ca236f	Slaves file is now a template. Change `0dc868e` removed the `conf/slaves` file and made it a template like most of the other configuration files. This means you can no longer run `make-distribution.sh` unless you manually create a slaves file to be statically bundled in your distribution, which seems at odds with making it a template file. Author: Sarah Gerweck <sarah.a180@gmail.com> Closes #2549 from sarahgerweck/noMoreSlaves and squashes the following commits: d11d99a [Sarah Gerweck] Slaves file is now a template.	2014-09-26 22:21:50 -07:00
Reynold Xin	a3feaf04dc	Close #2194 .	2014-09-26 21:44:10 -07:00
Prashant Sharma	5e34855cf0	[SPARK-3543] Write TaskContext in Java and expose it through a static accessor. Author: Prashant Sharma <prashant.s@imaginea.com> Author: Shashank Sharma <shashank21j@gmail.com> Closes #2425 from ScrapCodes/SPARK-3543/withTaskContext and squashes the following commits: 8ae414c [Shashank Sharma] CR ee8bd00 [Prashant Sharma] Added internal API in docs comments. ddb8cbe [Prashant Sharma] Moved setting the thread local to where TaskContext is instantiated. a7d5e23 [Prashant Sharma] Added doc comments. edf945e [Prashant Sharma] Code review git add -A f716fd1 [Prashant Sharma] introduced thread local for getting the task context. 333c7d6 [Prashant Sharma] Translated Task context from scala to java.	2014-09-26 21:29:54 -07:00

... 2 3 4 5 6 ...

8441 commits