ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Prashant Sharma	81fec9922c	[SPARK-2452] Create a new valid for each instead of using lineId. Author: Prashant Sharma <prashant@apache.org> Closes #1441 from ScrapCodes/SPARK-2452/multi-statement and squashes the following commits: 26c5c72 [Prashant Sharma] Added a test case. 7e8d28d [Prashant Sharma] SPARK-2452, create a new valid for each instead of using lineId, because Line ids can be same sometimes.	2014-07-22 00:38:26 -07:00
Nicholas Chammas	5d16d5bbfd	[SPARK-2470] PEP8 fixes to PySpark This pull request aims to resolve all outstanding PEP8 violations in PySpark. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Author: nchammas <nicholas.chammas@gmail.com> Closes #1505 from nchammas/master and squashes the following commits: 98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break 9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses 22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses 24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest 7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py 8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py 1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py 95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py 81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py 1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py 7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py 4dd148f [nchammas] Merge pull request #5 from apache/master f7e4581 [Nicholas Chammas] unrelated pep8 fix a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently de7292a [nchammas] Merge pull request #4 from apache/master 2e4fe00 [nchammas] Merge pull request #3 from apache/master 89fde08 [nchammas] Merge pull request #2 from apache/master 69f6e22 [Nicholas Chammas] PEP8 fixes 2627247 [Nicholas Chammas] broke up lines before they hit 100 chars 6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names 69da6cf [nchammas] Merge pull request #1 from apache/master	2014-07-21 22:30:53 -07:00
Gregory Owen	c3462c6568	[SPARK-2086] Improve output of toDebugString to make shuffle boundaries more clear Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly New output: ``` (3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22 \| MappedValuesRDD[324] at apply at Transformer.scala:22 \| CoGroupedRDD[323] at apply at Transformer.scala:22 +-(5) MappedRDD[320] at apply at Transformer.scala:22 \| \| MappedRDD[319] at apply at Transformer.scala:22 \| \| MappedValuesRDD[318] at apply at Transformer.scala:22 \| \| MapPartitionsRDD[317] at apply at Transformer.scala:22 \| \| ShuffledRDD[316] at apply at Transformer.scala:22 \| +-(10) MappedRDD[315] at apply at Transformer.scala:22 \| \| ParallelCollectionRDD[314] at apply at Transformer.scala:22 +-(100) MappedRDD[322] at apply at Transformer.scala:22 \| ParallelCollectionRDD[321] at apply at Transformer.scala:22 ``` Author: Gregory Owen <greowen@gmail.com> Closes #1364 from GregOwen/to-debug-string and squashes the following commits: 08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly 1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly	2014-07-21 18:55:01 -07:00
Michael Armbrust	511a731403	[SPARK-2561][SQL] Fix apply schema We need to use the analyzed attributes otherwise we end up with a tree that will never resolve. Author: Michael Armbrust <michael@databricks.com> Closes #1470 from marmbrus/fixApplySchema and squashes the following commits: f968195 [Michael Armbrust] Use analyzed attributes when applying the schema. 4969015 [Michael Armbrust] Add test case.	2014-07-21 18:18:17 -07:00
Burak	a4d60208ec	[SPARK-2434][MLlib]: Warning messages that point users to original MLlib implementations added to Examples [SPARK-2434][MLlib]: Warning messages that refer users to the original MLlib implementations of some popular example machine learning algorithms added both in the comments and the code. The following examples have been modified: Scala: * LocalALS * LocalFileLR * LocalKMeans * LocalLP * SparkALS * SparkHdfsLR * SparkKMeans * SparkLR Python: * kmeans.py * als.py * logistic_regression.py Author: Burak <brkyvz@gmail.com> Closes #1515 from brkyvz/SPARK-2434 and squashes the following commits: 7505da9 [Burak] [SPARK-2434][MLlib]: Warning messages added, scalastyle errors fixed, and added missing punctuation b96b522 [Burak] [SPARK-2434][MLlib]: Warning messages added and scalastyle errors fixed 4762f39 [Burak] [SPARK-2434]: Warning messages added 17d3d83 [Burak] SPARK-2434: Added warning messages to the naive implementations of the example algorithms 2cb5301 [Burak] SPARK-2434: Warning messages redirecting to original implementaions added.	2014-07-21 17:03:40 -07:00
Aaron Davidson	abeacffb7b	Fix flakey HiveQuerySuite test Result may not be returned in the expected order, so relax that constraint. Author: Aaron Davidson <aaron@databricks.com> Closes #1514 from aarondav/flakey and squashes the following commits: e5af823 [Aaron Davidson] Fix flakey HiveQuerySuite test	2014-07-21 14:35:15 -07:00
Davies Liu	872538c600	[SPARK-2494] [PySpark] make hash of None consistant cross machines In CPython, hash of None is different cross machines, it will cause wrong result during shuffle. This PR will fix this. Author: Davies Liu <davies.liu@gmail.com> Closes #1371 from davies/hash_of_none and squashes the following commits: d01745f [Davies Liu] add comments, remove outdated unit tests 5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy() b7118aa [Davies Liu] use __builtin__ instead of __builtins__ 839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines	2014-07-21 11:59:54 -07:00
Sandy Ryza	f89cf65d7a	SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler Author: Sandy Ryza <sandy@cloudera.com> Closes #634 from sryza/sandy-spark-1707 and squashes the following commits: 2f6e358 [Sandy Ryza] Default min registered executors ratio to .8 for YARN 354c630 [Sandy Ryza] Remove outdated comments c744ef3 [Sandy Ryza] Take out waitForInitialAllocations 2a4329b [Sandy Ryza] SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler	2014-07-21 13:15:46 -05:00
Cheng Lian	cd273a2381	[SPARK-2190][SQL] Specialized ColumnType for Timestamp JIRA issue: [SPARK-2190](https://issues.apache.org/jira/browse/SPARK-2190) Added specialized in-memory column type for `Timestamp`. Whitelisted all timestamp related Hive tests except `timestamp_udf`, which is timezone sensitive. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1440 from liancheng/timestamp-column-type and squashes the following commits: e682175 [Cheng Lian] Enabled more timezone sensitive Hive tests. 53a358f [Cheng Lian] Fixed failed test suites 01b592d [Cheng Lian] Fixed SimpleDateFormat thread safety issue 2a59343 [Cheng Lian] Removed timezone sensitive Hive timestamp tests 45dd05d [Cheng Lian] Added Timestamp specific in-memory columnar representation	2014-07-21 00:46:28 -07:00
Michael Giannakopoulos	db56f2df1b	[SPARK-1945][MLLIB] Documentation Improvements for Spark 1.0 Standalone application examples are added to 'mllib-linear-methods.md' file written in Java. This commit is related to the issue [Add full Java Examples in MLlib docs](https://issues.apache.org/jira/browse/SPARK-1945). Also I changed the name of the sigmoid function from 'logit' to 'f'. This is because the logit function is the inverse of sigmoid. Thanks, Michael Author: Michael Giannakopoulos <miccagiann@gmail.com> Closes #1311 from miccagiann/master and squashes the following commits: 8ffe5ab [Michael Giannakopoulos] Update code so as to comply with code standards. f7ad5cc [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master' 38d92c7 [Michael Giannakopoulos] Adding PCA, SVD and LBFGS examples in Java. Performing minor updates in the already committed examples so as to eradicate the call of 'productElement' function whenever is possible. cc0a089 [Michael Giannakopoulos] Modyfied Java examples so as to comply with coding standards. b1141b2 [Michael Giannakopoulos] Added Java examples for Clustering and Collaborative Filtering [mllib-clustering.md & mllib-collaborative-filtering.md]. 837f7a8 [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master' 15f0eb4 [Michael Giannakopoulos] Java examples included in 'mllib-linear-methods.md' file.	2014-07-20 20:48:44 -07:00
Kay Ousterhout	f6e7302cb4	Improve scheduler delay tooltip. As a result of shivaram's experience debugging long scheduler delay, I think we should improve the tooltip to point people in the right direction if scheduler delay is large. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #1488 from kayousterhout/better_tooltips and squashes the following commits: 22176fd [Kay Ousterhout] Improve scheduler delay tooltip.	2014-07-20 20:18:18 -07:00
Xiangrui Meng	b86db517b6	[SPARK-2552][MLLIB] stabilize logistic function in pyspark to avoid overflow in `exp(x)` if `x` is large. Author: Xiangrui Meng <meng@databricks.com> Closes #1493 from mengxr/py-logistic and squashes the following commits: 259e863 [Xiangrui Meng] stabilize logistic function in pyspark	2014-07-20 18:40:36 -07:00
Sandy Ryza	9564f85489	SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant Author: Sandy Ryza <sandy@cloudera.com> Closes #1474 from sryza/sandy-spark-2564 and squashes the following commits: 35b8388 [Sandy Ryza] Fix compile error on upmerge 7b985fb [Sandy Ryza] Fix test compile error 43f79e6 [Sandy Ryza] SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant	2014-07-20 14:45:34 -07:00
Xiangrui Meng	1b10b8114a	[SPARK-2495][MLLIB] remove private[mllib] from linear models' constructors This is part of SPARK-2495 to allow users construct linear models manually. Author: Xiangrui Meng <meng@databricks.com> Closes #1492 from mengxr/public-constructor and squashes the following commits: a48b766 [Xiangrui Meng] remove private[mllib] from linear models' constructors	2014-07-20 13:04:59 -07:00
Reynold Xin	fa51b0fb5b	[SPARK-2598] RangePartitioner's binary search does not use the given Ordering We should fix this in branch-1.0 as well. Author: Reynold Xin <rxin@apache.org> Closes #1500 from rxin/rangePartitioner and squashes the following commits: `c0a94f5` [Reynold Xin] [SPARK-2598] RangePartitioner's binary search does not use the given Ordering.	2014-07-20 11:06:06 -07:00
Sandy Ryza	98ab411225	SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical section... ...s of CoGroupedRDD and PairRDDFunctions This also removes an unnecessary tuple creation in cogroup. Author: Sandy Ryza <sandy@cloudera.com> Closes #1447 from sryza/sandy-spark-2519-2 and squashes the following commits: b6d9699 [Sandy Ryza] Remove missed Tuple2 match in CoGroupedRDD a109828 [Sandy Ryza] Remove another pattern matching in MappedValuesRDD and revert some changes in PairRDDFunctions be10f8a [Sandy Ryza] SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections of CoGroupedRDD and PairRDDFunctions	2014-07-20 01:24:32 -07:00
lianhuiwang	4da01e3813	[SPARK-2524] missing document about spark.deploy.retainedDrivers https://issues.apache.org/jira/browse/SPARK-2524 The configuration on spark.deploy.retainedDrivers is undocumented but actually used https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60 Author: lianhuiwang <lianhuiwang09@gmail.com> Author: Wang Lianhui <lianhuiwang09@gmail.com> Author: unknown <Administrator@taguswang-PC1.tencent.com> Closes #1443 from lianhuiwang/SPARK-2524 and squashes the following commits: 64660fd [Wang Lianhui] address pwendell's comments 5f6bbb7 [Wang Lianhui] missing document about spark.deploy.retainedDrivers 44a3f50 [unknown] Merge remote-tracking branch 'upstream/master' eacf933 [lianhuiwang] Merge remote-tracking branch 'upstream/master' 8bbfe76 [lianhuiwang] Merge remote-tracking branch 'upstream/master' 480ce94 [lianhuiwang] address aarondav comments f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED	2014-07-19 20:46:59 -07:00
Mark Wagner	c11949878d	SPARK-2587: Fix error message in make-distribution.sh make-distribution.sh gives a slightly off error message when using --with-hive. Author: Mark Wagner <mwagner@mwagner-ld.linkedin.biz> Closes #1489 from wagnermarkd/SPARK-2587 and squashes the following commits: 7b5d3ff [Mark Wagner] SPARK-2587: Fix error message in make-distribution.sh	2014-07-19 20:24:13 -07:00
Cesar Arevalo	0d01e85f42	Typo fix to the programming guide in the docs Typo fix to the programming guide in the docs. Changed the word "distibuted" to "distributed". Author: Cesar Arevalo <cesar@zephyrhealthinc.com> Closes #1495 from cesararevalo/master and squashes the following commits: 0c2e3a7 [Cesar Arevalo] Typo fix to the programming guide in the docs	2014-07-19 20:20:07 -07:00
Patrick Wendell	d39e3b9673	SPARK-2596 HOTFIX: Deal with non-existent JIRAs. A small bug that was found in our JIRA sync script.	2014-07-19 20:06:28 -07:00
Patrick Wendell	49e4727449	SPARK-2596 A tool for mirroring github pull requests on JIRA. For a bunch of reasons we should automatically populate a JIRA with information about new pull requests when they arrive. I've written a small python script to do this that we can run from Jenkins every 5 or 10 minutes to keep things in sync. Author: Patrick Wendell <pwendell@gmail.com> Closes #1496 from pwendell/github-integration and squashes the following commits: 55ad226 [Patrick Wendell] Small fix afda547 [Patrick Wendell] Use sequence instead of dictiory for JIRA's 3e18cc1 [Patrick Wendell] Small edits 84c5606 [Patrick Wendell] SPARK-2596 A tool for mirroring github pull requests on JIRA.	2014-07-19 18:19:08 -07:00
Reynold Xin	1efb3698b6	Revert "[SPARK-2521] Broadcast RDD object (instead of sending it along with every task)." This reverts commit `7b8cd17525`.	2014-07-19 16:56:22 -07:00
chutium	2a732110d4	SPARK-2407: Added Parser of SQL SUBSTR() follow-up of #1359 Author: chutium <teng.qiu@gmail.com> Closes #1442 from chutium/master and squashes the following commits: b49cc8a [chutium] SPARK-2407: Added Parser of SQL SUBSTRING() #1442 9a60ccf [chutium] SPARK-2407: Added Parser of SQL SUBSTR() #1442 06e933b [chutium] Merge https://github.com/apache/spark c870172 [chutium] Merge https://github.com/apache/spark 094f773 [chutium] Merge https://github.com/apache/spark 88cb37d [chutium] Merge https://github.com/apache/spark 1de83a7 [chutium] SPARK-2407: Added Parse of SQL SUBSTR()	2014-07-19 11:04:41 -05:00
Lijie Xu	805f329bb1	put 'curRequestSize = 0' after 'logDebug' it This is a minor change. We should first logDebug($curRequestSize) and then set it to 0. Author: Lijie Xu <csxulijie@gmail.com> Closes #1477 from JerryLead/patch-1 and squashes the following commits: aed722d [Lijie Xu] put 'curRequestSize = 0' after 'logDebug' it	2014-07-19 01:27:26 -07:00
Reynold Xin	7b8cd17525	[SPARK-2521] Broadcast RDD object (instead of sending it along with every task). Currently (as of Spark 1.0.1), Spark sends RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables. The patch uses broadcast to send RDD objects and the closures to executors, and use Akka to only send a reference to the broadcast RDD/closure along with the partition specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large. The user-facing impact of the change include: 1. Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations 2. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput. In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently). A simple way to test this: ```scala val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a); sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count ``` Numbers on 3 r3.8xlarge instances on EC2 ``` master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s ``` Author: Reynold Xin <rxin@apache.org> Closes #1452 from rxin/broadcast-task and squashes the following commits: 762e0be [Reynold Xin] Warn large broadcasts. ade6eac [Reynold Xin] Log broadcast size. c3b6f11 [Reynold Xin] Added a unit test for clean up. 754085f [Reynold Xin] Explain why broadcasting serialized copy of the task. 04b17f0 [Reynold Xin] [SPARK-2521] Broadcast RDD object once per TaskSet (instead of sending it for every task).	2014-07-18 23:52:47 -07:00
Doris Xin	a243364b22	[SPARK-2359][MLlib] Correlations Implementation for Pearson and Spearman's correlation. Author: Doris Xin <doris.s.xin@gmail.com> Closes #1367 from dorx/correlation and squashes the following commits: c0dd7dc [Doris Xin] here we go 32d83a3 [Doris Xin] Reviewer comments 4db0da1 [Doris Xin] added private[stat] to Spearman b716f70 [Doris Xin] minor fixes 6e1b42a [Doris Xin] More comments addressed. Still some open questions 8104f44 [Doris Xin] addressed comments. some open questions still 39387c2 [Doris Xin] added missing header bd3cf19 [Doris Xin] Merge branch 'master' into correlation 6341884 [Doris Xin] race condition bug squished bd2bacf [Doris Xin] Race condition bug b775ff9 [Doris Xin] old wrong impl 534ebf2 [Doris Xin] Merge branch 'master' into correlation 818fa31 [Doris Xin] wip units 9d808ee [Doris Xin] wip units b843a13 [Doris Xin] revert change in stat counter 28561b6 [Doris Xin] wip bb2e977 [Doris Xin] minor fix 8e02c63 [Doris Xin] Merge branch 'master' into correlation 2a40aa1 [Doris Xin] initial, untested implementation of Pearson `dfc4854` [Doris Xin] WIP	2014-07-18 17:25:32 -07:00
Kay Ousterhout	7b971b91ca	[SPARK-2571] Correctly report shuffle read metrics. Currently, shuffle read metrics are incorrectly reported when stages have multiple shuffle dependencies (they are set to be the metrics from just one of the shuffle dependencies, rather than the accumulated metrics from all of the shuffle dependencies). This fixes that problem, and should probably be back-ported to the 0.9 branch. Thanks ryanra for discovering this problem! cc rxin andrewor14 Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #1476 from kayousterhout/join_bug and squashes the following commits: 0203a16 [Kay Ousterhout] Fix broken unit tests. f463c2e [Kay Ousterhout] [SPARK-2571] Correctly report shuffle read metrics.	2014-07-18 14:40:32 -07:00
Cheng Hao	7f17208137	[SPARK-2540] [SQL] Add HiveDecimal & HiveVarchar support in unwrapping data Author: Cheng Hao <hao.cheng@intel.com> Closes #1436 from chenghao-intel/unwrapdata and squashes the following commits: 34cc21a [Cheng Hao] update the table scan accodringly since the unwrapData function changed afc39da [Cheng Hao] Polish the code 39d6475 [Cheng Hao] Add HiveDecimal & HiveVarchar support in unwrap data	2014-07-18 16:38:11 -05:00
Takuya UESHIN	3a1709fa55	[SPARK-2535][SQL] Add StringComparison case to NullPropagation. `StringComparison` expressions including `null` literal cases could be added to `NullPropagation`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1451 from ueshin/issues/SPARK-2535 and squashes the following commits: e99c237 [Takuya UESHIN] Add some tests. 8f9b984 [Takuya UESHIN] Add StringComparison case to NullPropagation.	2014-07-18 16:24:00 -05:00
Manish Amde	d88f6be446	[MLlib] SPARK-1536: multiclass classification support for decision tree The ability to perform multiclass classification is a big advantage for using decision trees and was a highly requested feature for mllib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample weights support using WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost which requires instances to be weighted. It handles the special case where the categorical variables cannot be ordered for multiclass classification and thus the optimizations used for speeding up binary classification cannot be directly used for multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyses all the ```2^(m-1) - 1``` categorical splits provided that #splits are less than the maxBins provided in the input. This condition will not be met for features with large number of categories -- using decision trees is not recommended for such datasets in general since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing bin size of the tree algorithms, use binary encoding for categorical features or use one-vs-all classification strategy) to avoid these constraints. The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets. cc: mengxr, etrain, hirakendu, atalwalkar, srowen Author: Manish Amde <manish9ue@gmail.com> Author: manishamde <manish9ue@gmail.com> Author: Evan Sparks <sparks@cs.berkeley.edu> Closes #886 from manishamde/multiclass and squashes the following commits: 26f8acc [Manish Amde] another attempt at fixing mima c5b2d04 [Manish Amde] more MIMA fixes 1ce7212 [Manish Amde] change problem filter for mima 10fdd82 [Manish Amde] fixing MIMA excludes e1c970d [Manish Amde] merged master abf2901 [Manish Amde] adding classes to MimaExcludes.scala 45e767a [Manish Amde] adding developer api annotation for overriden methods c8428c4 [Manish Amde] fixing weird multiline bug afced16 [Manish Amde] removed label weights support 2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise 4e85f2c [Manish Amde] minor: fixed scalastyle issues b2ae41f [Manish Amde] minor: scalastyle e4c1321 [Manish Amde] using while loop for regression histograms d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR 0fecd38 [Manish Amde] minor: add newline to EOF 2061cf5 [Manish Amde] merged from master 06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion 9cc3e31 [Manish Amde] added implicit conversion import 5c1b2ca [Manish Amde] doc for PointConverter class 485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint 3d7f911 [Manish Amde] updated doc 8e44ab8 [Manish Amde] updated doc adc7315 [Manish Amde] support ordered categorical splits for multiclass classification e3e8843 [Manish Amde] minor code formatting 23d4268 [Manish Amde] minor: another minor code style 34ee7b9 [Manish Amde] minor: code style 237762d [Manish Amde] renaming functions 12e6d0a [Manish Amde] minor: removing line in doc 9a90c93 [Manish Amde] Merge branch 'master' into multiclass 1892a2c [Manish Amde] tests and use multiclass binaggregate length when atleast one categorical feature is present f5f6b83 [Manish Amde] multiclass for continous variables 8cfd3b6 [Manish Amde] working for categorical multiclass classification 828ff16 [Manish Amde] added categorical variable test bce835f [Manish Amde] code cleanup 7e5f08c [Manish Amde] minor doc 1dd2735 [Manish Amde] bin search logic for multiclass f16a9bb [Manish Amde] fixing while loop d811425 [Manish Amde] multiclass bin aggregate logic ab5cb21 [Manish Amde] multiclass logic d8e4a11 [Manish Amde] sample weights ed5a2df [Manish Amde] fixed classification requirements d012be7 [Manish Amde] fixed while loop 18d2835 [Manish Amde] changing default values for num classes 6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train 75f2bfc [Manish Amde] minor code style fix e547151 [Manish Amde] minor modifications 34549d0 [Manish Amde] fixing error during merge 098e8c5 [Manish Amde] merged master e006f9d [Manish Amde] changing variable names 5c78e1a [Manish Amde] added multiclass support 6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification 46e06ee [Manish Amde] minor mods 3f85a17 [Manish Amde] tests for multiclass classification 4d5f70c [Manish Amde] added multiclass support for find splits bins 46f909c [Manish Amde] todo for multiclass support 455bea9 [Manish Amde] fixed tests 14aea48 [Manish Amde] changing instance format to weighted labeled point a1a6e09 [Manish Amde] added weighted point class 968ca9d [Manish Amde] merged master 7fc9545 [Manish Amde] added docs ce004a1 [Manish Amde] minor formatting b27ad2c [Manish Amde] formatting 426bb28 [Manish Amde] programming guide blurb 8053fed [Manish Amde] more formatting 5eca9e4 [Manish Amde] grammar 4731cda [Manish Amde] formatting 5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation cbd9f14 [Manish Amde] modified scala.math to math dad9652 [Manish Amde] removed unused imports e0426ee [Manish Amde] renamed parameter 718506b [Manish Amde] added unit test 1517155 [Manish Amde] updated documentation 9dbdabe [Manish Amde] merge from master 719d009 [Manish Amde] updating user documentation fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree 0287772 [Evan Sparks] Fixing scalastyle issue. 2f1e093 [Manish Amde] minor: added doc for maxMemory parameter 2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree abc5a23 [Evan Sparks] Parameterizing max memory. 50b143a [Manish Amde] adding support for very deep trees	2014-07-18 14:00:13 -07:00
Reynold Xin	586e716e47	Reservoir sampling implementation. This is going to be used in https://issues.apache.org/jira/browse/SPARK-2568 Author: Reynold Xin <rxin@apache.org> Closes #1478 from rxin/reservoirSample and squashes the following commits: 17bcbf3 [Reynold Xin] Added seed. badf20d [Reynold Xin] Renamed the method. 6940010 [Reynold Xin] Reservoir sampling implementation.	2014-07-18 12:41:50 -07:00
Basit Mustafa	7f87ab9813	Added t2 instance types New t2 instance types require HVM amis, bailout assumption of pvm causes failures when using t2 instance types. Author: Basit Mustafa <basitmustafa@computes-things-for-basit.local> Closes #1446 from 24601/master and squashes the following commits: 01fe128 [Basit Mustafa] Makin' it pretty 392a95e [Basit Mustafa] Added t2 instance types	2014-07-18 12:23:47 -07:00
Sandy Ryza	30b8d369d4	SPARK-2553. Fix compile error Author: Sandy Ryza <sandy@cloudera.com> Closes #1479 from sryza/sandy-spark-2553 and squashes the following commits: 2cb5ed8 [Sandy Ryza] SPARK-2553. Fix compile error	2014-07-18 00:47:43 -07:00
Sandy Ryza	e52b8719cf	SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency... ... per key My humble opinion is that avoiding allocations in this performance-critical section is worth the extra code. Author: Sandy Ryza <sandy@cloudera.com> Closes #1461 from sryza/sandy-spark-2553 and squashes the following commits: 7eaf7f2 [Sandy Ryza] SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency per key	2014-07-17 23:57:08 -07:00
Cheng Hao	29809a6d58	[SPARK-2570] [SQL] Fix the bug of ClassCastException Exception thrown when running the example of HiveFromSpark. Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145) at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45) at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Author: Cheng Hao <hao.cheng@intel.com> Closes #1475 from chenghao-intel/hive_from_spark and squashes the following commits: d4c0500 [Cheng Hao] Fix the bug of ClassCastException	2014-07-17 23:25:01 -07:00
Andrew Or	6afca2d107	[SPARK-2411] Add a history-not-found page to standalone Master Problem. Right now, if you click on an application after it has finished, it simply refreshes the page if there are no event logs for the application. This is not super intuitive especially because event logging is not enabled by default. We should direct the user to enable this if they attempt to view a SparkUI after the fact without event logs. Fix. The new page conveys different messages in each of the following scenarios: (1) Application did not enable event logging, (2) Event logs are not found in the specified directory, and (3) Exception is thrown while replaying the logs Here are screenshots of what the page looks like in each of the above scenarios: (1) <img src="https://issues.apache.org/jira/secure/attachment/12656204/Event%20logging%20not%20enabled.png" width="75%"> (2) <img src="https://issues.apache.org/jira/secure/attachment/12656203/Application%20history%20not%20found.png"> (3) <img src="https://issues.apache.org/jira/secure/attachment/12656202/Application%20history%20load%20error.png" width="95%"> Author: Andrew Or <andrewor14@gmail.com> Closes #1336 from andrewor14/master-link and squashes the following commits: 2f06206 [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link 97cddc0 [Andrew Or] Add different severity levels 832b687 [Andrew Or] Mention spark.eventLog.dir in error message 51980c3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link ded208c [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link 89d6405 [Andrew Or] Reword message e7df7ed [Andrew Or] Add a history not found page to standalone Master	2014-07-17 19:45:59 -07:00
Reynold Xin	72e9021eaf	[SPARK-2299] Consolidate various stageIdTo* hash maps in JobProgressListener This should reduce memory usage for the web ui as well as slightly increase its speed in draining the UI event queue. @andrewor14 Author: Reynold Xin <rxin@apache.org> Closes #1262 from rxin/ui-consolidate-hashtables and squashes the following commits: 1ac3f97 [Reynold Xin] Oops. Properly handle description. f5736ad [Reynold Xin] Code review comments. b8828dc [Reynold Xin] Merge branch 'master' into ui-consolidate-hashtables 7a7b6c4 [Reynold Xin] Revert css change. f959bb8 [Reynold Xin] [SPARK-2299] Consolidate various stageIdTo* hash maps in JobProgressListener to speed it up. 63256f5 [Reynold Xin] [SPARK-2320] Reduce <pre> block font size.	2014-07-17 18:58:48 -07:00
Joseph K. Bradley	935fe65ff6	SPARK-1215 [MLLIB]: Clustering: Index out of bounds error (2) Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite. (Re-submitting PR after tangling commits in PR 1407 https://github.com/apache/spark/pull/1407 ) Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #1468 from jkbradley/kmeans-fix and squashes the following commits: 4e9bd1e [Joseph K. Bradley] Updated PR per comments from mengxr 6c7a2ec [Joseph K. Bradley] Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite.	2014-07-17 15:05:02 -07:00
Sean Owen	1fcd5dcdd8	SPARK-1478.2 Fix incorrect NioServerSocketChannelFactory constructor call The line break inadvertently means this was interpreted as a call to the no-arg constructor. This doesn't exist in older Netty even. (Also fixed a val name typo.) Author: Sean Owen <srowen@gmail.com> Closes #1466 from srowen/SPARK-1478.2 and squashes the following commits: 59c3501 [Sean Owen] Line break caused Scala to interpret NioServerSocketChannelFactory constructor as the no-arg version, which is not even present in some versions of Netty	2014-07-17 12:20:48 -07:00
Reynold Xin	d988d345d5	[SPARK-2534] Avoid pulling in the entire RDD in various operators This should go into both master and branch-1.0. Author: Reynold Xin <rxin@apache.org> Closes #1450 from rxin/agg-closure and squashes the following commits: e40f363 [Reynold Xin] Mima check excludes. 9186364 [Reynold Xin] Define the return type more explicitly. 38e348b [Reynold Xin] Fixed the cases in RDD.scala. ea6b34d [Reynold Xin] Blah 89b9c43 [Reynold Xin] Fix other instances of accidentally pulling in extra stuff in closures. 73b2783 [Reynold Xin] [SPARK-2534] Avoid pulling in the entire RDD in groupByKey.	2014-07-17 10:54:53 -07:00
Andrew Or	9c73822a08	[SPARK-2423] Clean up SparkSubmit for readability It is currently non-trivial to trace through how different combinations of cluster managers (e.g. yarn) and deploy modes (e.g. cluster) are processed in SparkSubmit. Moving forward, it will be easier to extend SparkSubmit if we first re-organize the code by grouping related logic together. This is a precursor to fixing standalone-cluster mode, which is currently broken (SPARK-2260). Author: Andrew Or <andrewor14@gmail.com> Closes #1349 from andrewor14/submit-cleanup and squashes the following commits: 8f99200 [Andrew Or] script -> program (minor) 30f2e65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-cleanup fe484a1 [Andrew Or] Move deploy mode checks after yarn code 7167824 [Andrew Or] Re-order config options and update comments 0b01ff8 [Andrew Or] Clean up SparkSubmit for readability	2014-07-17 01:13:32 -07:00
Patrick Wendell	d0ea496877	SPARK-2526: Simplify options in make-distribution.sh Right now we have a bunch of parallel logic in make-distribution.sh that's just extra work to maintain. We should just pass through Maven profiles in this case and keep the script simple. See the JIRA for more details. Author: Patrick Wendell <pwendell@gmail.com> Closes #1445 from pwendell/make-distribution.sh and squashes the following commits: f1294ea [Patrick Wendell] Simplify options in make-distribution.sh.	2014-07-17 01:02:35 -07:00
Aaron Davidson	7c23c0dc3e	[SPARK-2412] CoalescedRDD throws exception with certain pref locs If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as "groupHash.get(nxt_replica).get" is not valid. The fix is just to add an ArrayBuffer to groupHash for that replica if it didn't already exist. Author: Aaron Davidson <aaron@databricks.com> Closes #1337 from aarondav/2412 and squashes the following commits: f587b5d [Aaron Davidson] getOrElseUpdate 3ad8a3c [Aaron Davidson] [SPARK-2412] CoalescedRDD throws exception with certain pref locs	2014-07-17 01:01:14 -07:00
Aaron Davidson	9c249743ea	[SPARK-2154] Schedule next Driver when one completes (standalone mode) Author: Aaron Davidson <aaron@databricks.com> Closes #1405 from aarondav/2154 and squashes the following commits: 24e9ef9 [Aaron Davidson] [SPARK-2154] Schedule next Driver when one completes (standalone mode)	2014-07-16 14:16:48 -07:00
Aaron Davidson	8867cd0bc2	SPARK-1097: Do not introduce deadlock while fixing concurrency bug We recently added this lock on 'conf' in order to prevent concurrent creation. However, it turns out that this can introduce a deadlock because Hadoop also synchronizes on the Configuration objects when creating new Configurations (and they do so via a static REGISTRY which contains all created Configurations). This fix forces all Spark initialization of Configuration objects to occur serially by using a static lock that we control, and thus also prevents introducing the deadlock. Author: Aaron Davidson <aaron@databricks.com> Closes #1409 from aarondav/1054 and squashes the following commits: 7d1b769 [Aaron Davidson] SPARK-1097: Do not introduce deadlock while fixing concurrency bug	2014-07-16 14:10:17 -07:00
Reynold Xin	7c8d123225	[SPARK-2317] Improve task logging. We use TID to indicate task logging. However, TID itself does not capture stage or retries, making it harder to correlate with the application itself. This pull request changes all logging messages for tasks to include both the TID and the stage id, stage attempt, task id, and task attempt. I've consulted various people but unfortunately this is a really hard task. Driver log looks like: ``` 14/06/28 18:53:29 INFO DAGScheduler: Submitting 10 missing tasks from Stage 0 (MappedRDD[1] at map at <console>:13) 14/06/28 18:53:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks 14/06/28 18:53:29 INFO TaskSetManager: Re-computing pending task lists. 14/07/15 19:44:40 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1855 bytes) 14/07/15 19:44:40 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1855 bytes) 14/07/15 19:44:40 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1855 bytes) 14/07/15 19:44:40 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1855 bytes) 14/07/15 19:44:40 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 4, localhost, PROCESS_LOCAL, 1855 bytes) 14/07/15 19:44:40 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 5, localhost, PROCESS_LOCAL, 1855 bytes) 14/07/15 19:44:40 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 6, localhost, PROCESS_LOCAL, 1855 bytes) ... 14/07/15 19:44:40 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 64 ms on localhost (4/10) 14/07/15 19:44:40 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 4) in 63 ms on localhost (5/10) 14/07/15 19:44:40 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 2) in 63 ms on localhost (6/10) 14/07/15 19:44:40 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 7) in 62 ms on localhost (7/10) 14/07/15 19:44:40 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 6) in 63 ms on localhost (8/10) 14/07/15 19:44:40 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 9) in 8 ms on localhost (9/10) 14/07/15 19:44:40 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 8) in 9 ms on localhost (10/10) ``` Executor log looks like ``` 14/07/15 19:44:40 INFO Executor: Running task 0.0 in stage 1.0 (TID 0) 14/07/15 19:44:40 INFO Executor: Running task 3.0 in stage 1.0 (TID 3) 14/07/15 19:44:40 INFO Executor: Running task 1.0 in stage 1.0 (TID 1) 14/07/15 19:44:40 INFO Executor: Running task 4.0 in stage 1.0 (TID 4) 14/07/15 19:44:40 INFO Executor: Running task 2.0 in stage 1.0 (TID 2) 14/07/15 19:44:40 INFO Executor: Running task 5.0 in stage 1.0 (TID 5) 14/07/15 19:44:40 INFO Executor: Running task 6.0 in stage 1.0 (TID 6) 14/07/15 19:44:40 INFO Executor: Running task 7.0 in stage 1.0 (TID 7) 14/07/15 19:44:40 INFO Executor: Finished task 3.0 in stage 1.0 (TID 3). 847 bytes result sent to driver 14/07/15 19:44:40 INFO Executor: Finished task 2.0 in stage 1.0 (TID 2). 847 bytes result sent to driver 14/07/15 19:44:40 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 847 bytes result sent to driver 14/07/15 19:44:40 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 847 bytes result sent to driver 14/07/15 19:44:40 INFO Executor: Finished task 5.0 in stage 1.0 (TID 5). 847 bytes result sent to driver 14/07/15 19:44:40 INFO Executor: Finished task 4.0 in stage 1.0 (TID 4). 847 bytes result sent to driver 14/07/15 19:44:40 INFO Executor: Finished task 6.0 in stage 1.0 (TID 6). 847 bytes result sent to driver 14/07/15 19:44:40 INFO Executor: Finished task 7.0 in stage 1.0 (TID 7). 847 bytes result sent to driver ``` Author: Reynold Xin <rxin@apache.org> Closes #1259 from rxin/betterTaskLogging and squashes the following commits: c28ada1 [Reynold Xin] Fix unit test failure. 987d043 [Reynold Xin] Updated log messages. c6cfd46 [Reynold Xin] Merge branch 'master' into betterTaskLogging b7b1bcc [Reynold Xin] Fixed a typo. f9aba3c [Reynold Xin] Made it compile. f8a5c06 [Reynold Xin] Merge branch 'master' into betterTaskLogging 07264e6 [Reynold Xin] Defensive check against unknown TaskEndReason. 76bbd18 [Reynold Xin] FailureSuite not serializable reporting. 4659b20 [Reynold Xin] Remove unused variable. 53888e3 [Reynold Xin] [SPARK-2317] Improve task logging.	2014-07-16 11:50:49 -07:00
James Z.M. Gao	caa163f086	fix compile error of streaming project explicit return type for implicit function Author: James Z.M. Gao <gaozhm@mediav.com> Closes #153 from gzm55/work/streaming-compile and squashes the following commits: 11e9c8d [James Z.M. Gao] fix style error fe88109 [James Z.M. Gao] fix compile error of streaming project	2014-07-16 11:35:21 -07:00
Xiangrui Meng	96f28c9726	[SPARK-2522] set default broadcast factory to torrent HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than http. Maybe we should make torrent the default broadcast method. Author: Xiangrui Meng <meng@databricks.com> Closes #1437 from mengxr/bt-broadcast and squashes the following commits: ed492fe [Xiangrui Meng] set default broadcast factory to torrent	2014-07-16 11:27:51 -07:00
Reynold Xin	ef48222c10	[SPARK-2517] Remove some compiler warnings. Author: Reynold Xin <rxin@apache.org> Closes #1433 from rxin/compile-warning and squashes the following commits: 8d0b890 [Reynold Xin] Remove some compiler warnings.	2014-07-16 11:15:07 -07:00
Takuya UESHIN	cc965eea51	[SPARK-2518][SQL] Fix foldability of Substring expression. This is a follow-up of #1428. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1432 from ueshin/issues/SPARK-2518 and squashes the following commits: 37d1ace [Takuya UESHIN] Fix foldability of Substring expression.	2014-07-16 11:13:38 -07:00

1 2 3 4 5 ...

7477 commits