ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
MechCoder	20bb10f864	[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark This adds Pylint checks to PySpark. For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script). We still need to figure out what rules to be allowed. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7241 from MechCoder/pylint and squashes the following commits: 2fc7291 [MechCoder] Remove pylint test fail 6d883a2 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins f3a5e17 [MechCoder] undefined-variable ca8b749 [MechCoder] Minor changes 71629f8 [MechCoder] remove trailing whitespace 8498ff9 [MechCoder] Remove blacklisted arguments and pointless statements check 1dbd094 [MechCoder] Disable all checks for now 8b8aa8a [MechCoder] Add pylint configuration file 7871bb1 [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark	2015-07-15 08:25:53 -07:00
zsxwing	adb33d3665	[SPARK-9012] [WEBUI] Escape Accumulators in the task table Running the following code breaks the task table because accumulators aren't escaped. ``` val a = sc.accumulator(1, "<table>") sc.parallelize(1 to 10).foreach(i => a += i) ``` Before this fix: ![screen shot 2015-07-13 at 8 02 44 pm](https://cloud.githubusercontent.com/assets/1000778/8649295/b17c491e-299b-11e5-97ee-4e6a64074c4f.png) After this fix: ![screen shot 2015-07-13 at 8 14 32 pm](https://cloud.githubusercontent.com/assets/1000778/8649337/f9e9c9ec-299b-11e5-927e-35c0a2f897f5.png) Author: zsxwing <zsxwing@gmail.com> Closes #7369 from zsxwing/SPARK-9012 and squashes the following commits: a83c9b6 [zsxwing] Escape Accumulators in the task table	2015-07-15 17:30:57 +09:00
Reynold Xin	14935d846a	[HOTFIX][SQL] Unit test breaking.	2015-07-15 00:12:21 -07:00
Feynman Liang	1bb8accbc9	[SPARK-8997] [MLLIB] Performance improvements in LocalPrefixSpan Improves the performance of LocalPrefixSpan by implementing optimizations proposed in [SPARK-8997](https://issues.apache.org/jira/browse/SPARK-8997) Author: Feynman Liang <fliang@databricks.com> Author: Feynman Liang <feynman.liang@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #7360 from feynmanliang/SPARK-8997-improve-prefixspan and squashes the following commits: 59db2f5 [Feynman Liang] Merge pull request #1 from mengxr/SPARK-8997 91e4357 [Xiangrui Meng] update LocalPrefixSpan impl 9212256 [Feynman Liang] MengXR code review comments f055d82 [Feynman Liang] Fix failing scalatest 2e00cba [Feynman Liang] Depth first projections 70b93e3 [Feynman Liang] Performance improvements in LocalPrefixSpan, fix tests	2015-07-14 23:50:57 -07:00
Yijie Shen	f0e129740d	[SPARK-8279][SQL]Add math function round JIRA: https://issues.apache.org/jira/browse/SPARK-8279 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #6938 from yijieshen/udf_round_3 and squashes the following commits: 07a124c [Yijie Shen] remove useless def children 392b65b [Yijie Shen] add negative scale test in DecimalSuite 61760ee [Yijie Shen] address reviews 302a78a [Yijie Shen] Add dataframe function test 31dfe7c [Yijie Shen] refactor round to make it readable 8c7a949 [Yijie Shen] rebase & inputTypes update 9555e35 [Yijie Shen] tiny style fix d10be4a [Yijie Shen] use TypeCollection to specify wanted input and implicit cast c3b9839 [Yijie Shen] rely on implict cast to handle string input b0bff79 [Yijie Shen] make round's inner method's name more meaningful 9bd6930 [Yijie Shen] revert accidental change e6f44c4 [Yijie Shen] refactor eval and genCode 1b87540 [Yijie Shen] modify checkInputDataTypes using foldable 5486b2d [Yijie Shen] DataFrame API modification 2077888 [Yijie Shen] codegen versioned eval 6cd9a64 [Yijie Shen] refactor Round's constructor 9be894e [Yijie Shen] add round functions in o.a.s.sql.functions 7c83e13 [Yijie Shen] more tests on round 56db4bb [Yijie Shen] Add decimal support to Round 7e163ae [Yijie Shen] style fix 653d047 [Yijie Shen] Add math function round	2015-07-14 23:30:41 -07:00
FlytxtRnD	3f6296fed4	[SPARK-8018] [MLLIB] KMeans should accept initial cluster centers as param This allows Kmeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run. Author: FlytxtRnD <meethu.mathew@flytxt.com> Closes #6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits: 94b56df [FlytxtRnD] style correction ef95ee2 [FlytxtRnD] style correction c446c58 [FlytxtRnD] documentation and numRuns warning change 06d13ef [FlytxtRnD] numRuns corrected d12336e [FlytxtRnD] numRuns variable modifications 07f8554 [FlytxtRnD] remove setRuns from setIntialModel e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018 242ead1 [FlytxtRnD] corrected == to === in assert 714acb5 [FlytxtRnD] added numRuns 60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed 582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018 3f5fc8e [FlytxtRnD] test case modified and one runs condition added cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018 16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018 e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria 6959861 [FlytxtRnD] Accept initial cluster centers in KMeans	2015-07-14 23:29:02 -07:00
Yu ISHIKAWA	4692769655	[SPARK-6259] [MLLIB] Python API for LDA I implemented the Python API for LDA, but I didn't implement a method for `LDAModel.describeTopics()` because it's a little hard to implement right now. Adding documentation and example code would also fit better in a separate issue. TODO: `LDAModel.describeTopics()` must also be implemented in Python, but that would be better handled in a separate issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits: 6855f59 [Yu ISHIKAWA] LDA inherits object 28bd165 [Yu ISHIKAWA] Change the place of testing code d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value 083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer` 9f8bed8 [Yu ISHIKAWA] Simplify casting faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters 98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented. 57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing 73412c3 [Yu ISHIKAWA] Fix the typo 2278829 [Yu ISHIKAWA] Fix the indentation 39514ec [Yu ISHIKAWA] Modify how to cast the input data 8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala` 77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint 68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python 25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing	2015-07-14 23:27:42 -07:00
Michael Armbrust	c6b1a9e74e	Revert SPARK-6910 and SPARK-9027 Revert #7216 and #7386. These patches seem to be causing quite a few test failures: ``` Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor322.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:351) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:318) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:180) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:135) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:172) at org.apache.spark.sql.hive.client.ClientWrapper.getPartitionsByFilter(ClientWrapper.scala:318) at org.apache.spark.sql.hive.client.HiveTable.getPartitions(ClientInterface.scala:78) at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(HiveMetastoreCatalog.scala:670) at org.apache.spark.sql.hive.execution.HiveTableScan.doExecute(HiveTableScan.scala:137) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:89) at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:164) at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:151) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... 85 more Caused by: MetaException(message:Filtering is supported only on partition keys of type string) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:185) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:452) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:357) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:279) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$TreeNode.generateJDOFilter(ExpressionTree.java:243) at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:590) at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:2417) at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2029) at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:146) at org.apache.hadoop.hive.metastore.ObjectStore$4.getJdoResult(ObjectStore.java:2332) ``` https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-with-YARN/2945/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/SortMergeCompatibilitySuite/auto_sortmerge_join_16/ Author: Michael Armbrust <michael@databricks.com> Closes #7409 from marmbrus/revertMetastorePushdown and squashes the following commits: 92fabd3 [Michael Armbrust] Revert SPARK-6910 and SPARK-9027 5d3bdf2 [Michael Armbrust] Revert "[SPARK-9027] [SQL] Generalize metastore predicate pushdown"	2015-07-14 22:57:39 -07:00
Reynold Xin	f23a721c10	[SPARK-8993][SQL] More comprehensive type checking in expressions. This patch makes the following changes: 1. ExpectsInputTypes only defines expected input types, but does not perform any implicit type casting. 2. ImplicitCastInputTypes is a new trait that defines both expected input types, as well as performs implicit type casting. 3. BinaryOperator has a new abstract function "inputType", which defines the expected input type for both left/right. Concrete BinaryOperator expressions no longer perform any implicit type casting. 4. For BinaryOperators, convert NullType (i.e. null literals) into some accepted type so BinaryOperators don't need to handle NullTypes. TODOs needed: fix unit tests for error reporting. I'm intentionally not changing anything in aggregate expressions because yhuai is doing a big refactoring on that right now. Author: Reynold Xin <rxin@databricks.com> Closes #7348 from rxin/typecheck and squashes the following commits: 8fcf814 [Reynold Xin] Fixed ordering of cases. 3bb63e7 [Reynold Xin] Style fix. f45408f [Reynold Xin] Comment update. aa7790e [Reynold Xin] Moved RemoveNullTypes into ImplicitTypeCasts. 438ea07 [Reynold Xin] space d55c9e5 [Reynold Xin] Removes NullTypes. 360d124 [Reynold Xin] Fixed the rule. fb66657 [Reynold Xin] Convert NullType into some accepted type for BinaryOperators. 2e22330 [Reynold Xin] Fixed unit tests. 4932d57 [Reynold Xin] Style fix. d061691 [Reynold Xin] Rename existing ExpectsInputTypes -> ImplicitCastInputTypes. e4727cc [Reynold Xin] BinaryOperator should not be doing implicit cast. d017861 [Reynold Xin] Improve expression type checking.	2015-07-14 22:52:53 -07:00
Sun Rui	f650a005e0	[SPARK-8808] [SPARKR] Fix assignments in SparkR. Author: Sun Rui <rui.sun@intel.com> Closes #7395 from sun-rui/SPARK-8808 and squashes the following commits: ce603bc [Sun Rui] Use '<-' instead of '='. 88590b1 [Sun Rui] Use '<-' instead of '='.	2015-07-14 22:21:01 -07:00
Patrick Wendell	5572fd0c51	[HOTFIX] Adding new names to known contributors	2015-07-14 21:44:47 -07:00
jerryshao	bb870e72f4	[SPARK-5523] [CORE] [STREAMING] Add a cache for hostname in TaskMetrics to decrease the memory usage and GC overhead The hostname in TaskMetrics is created through deserialization, and the number of distinct hostnames is usually only on the order of the number of cluster nodes. Adding a cache layer to deduplicate these objects can therefore reduce memory usage and alleviate GC overhead, especially for long-running applications that generate jobs quickly, such as Spark Streaming. Author: jerryshao <saisai.shao@intel.com> Author: Saisai Shao <saisai.shao@intel.com> Closes #5064 from jerryshao/SPARK-5523 and squashes the following commits: 3e2412a [jerryshao] Address the comments b092a81 [Saisai Shao] Add a pool to cache the hostname	2015-07-14 19:54:02 -07:00
huangzhaowei	f957796c4b	[SPARK-8820] [STREAMING] Add a configuration to set checkpoint dir. Add a configuration to set the checkpoint directory, for user convenience. [Jira Address](https://issues.apache.org/jira/browse/SPARK-8820) Author: huangzhaowei <carlmartinmax@gmail.com> Closes #7218 from SaintBacchus/SPARK-8820 and squashes the following commits: d49fe4b [huangzhaowei] Rename the configuration name 66ea47c [huangzhaowei] Add the unit test. dd0acc1 [huangzhaowei] [SPARK-8820][Streaming] Add a configuration to set checkpoint dir.	2015-07-14 19:20:49 -07:00
Josh Rosen	cc57d705e7	[SPARK-9050] [SQL] Remove unused newOrdering argument from Exchange (cleanup after SPARK-8317) SPARK-8317 changed the SQL Exchange operator so that it no longer pushed sorting into Spark's shuffle layer, a change which allowed more efficient SQL-specific sorters to be used. This patch performs some leftover cleanup based on those changes: - Exchange's constructor should no longer accept a `newOrdering` since it's no longer used and no longer works as expected. - `addOperatorsIfNecessary` looked at shuffle input's output ordering to decide whether to sort, but this is the wrong node to be examining: it needs to look at whether the post-shuffle node has the right ordering, since shuffling will not preserve row orderings. Thanks to davies for spotting this. Author: Josh Rosen <joshrosen@databricks.com> Closes #7407 from JoshRosen/SPARK-9050 and squashes the following commits: e70be50 [Josh Rosen] No need to wrap line e866494 [Josh Rosen] Refactor addOperatorsIfNecessary to make code clearer 2e467da [Josh Rosen] Remove `newOrdering` from Exchange.	2015-07-14 18:55:34 -07:00
Josh Rosen	e965a798d0	[SPARK-9045] Fix Scala 2.11 build break in UnsafeExternalRowSorter This fixes a compilation break under Scala 2.11: ``` [error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java:135: error: <anonymous org.apache.spark.sql.execution.UnsafeExternalRowSorter$1> is not abstract and does not override abstract method <B>minBy(Function1<InternalRow,B>,Ordering<B>) in TraversableOnce [error] return new AbstractScalaRowIterator() { [error] ^ [error] where B,A are type-variables: [error] B extends Object declared in method <B>minBy(Function1<A,B>,Ordering<B>) [error] A extends Object declared in interface TraversableOnce [error] 1 error ``` The workaround for this is to make `AbstractScalaRowIterator` into a concrete class. Author: Josh Rosen <joshrosen@databricks.com> Closes #7405 from JoshRosen/SPARK-9045 and squashes the following commits: cbcbb4c [Josh Rosen] Forgot that we can't use the ??? operator anymore 577ba60 [Josh Rosen] [SPARK-9045] Fix Scala 2.11 build break in UnsafeExternalRowSorter.	2015-07-14 17:21:48 -07:00
Josh Rosen	11e5c37286	[SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses This pull request adds a Scalastyle regex rule which fails the style check if `Class.forName` is used directly. `Class.forName` always loads classes from the default / system classloader, but in a majority of cases, we should be using Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader which loaded Spark when the context classloader is not defined. Author: Josh Rosen <joshrosen@databricks.com> Closes #7350 from JoshRosen/ban-Class.forName and squashes the following commits: e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName c0b7885 [Josh Rosen] Hopefully fix the last two cases d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass 046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName 62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion. d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName	2015-07-14 16:08:17 -07:00
Sean Owen	740b034f1c	[SPARK-4362] [MLLIB] Make prediction probability available in NaiveBayesModel Add predictProbabilities to Naive Bayes, return class probabilities. Continues https://github.com/apache/spark/pull/6761 Author: Sean Owen <sowen@cloudera.com> Closes #7376 from srowen/SPARK-4362 and squashes the following commits: 23d5a76 [Sean Owen] Fix model.labels -> model.theta 95d91fb [Sean Owen] Check that predicted probabilities sum to 1 b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities	2015-07-14 22:44:54 +01:00
Liang-Chi Hsieh	4b5cfc988f	[SPARK-8800] [SQL] Fix inaccurate precision/scale of Decimal division operation JIRA: https://issues.apache.org/jira/browse/SPARK-8800 Previously, we turned to Java BigDecimal's divide with a specified ROUNDING_MODE to avoid the non-terminating decimal expansion problem. However, as JihongMA reported, division on some specific values gives inaccurate results. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #7212 from viirya/fix_decimal4 and squashes the following commits: 4205a0a [Liang-Chi Hsieh] Fix inaccuracy precision/scale of Decimal division operation.	2015-07-14 14:19:27 -07:00
zsxwing	fb1d06fc24	[SPARK-4072] [CORE] Display Streaming blocks in Streaming UI Replaces #6634. This PR adds `SparkListenerBlockUpdated` to SparkListener so that it can monitor all block update infos that are sent to `BlockManagerMasterEndpoint`, and also adds new tables in the Storage tab to display the stream block infos. ![screen shot 2015-07-01 at 5 19 46 pm](https://cloud.githubusercontent.com/assets/1000778/8451562/c291a6ec-2016-11e5-890d-0afc174e1f8c.png) Author: zsxwing <zsxwing@gmail.com> Closes #6672 from zsxwing/SPARK-4072-2 and squashes the following commits: df2c1d8 [zsxwing] Use xml query to check the xml elements 54d54af [zsxwing] Add unit tests for StoragePage e29fb53 [zsxwing] Update as per TD's comments ccbee07 [zsxwing] Fix the code style 6dc42b4 [zsxwing] Fix the replication level of blocks 450fad1 [zsxwing] Merge branch 'master' into SPARK-4072-2 1e9ef52 [zsxwing] Don't categorize by Executor ID ca0ab69 [zsxwing] Fix the code style 3de2762 [zsxwing] Make object BlockUpdatedInfo private e95b594 [zsxwing] Add 'Aggregated Stream Block Metrics by Executor' table ba5d0d1 [zsxwing] Refactor the unit test to improve the readability 4bbe341 [zsxwing] Revert JsonProtocol and don't log SparkListenerBlockUpdated b464dd1 [zsxwing] Add onBlockUpdated to EventLoggingListener 5ba014c [zsxwing] Fix the code style 0b1e47b [zsxwing] Add a developer api BlockUpdatedInfo 04838a9 [zsxwing] Fix the code style 2baa161 [zsxwing] Add unit tests 80f6c6d [zsxwing] Address comments 797ee4b [zsxwing] Display Streaming blocks in Streaming UI	2015-07-14 13:58:36 -07:00
Andrew Ray	0a4071eab3	[SPARK-8718] [GRAPHX] Improve EdgePartition2D for non perfect square number of partitions See https://github.com/aray/e2d/blob/master/EdgePartition2D.ipynb Author: Andrew Ray <ray.andrew@gmail.com> Closes #7104 from aray/edge-partition-2d-improvement and squashes the following commits: 3729f84 [Andrew Ray] correct bounds and remove unneeded comments 97f8464 [Andrew Ray] change less 5141ab4 [Andrew Ray] Merge branch 'master' into edge-partition-2d-improvement 925fd2c [Andrew Ray] use new interface for partitioning 001bfd0 [Andrew Ray] Refactor PartitionStrategy so that we can return a prtition function for a given number of parts. To keep compatibility we define default methods that translate between the two implementation options. Made EdgePartition2D use old strategy when we have a perfect square and implement new interface. 5d42105 [Andrew Ray] % -> / 3560084 [Andrew Ray] Merge branch 'master' into edge-partition-2d-improvement f006364 [Andrew Ray] remove unneeded comments cfa2c5e [Andrew Ray] Modifications to EdgePartition2D so that it works for non perfect squares.	2015-07-14 13:14:47 -07:00
Josh Rosen	d267c2834a	[SPARK-9031] Merge BlockObjectWriter and DiskBlockObject writer to remove abstract class BlockObjectWriter has only one concrete non-test class, DiskBlockObjectWriter. In order to simplify the code in preparation for other refactorings, I think that we should remove this base class and have only DiskBlockObjectWriter. While at one time we may have planned to have multiple BlockObjectWriter implementations, that doesn't seem to have happened, so the extra abstraction seems unnecessary. Author: Josh Rosen <joshrosen@databricks.com> Closes #7391 from JoshRosen/shuffle-write-interface-refactoring and squashes the following commits: c418e33 [Josh Rosen] Fix compilation 5047995 [Josh Rosen] Fix comments d5dc548 [Josh Rosen] Update references in comments 89dc797 [Josh Rosen] Rename test suite. 5755918 [Josh Rosen] Remove unnecessary val in case class 1607c91 [Josh Rosen] Merge BlockObjectWriter and DiskBlockObjectWriter	2015-07-14 12:56:17 -07:00
Andrew Or	8fb3a65cbb	[SPARK-8911] Fix local mode endless heartbeats As of #7173 we expect executors to properly register with the driver before responding to their heartbeats. This behavior is not matched in local mode. This patch adds the missing event that needs to be posted. Author: Andrew Or <andrew@databricks.com> Closes #7382 from andrewor14/fix-local-heartbeat and squashes the following commits: 1258bdf [Andrew Or] Post ExecutorAdded event to local executor	2015-07-14 12:47:11 -07:00
Brennon York	c4e98ff066	[SPARK-8933] [BUILD] Provide a --force flag to build/mvn that always uses downloaded maven Added a --force flag to download Maven, if necessary, and use the built-in version of Maven that is best for Spark. Author: Brennon York <brennon.york@capitalone.com> Closes #7374 from brennonyork/SPARK-8933 and squashes the following commits: d673127 [Brennon York] added --force flag to manually download, if necessary, and use a built-in version of maven best for spark	2015-07-14 11:43:26 -07:00
Michael Armbrust	37f2d9635f	[SPARK-9027] [SQL] Generalize metastore predicate pushdown Add support for pushing down metastore filters that are in different orders and add some unit tests. Author: Michael Armbrust <michael@databricks.com> Closes #7386 from marmbrus/metastoreFilters and squashes the following commits: 05a4524 [Michael Armbrust] [SPARK-9027][SQL] Generalize metastore predicate pushdown	2015-07-14 11:22:09 -07:00
Wenchen Fan	59d820aa8d	[SPARK-9029] [SQL] shortcut CaseKeyWhen if key is null Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7389 from cloud-fan/case-when and squashes the following commits: ea4b6ba [Wenchen Fan] shortcut for case key when	2015-07-14 10:20:15 -07:00
Daoyuan Wang	257236c3e1	[SPARK-6851] [SQL] function least/greatest follow up This is a follow up of remaining comments from #6851 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7387 from adrian-wang/udflgfollow and squashes the following commits: 6163e62 [Daoyuan Wang] add skipping null values e8c2e09 [Daoyuan Wang] use seq 8362966 [Daoyuan Wang] pr6851 follow up	2015-07-14 01:09:33 -07:00
zhaishidan	c1feebd8fc	[SPARK-9010] [DOCUMENTATION] Improve the Spark Configuration document about `spark.kryoserializer.buffer` The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.". The spark.kryoserializer.buffer.max.mb is out-of-date in spark 1.4. Author: zhaishidan <zhaishidan@haizhi.com> Closes #7393 from stanzhai/master and squashes the following commits: 69729ef [zhaishidan] fix document error about spark.kryoserializer.buffer.max.mb	2015-07-14 08:54:30 +01:00
Joseph Gonzalez	20c1434a8d	[SPARK-9001] Fixing errors in javadocs that lead to failed build/sbt doc These are minor corrections in the documentation of several classes that are preventing: ```bash build/sbt publish-local ``` I believe this might be an issue associated with running JDK8 as ankurdave does not appear to have this issue in JDK7. Author: Joseph Gonzalez <joseph.e.gonzalez@gmail.com> Closes #7354 from jegonzal/FixingJavadocErrors and squashes the following commits: 6664b7e [Joseph Gonzalez] making requested changes 2e16d89 [Joseph Gonzalez] Fixing errors in javadocs that prevents build/sbt publish-local from completing.	2015-07-14 00:32:29 -07:00
Cheolsoo Park	408b384de9	[SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning This PR supersedes my old one #6921. Since my patch has changed quite a bit, I am opening a new PR to make it easier to review. The changes include: * Implement `toMetastoreFilter()` function in `HiveShim` that takes `Seq[Expression]` and converts them into a filter string for Hive metastore. * This function matches all the `AttributeReference` + `BinaryComparisonOp` + `Integral/StringType` patterns in `Seq[Expression]` and folds them into a string. * Change `hiveQlPartitions` field in `MetastoreRelation` to `getHiveQlPartitions()` function that takes a filter string parameter. * Call `getHiveQlPartitions()` in `HiveTableScan` with a filter string. But there are some cases in which predicate pushdown is disabled (case: predicate pushdown): Hive integral and string types: Yes; Hive varchar type: No; Hive 0.13 and newer: Yes; Hive 0.12 and older: No; convertMetastoreParquet=false: Yes; convertMetastoreParquet=true: No. In case of `convertMetastoreParquet=true`, predicates are not pushed down because this conversion happens in an `Analyzer` rule (`HiveMetastoreCatalog.ParquetConversions`). At this point, `HiveTableScan` hasn't run, so predicates are not available. But reading the source code, I think it is intentional to convert the entire Hive table w/ all the partitions into `ParquetRelation` because then `ParquetRelation` can be cached and reused for any query against that table. Please correct me if I am wrong. cc marmbrus Author: Cheolsoo Park <cheolsoop@netflix.com> Closes #7216 from piaozhexiu/SPARK-6910-2 and squashes the following commits: aa1490f [Cheolsoo Park] Fix ordering of imports c212c4d [Cheolsoo Park] Incorporate review comments 5e93f9d [Cheolsoo Park] Predicate pushdown into Hive metastore	2015-07-13 19:45:10 -07:00
Neelesh Srinivas Salian	b7bcbe25f9	[SPARK-8743] [STREAMING] Deregister Codahale metrics for streaming when StreamingContext is closed The issue link: https://issues.apache.org/jira/browse/SPARK-8743 Deregister Codahale metrics for streaming when StreamingContext is closed Design: Adding the method calls in the appropriate start() and stop () methods for the StreamingContext Actions in the PullRequest: 1) Added the registerSource method call to the start method for the Streaming Context. 2) Added the removeSource method to the stop method. 3) Added comments for both 1 and 2 and comment to show initialization of the StreamingSource 4) Added a test case to check for both registration and de-registration of metrics Previous closed PR for reference: https://github.com/apache/spark/pull/7250 Author: Neelesh Srinivas Salian <nsalian@cloudera.com> Closes #7362 from nssalian/branch-SPARK-8743 and squashes the following commits: 7d998a3 [Neelesh Srinivas Salian] Removed the Thread.sleep() call 8b26397 [Neelesh Srinivas Salian] Moved the scalatest.{} import 0e8007a [Neelesh Srinivas Salian] moved import org.apache.spark{} to correct place daedaa5 [Neelesh Srinivas Salian] Corrected Ordering of imports 8873180 [Neelesh Srinivas Salian] Removed redundancy in imports 59227a4 [Neelesh Srinivas Salian] Changed the ordering of the imports to classify scala and spark imports d8cb577 [Neelesh Srinivas Salian] Added registerSource to start() and removeSource to stop(). Wrote a test to check the registration and de-registration	2015-07-13 15:46:51 -07:00
Hari Shreedharan	0aed38e449	[SPARK-8533] [STREAMING] Upgrade Flume to 1.6.0 Author: Hari Shreedharan <hshreedharan@apache.org> Closes #6939 from harishreedharan/upgrade-flume-1.6.0 and squashes the following commits: 94b80ae [Hari Shreedharan] [SPARK-8533][Streaming] Upgrade Flume to 1.6.0	2015-07-13 14:15:31 -07:00
Vinod K C	4c797f2b09	[SPARK-8636] [SQL] Fix equalNullSafe comparison Author: Vinod K C <vinod.kc@huawei.com> Closes #7040 from vinodkc/fix_CaseKeyWhen_equalNullSafe and squashes the following commits: be5e641 [Vinod K C] Renamed equalNullSafe to threeValueEquals aac9f67 [Vinod K C] Updated test suite and genCode method f2d0b53 [Vinod K C] Fix equalNullSafe comparison	2015-07-13 12:51:33 -07:00
Vinod K C	714fc55f4a	[SPARK-8991] [ML] Update SharedParamsCodeGen's Generated Documentation Removed private[ml] from Generated documentation Author: Vinod K C <vinod.kc@huawei.com> Closes #7367 from vinodkc/fix_sharedparmascodegen and squashes the following commits: 4fa3c8f [Vinod K C] Adding auto generated code 7e19025 [Vinod K C] Removed private[ml]	2015-07-13 12:03:39 -07:00
yongtang	5c41691fe4	[SPARK-8954] [BUILD] Remove unneeded deb repository from Dockerfile to fix build error in docker. [SPARK-8954] [Build] 1. Remove unneeded deb repository from Dockerfile to fix build error in docker. 2. Remove unneeded /var/lib/apt/lists/* after install to reduce the docker image size (by ~30MB). Author: yongtang <yongtang@users.noreply.github.com> Closes #7346 from yongtang/SPARK-8954 and squashes the following commits: 36024a1 [yongtang] [SPARK-8954] [Build] Remove unneeded /var/lib/apt/lists/* after install to reduce the docker image size (by ~30MB) 7084941 [yongtang] [SPARK-8954] [Build] Remove unneeded deb repository from Dockerfile to fix build error in docker.	2015-07-13 12:01:23 -07:00
Davies Liu	79c35826e6	Revert "[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark" This reverts commit `9b62e9375f`.	2015-07-13 11:30:36 -07:00
Carson Wang	5ca26fb64d	[SPARK-8950] [WEBUI] Correct the calculation of SchedulerDelay in StagePage In StagePage, the SchedulerDelay is calculated as totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime. But the totalExecutionTime is calculated in a way that doesn't include the gettingResultTime. Author: Carson Wang <carson.wang@intel.com> Closes #7319 from carsonwang/SchedulerDelayTime and squashes the following commits: f66fb6e [Carson Wang] Update the code style 7d971ae [Carson Wang] Correct the calculation of SchedulerDelay	2015-07-13 11:20:04 -07:00
MechCoder	9b62e9375f	[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark This adds Pylint checks to PySpark. For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script). We still need to figure out what rules to be allowed. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7241 from MechCoder/pylint and squashes the following commits: 8496834 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins 57393a3 [MechCoder] undefined-variable a8e2547 [MechCoder] Minor changes 7753810 [MechCoder] remove trailing whitespace 75c5d2b [MechCoder] Remove blacklisted arguments and pointless statements check 6bde250 [MechCoder] Disable all checks for now 3464666 [MechCoder] Add pylint configuration file d28109f [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark	2015-07-13 09:47:53 -07:00
Sun Rui	7f487c8bde	[SPARK-6797] [SPARKR] Add support for YARN cluster mode. This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node. This PR uses the JDK jar tool to package the SparkR package, because jar is thought to be available on both Linux/Windows platforms where JDK has been installed. This PR does not address the R worker involved in RDD API. Will address it in a separate JIRA issue. This PR does not address SBT build. SparkR installation and packaging by SBT will be addressed in a separate JIRA issue. R/install-dev.bat is not tested. shivaram, Could you help to test it? Author: Sun Rui <rui.sun@intel.com> Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits: ca63c86 [Sun Rui] Adjust MimaExcludes after rebase. 7313374 [Sun Rui] Fix unit test errors. 72695fb [Sun Rui] Fix unit test failures. 193882f [Sun Rui] Fix Mima test error. fe25a33 [Sun Rui] Fix Mima test error. 35ecfa3 [Sun Rui] Fix comments. c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes. b05340c [Sun Rui] Fix scala style. 2ca5048 [Sun Rui] Fix comments. 1acefd1 [Sun Rui] Fix scala style. 0aa1e97 [Sun Rui] Fix scala style. 41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs. 49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat. 7b916c5 [Sun Rui] Use 'rem' consistently. 3bed438 [Sun Rui] Add a comment. 681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes. cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.	2015-07-13 08:21:47 -07:00
Vincent D. Warmerdam	a5bc803b72	[SPARK-8596] Add module for rstudio link to spark shivaram, added module for rstudio install Author: Vincent D. Warmerdam <vincentwarmerdam@gmail.com> Closes #7366 from koaning/rstudio-install and squashes the following commits: e47c2da [Vincent D. Warmerdam] added rstudio module	2015-07-13 08:15:54 -07:00
Wenchen Fan	6b89943834	[SPARK-8944][SQL] Support casting between IntervalType and StringType Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7355 from cloud-fan/fromString and squashes the following commits: 3bbb9d6 [Wenchen Fan] fix code gen 7dab957 [Wenchen Fan] naming fix 0fbbe19 [Wenchen Fan] address comments ac1f3d1 [Wenchen Fan] Support casting between IntervalType and StringType	2015-07-13 00:49:39 -07:00
Daoyuan Wang	92540d22e4	[SPARK-8203] [SPARK-8204] [SQL] conditional function: least/greatest chenghao-intel zhichao-li qiansl127 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #6851 from adrian-wang/udflg and squashes the following commits: 0f1bff2 [Daoyuan Wang] address comments from davis 7a6bdbb [Daoyuan Wang] add '.' for hex() c1f6824 [Daoyuan Wang] add codegen, test for all types ec625b0 [Daoyuan Wang] conditional function: least/greatest	2015-07-13 00:14:32 -07:00
Davies Liu	20b474335c	[SPARK-9006] [PYSPARK] fix microsecond loss in Python 3 It may lose a microsecond if using timestamp as float; it should be `int` instead. Author: Davies Liu <davies@databricks.com> Closes #7363 from davies/fix_microsecond and squashes the following commits: 36f6007 [Davies Liu] fix microsecond loss in Python 3	2015-07-12 20:25:06 -07:00
Kay Ousterhout	30090884f9	[SPARK-8880] Fix confusing Stage.attemptId member variable Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #7275 from kayousterhout/SPARK-8880 and squashes the following commits: 3e9ce7c [Kay Ousterhout] Added missing return type e150278 [Kay Ousterhout] [SPARK-8880] Fix confusing Stage.attemptId member variable	2015-07-12 20:45:24 -04:00
Wenchen Fan	c472eb17ae	[SPARK-8970][SQL] remove unnecessary abstraction for ExtractValue Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7339 from cloud-fan/minor and squashes the following commits: 84a2128 [Wenchen Fan] remove unapply 6a37c12 [Wenchen Fan] remove unnecessary abstraction for ExtractValue	2015-07-10 23:25:11 -07:00
Joseph K. Bradley	0c5207c66d	[SPARK-8994] [ML] tiny cleanups to Params, Pipeline Made default impl of Params.validateParams empty CC mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7349 from jkbradley/pipeline-small-cleanups and squashes the following commits: 4e0f013 [Joseph K. Bradley] small cleanups after SPARK-5956	2015-07-10 21:25:09 -07:00
zhangjiajin	7f6be1f24d	[SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib Add parallel PrefixSpan algorithm and test file. Support non-temporal sequences. Author: zhangjiajin <zhangjiajin@huawei.com> Author: zhang jiajin <zhangjiajin@huawei.com> Closes #7258 from zhangjiajin/master and squashes the following commits: ca9c4c8 [zhangjiajin] Modified the code according to the review comments. 574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization. ba5df34 [zhangjiajin] Fix a Scala style error. 4c60fb3 [zhangjiajin] Fix some Scala style errors. 1dd33ad [zhangjiajin] Modified the code according to the review comments. 89bc368 [zhangjiajin] Fixed a Scala style error. a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala 951fd42 [zhang jiajin] Delete Prefixspan.scala 575995f [zhangjiajin] Modified the code according to the review comments. 91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.	2015-07-10 21:11:46 -07:00
jose.cambronero	9c5075775d	[SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs This contribution is my original work and I license it to the project under its open source license. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #6994 from josepablocam/master and squashes the following commits: bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name 0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md 1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf) 1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests 2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly 7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info e760ebd [jose.cambronero] line length changes to fit style check 3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part 9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs 3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity 992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach. 6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal) 4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below 0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm 16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request b9cff3a [jose.cambronero] made small changes to pass style check ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite 4da189b [jose.cambronero] added user facing ks test functions c659ea1 [jose.cambronero] created KS test class 13dfe4d [jose.cambronero] created test result class for ks test	2015-07-10 20:55:45 -07:00
Scott Taylor	6e1c7e2798	[SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe commands This will allow problems with piped commands to be detected. This will also allow tasks to be retried where errors are rare (such as network problems in piped commands). Author: Scott Taylor <github@megatron.me.uk> Closes #6262 from megatron-me-uk/patch-2 and squashes the following commits: 04ae1d5 [Scott Taylor] Remove spurious empty line 98fa101 [Scott Taylor] fix blank line style error 574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4 0c1e762 [Scott Taylor] Update rdd pipe method for checkCode ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode eb4801c [Scott Taylor] fix fail_condition b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1 a307d13 [Scott Taylor] update rdd tests to test pipe modes 34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe a0c0161 [Scott Taylor] fix generator issue 8a9ef9c [Scott Taylor] make check_return_code an iterator 0486ae3 [Scott Taylor] style fixes 8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock 4153b02 [Scott Taylor] fix list.sort returns None 491d3fc [Scott Taylor] Pass a function handle to assertRaises 3344a21 [Scott Taylor] wrap assertRaises with QuietTest 3ab8c7a [Scott Taylor] remove whitespace for style cc1a73d [Scott Taylor] fix style issues in pipe test 8db4073 [Scott Taylor] Add a test for rdd pipe functions 1b3dc4e [Scott Taylor] fix missing space around operator style 0974f98 [Scott Taylor] add space between words in multiline string 45f4977 [Scott Taylor] fix line too long style error 5745d85 [Scott Taylor] Remove space to fix style f552d49 [Scott Taylor] Catch non-zero exit from pipe commands	2015-07-10 19:29:32 -07:00
Cheng Lian	3363088368	[SPARK-8961] [SQL] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row This is a follow-up of [SPARK-8888] [1], which also aims to optimize writing dynamic partitions. Three more changes can be made here: 1. Using `InternalRow` instead of `Row` in `BaseWriterContainer.outputWriterForRow` 2. Using `Cast` expressions to convert partition columns to strings, so that we can leverage code generation. 3. Replacing the FP-style `zip` and `map` calls with a faster imperative `while` loop. [1]: https://issues.apache.org/jira/browse/SPARK-8888 Author: Cheng Lian <lian@databricks.com> Closes #7331 from liancheng/spark-8961 and squashes the following commits: b5ab9ae [Cheng Lian] Casts Java iterator to Scala iterator explicitly 719e63b [Cheng Lian] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row	2015-07-10 18:15:36 -07:00
Davies Liu	b6fc0adf68	add inline comment for python tests	2015-07-10 17:44:21 -07:00

11896 commits