ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Davies Liu	6a1c864ab6	[SPARK-12295] [SQL] external spilling for window functions This PR manages the memory used by window functions (buffered rows) and also enables external spilling. After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G. Author: Davies Liu <davies@databricks.com> Closes #10605 from davies/unsafe_window.	2016-01-06 23:21:52 -08:00
zzcclp	84e77a15df	[DOC] fix 'spark.memory.offHeap.enabled' default value to false modify 'spark.memory.offHeap.enabled' default value to false Author: zzcclp <xm_zzc@sina.com> Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.	2016-01-06 23:06:21 -08:00
Yin Huai	e5cde7ab11	Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None" This reverts commit `fcd013cf70`. Author: Yin Huai <yhuai@databricks.com> Closes #10632 from yhuai/pythonStyle.	2016-01-06 22:03:31 -08:00
Guillaume Poulin	b673852037	[SPARK-12678][CORE] MapPartitionsRDD clearDependencies MapPartitionsRDD was keeping a reference to `prev` after a call to `clearDependencies` which could lead to memory leak. Author: Guillaume Poulin <poulin.guillaume@gmail.com> Closes #10623 from gpoulin/map_partition_deps.	2016-01-06 21:34:46 -08:00
jerryshao	174e72ceca	[SPARK-12673][UI] Add missing uri prepending for job description Otherwise the URL will fail to proxy to the right one in YARN mode. Here is the screenshot: ![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png) Author: jerryshao <sshao@hortonworks.com> Closes #10618 from jerryshao/SPARK-12673.	2016-01-06 21:28:29 -08:00
Josh Rosen	8e19c7663a	[SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code. Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard-to-diagnose bugs. For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads. Author: Josh Rosen <joshrosen@databricks.com> Closes #10534 from JoshRosen/remove-ttl-based-cleaning.	2016-01-06 20:50:31 -08:00
Robert Dodier	6b6d02be0d	[SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663). For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html) Author: Robert Dodier <robert_dodier@users.sourceforge.net> Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.	2016-01-06 19:49:10 -08:00
Nong Li	a74d743cc7	[SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do this. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10589 from nongli/spark-12640.	2016-01-06 19:20:43 -08:00
Sean Owen	ac56cf605b	[SPARK-12604][CORE] Java count(ApproxDistinct)ByKey methods return Scala Long not Java Change Java countByKey, countApproxDistinctByKey return types to use Java Long, not Scala; update similar methods for consistency on java.lang.Long.valueOf with no API change Author: Sean Owen <sowen@cloudera.com> Closes #10554 from srowen/SPARK-12604.	2016-01-06 17:17:32 -08:00
Wenchen Fan	917d3fc069	[SPARK-12539][SQL] support writing bucketed table This PR adds bucket write support to Spark SQL. Users can specify bucketing columns, numBuckets and sorting columns with or without partition columns. For example: ``` df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales") ``` When bucketing is used, we will calculate the bucket id for each record, and group the records by bucket id. For each group, we will create a file with the bucket id in its name, and write data into it. For each bucket file, if sorting columns are specified, the data will be sorted before write. Note that there may be multiple files for one bucket, as the data is distributed. Currently we store the bucket metadata in the hive metastore in a non-hive-compatible way. We use a different bucketing hash function compared to hive, so we can't be compatible anyway. Limitations: * Can't write bucketed data without hive metastore. * Can't insert bucketed data into existing hive tables. Author: Wenchen Fan <wenchen@databricks.com> Closes #10498 from cloud-fan/bucket-write.	2016-01-06 16:58:10 -08:00
Davies Liu	6f7ba6409a	[SPARK-12681] [SQL] split IdentifiersParser.g into two files To avoid having a huge Java source file (over 64K LOC) that can't be compiled. cc hvanhovell Author: Davies Liu <davies@databricks.com> Closes #10624 from davies/split_ident.	2016-01-06 15:54:00 -08:00
Shixiong Zhu	cbaea9591f	Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url." This reverts commit `19e4e9febf`. Will merge #10618 instead.	2016-01-06 13:51:50 -08:00
huangzhaowei	19e4e9febf	[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url. Author: huangzhaowei <carlmartinmax@gmail.com> Closes #10617 from SaintBacchus/SPARK-12672.	2016-01-06 12:48:57 -08:00
Shixiong Zhu	1e6648d62f	[SPARK-12617][PYSPARK] Move Py4jCallbackConnectionCleaner to Streaming Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10621 from zsxwing/SPARK-12617-2.	2016-01-06 12:03:01 -08:00
BenFradet	f82ebb1522	[SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC". Also, in the documentation, it is said that: "The default metric used to choose the best ParamMap can be overriden by the setMetric method in each of these evaluators." However, the method is called setMetricName. This PR aims to fix both issues. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10328 from BenFradet/SPARK-12368.	2016-01-06 12:01:05 -08:00
zero323	fcd013cf70	[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None If initial model passed to GMM is not empty it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9986 from zero323/SPARK-12006.	2016-01-06 11:58:33 -08:00
Herman van Hovell	ea489f14f1	[SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to using this parser for all of our SQL parsing. The following key changes have been made: The ANTLR Parser & Supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean up the ```ASTNode``` class, and to improve the error handling. The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project: - ```CatalystQl```: This implements Query and Expression parsing functionality. - ```SparkQl```: This is a subclass of ```CatalystQl``` and provides SQL/Core only functionality such as Explain and Describe. - ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10583 from hvanhovell/SPARK-12575.	2016-01-06 11:16:53 -08:00
Yanbo Liang	3aa3488225	[SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do at Scala side. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9807 from yanboliang/spark-11815.	2016-01-06 10:52:25 -08:00
Yanbo Liang	95eb651633	[SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml Add ```computeCost``` to ```KMeansModel``` as evaluator for PySpark spark.ml. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9931 from yanboliang/SPARK-11945.	2016-01-06 10:50:02 -08:00
Joshi	007da1a9dc	[SPARK-11531][ML] SparseVector error Msg PySpark SparseVector should have "Found duplicate indices" error message Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #9525 from rekhajoshm/SPARK-11531.	2016-01-06 10:48:14 -08:00
Holden Karau	3b29004d24	[SPARK-7675][ML][PYSPARK] sparkml params type conversion From JIRA: Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method. A possible fix will be to include a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available. This fix instead checks the types at set time since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float and other conversions (like scipymatrix to array) are left for the future. Author: Holden Karau <holden@us.ibm.com> Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.	2016-01-06 10:43:03 -08:00
Yash Datta	9061e777fd	[SPARK-11878][SQL] Eliminate distribute by in case group by is present with exactly the same grouping expressi For queries like `select <> from table group by a distribute by a`, we can eliminate the distribute by, since group by will do a hash partitioning anyway. This also applies when the user uses the DataFrame API. Author: Yash Datta <Yash.Datta@guavus.com> Closes #9858 from saucam/eliminatedistribute.	2016-01-06 10:37:53 -08:00
Kousuke Saruta	94c202c7d2	[SPARK-12665][CORE][GRAPHX] Remove Vector, VectorSuite and GraphKryoRegistrator which are deprecated and no longer used The entire code of Vector.scala, VectorSuite.scala and GraphKryoRegistrator.scala is no longer used, so it's time to remove them in Spark 2.0. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10613 from sarutak/SPARK-12665.	2016-01-06 10:19:41 -08:00
QiangCai	5d871ea43e	[SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync I have closed pull request https://github.com/apache/spark/pull/10487 and created this pull request to resolve the problem. Spark JIRA: https://issues.apache.org/jira/browse/SPARK-12340 Author: QiangCai <david.caiq@gmail.com> Closes #10562 from QiangCai/bugfix.	2016-01-06 18:13:07 +09:00
Liang-Chi Hsieh	b2467b3810	[SPARK-12578][SQL] Distinct should not be silently ignored when used in an aggregate function with OVER clause JIRA: https://issues.apache.org/jira/browse/SPARK-12578 A slight update to the Hive parser. We should keep the distinct keyword when it is used in an aggregate function with an OVER clause, so that CheckAnalysis will detect it and throw an exception later. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10557 from viirya/keep-distinct-hivesql.	2016-01-06 00:40:14 -08:00
Yanbo Liang	d1fea41363	[SPARK-12393][SPARKR] Add read.text and write.text for SparkR Add ```read.text``` and ```write.text``` for SparkR. cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10348 from yanboliang/spark-12393.	2016-01-06 12:05:41 +05:30
Marcelo Vanzin	b3ba1be3b7	[SPARK-3873][TESTS] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10582 from vanzin/SPARK-3873-tests.	2016-01-05 19:07:39 -08:00
Marcelo Vanzin	7a375bb87a	[SPARK-3873][CORE] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10578 from vanzin/SPARK-3873-core.	2016-01-05 19:02:25 -08:00
Davies Liu	70fe6ce52f	[SPARK-12659] fix NPE in UnsafeExternalSorter (used by cartesian product) Cartesian product uses UnsafeExternalSorter without a comparator to do spilling, so it will throw an NPE if spilling happens. This bug is also hit by #10605. cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #10606 from davies/fix_spilling.	2016-01-05 18:46:52 -08:00
sureshthalamati	0d42292f6a	[SPARK-12504][SQL] Masking credentials in the sql plan explain output for JDBC data sources. This fix masks JDBC credentials in the explain output. URL patterns for specifying credentials seem to vary between different databases. Added a new method to the dialect to mask the credentials according to the database-specific URL pattern. While adding tests I noticed the explain output includes an array variable for partitions ([Lorg.apache.spark.Partition;3ff74546,). Modified the code to include the first and last partition information. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #10452 from sureshthalamati/mask_jdbc_credentials_spark-12504.	2016-01-05 17:48:05 -08:00
Marcelo Vanzin	df8bd97520	[SPARK-3873][SQL] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10573 from vanzin/SPARK-3873-sql.	2016-01-05 16:48:59 -08:00
Kai Jiang	1537e55604	[SPARK-12041][ML][PYSPARK] Add columnSimilarities to IndexedRowMatrix Add `columnSimilarities` to IndexedRowMatrix for PySpark spark.mllib.linalg. Author: Kai Jiang <jiangkai@gmail.com> Closes #10158 from vectorijk/spark-12041.	2016-01-05 15:33:27 -08:00
BrianLondon	ff89975543	[SPARK-12453][STREAMING] Remove explicit dependency on aws-java-sdk Successfully ran kinesis demo on a live, aws hosted kinesis stream against master and 1.6 branches. For reasons I don't entirely understand it required a manual merge to 1.5 which I did as shown here: `075c22e89b` The demo ran successfully on the 1.5 branch as well. According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the kinesis regression in 1.5.2. Author: BrianLondon <brian@seatgeek.com> Closes #10492 from BrianLondon/remove-only.	2016-01-05 23:15:07 +00:00
RJ Nowling	78015a8b7c	[SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans SPARK-12450 . Un-persist broadcasted variables in KMeans. Author: RJ Nowling <rnowling@gmail.com> Closes #10415 from rnowling/spark-12450.	2016-01-05 15:05:04 -08:00
Yanbo Liang	1c6cf1a563	[SPARK-12570][ML][DOC] DecisionTreeRegressor: provide variance of prediction: user guide update Update user guide doc for ```DecisionTreeRegressor``` providing variance of prediction. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10594 from yanboliang/spark-12570.	2016-01-05 14:24:32 -08:00
Shixiong Zhu	6cfe341ee8	[SPARK-12511] [PYSPARK] [STREAMING] Make sure PythonDStream.registerSerializer is called only once There is an issue that Py4J's PythonProxyHandler.finalize blocks forever. (https://github.com/bartdag/py4j/pull/184) Py4j will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one, then the first one will be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that "PythonProxyHandler" on the Java side won't be GCed. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10514 from zsxwing/SPARK-12511.	2016-01-05 13:48:47 -08:00
Nong	c26d174265	[SPARK-12636] [SQL] Update UnsafeRowParquetRecordReader to support reading files directly. As noted in the code, this change is to make this component easier to test in isolation. Author: Nong <nongli@gmail.com> Closes #10581 from nongli/spark-12636.	2016-01-05 13:47:24 -08:00
Yanbo Liang	13a3b636d9	[SPARK-6724][MLLIB] Support model save/load for FPGrowthModel Author: Yanbo Liang <ybliang8@gmail.com> Closes #9267 from yanboliang/spark-6724.	2016-01-05 13:31:59 -08:00
Shixiong Zhu	047a31bb10	[SPARK-12617] [PYSPARK] Clean up the leaked sockets of Py4J This patch adds Py4jCallbackConnectionCleaner to clean up the leaked sockets of Py4J every 30 seconds. This is a workaround until Py4J fixes the leak issue https://github.com/bartdag/py4j/issues/187 Author: Shixiong Zhu <shixiong@databricks.com> Closes #10579 from zsxwing/SPARK-12617.	2016-01-05 13:10:46 -08:00
Liang-Chi Hsieh	d202ad2fc2	[SPARK-12439][SQL] Fix toCatalystArray and MapObjects JIRA: https://issues.apache.org/jira/browse/SPARK-12439 In toCatalystArray, we should look at the data type returned by dataTypeFor instead of silentSchemaFor to determine whether the element is a native type. An obvious problem is when the element is of class Option[Int]: silentSchemaFor will return Int, and we will then wrongly recognize the element as a native type. There is another problem when using Option as an array element. When we encode data like Seq(Some(1), Some(2), None) with an encoder, we will use MapObjects to construct an array for it later. But in MapObjects, we don't check whether the return value of lambdaFunction is null or not. That causes a bug where the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1), instead of Seq(1, 2, null). Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10391 from viirya/fix-catalystarray.	2016-01-05 12:33:21 -08:00
Reynold Xin	8ce645d4ee	[SPARK-12615] Remove some deprecated APIs in RDD/SparkContext I looked at each case individually and it looks like they can all be removed. The only one that I had to think twice about was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray returning java.util.List). Author: Reynold Xin <rxin@databricks.com> Closes #10569 from rxin/SPARK-12615.	2016-01-05 11:10:14 -08:00
Wenchen Fan	76768337be	[SPARK-12480][FOLLOW-UP] use a single column vararg for hash Address comments in #10435. This makes the API easier to use if users programmatically generate the call to hash, and they will get an analysis exception if the arguments of hash are empty. Author: Wenchen Fan <wenchen@databricks.com> Closes #10588 from cloud-fan/hash.	2016-01-05 10:23:36 -08:00
Liang-Chi Hsieh	9a6ba7e2c5	[SPARK-12643][BUILD] Set lib directory for antlr JIRA: https://issues.apache.org/jira/browse/SPARK-12643 Without setting lib directory for antlr, the updates of imported grammar files can not be detected. So SparkSqlParser.g will not be rebuilt automatically. Since it is a minor update, no JIRA ticket is opened. Let me know if it is needed. Thanks. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10571 from viirya/antlr-build.	2016-01-05 10:21:47 -08:00
Liang-Chi Hsieh	b3c48e39f4	[SPARK-12438][SQL] Add SQLUserDefinedType support for encoder JIRA: https://issues.apache.org/jira/browse/SPARK-12438 ScalaReflection lacks the support of SQLUserDefinedType. We should add it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10390 from viirya/encoder-udt.	2016-01-05 10:19:56 -08:00
Imran Younus	1cdc42d2b9	[SPARK-12331][ML] R^2 for regression through the origin. Modified the definition of R^2 for regression through origin. Added modified test for regression metrics. Author: Imran Younus <iyounus@us.ibm.com> Author: Imran Younus <imranyounus@gmail.com> Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.	2016-01-05 11:48:45 +00:00
Kousuke Saruta	8eb2dc7133	[SPARK-12641] Remove unused code related to Hadoop 0.23 Currently we don't support Hadoop 0.23, but there is still some code related to it, so let's clean it up. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10590 from sarutak/SPARK-12641.	2016-01-05 00:39:50 -08:00
Michael Armbrust	53beddc5bf	[SPARK-12568][SQL] Add BINARY to Encoders Author: Michael Armbrust <michael@databricks.com> Closes #10516 from marmbrus/datasetCleanup.	2016-01-04 23:23:41 -08:00
Marcelo Vanzin	7058dc1150	[SPARK-3873][EXAMPLES] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10575 from vanzin/SPARK-3873-examples.	2016-01-04 22:42:54 -08:00
felixcheung	cc4d5229c9	[SPARK-12625][SPARKR][SQL] replace R usage of Spark SQL deprecated API rxin davies shivaram Took save mode from my PR #10480, and move everything to writer methods. This is related to PR #10559 - [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into some more tonight. (fixed) Author: felixcheung <felixcheung_m@hotmail.com> Closes #10584 from felixcheung/rremovedeprecated.	2016-01-04 22:32:07 -08:00
Reynold Xin	b634901bb2	[SPARK-12600][SQL] follow up: add range check for DecimalType This addresses davies' code review feedback in https://github.com/apache/spark/pull/10559 Author: Reynold Xin <rxin@databricks.com> Closes #10586 from rxin/remove-deprecated-sql-followup.	2016-01-04 21:05:27 -08:00

14525 commits