Commit graph

296 commits

Author SHA1 Message Date
wangfei c3d91da5ea [SPARK-4861][SQL] Refactor command in spark sql
Remove `Command` and use `RunnableCommand` instead.

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #3712 from scwf/cmd and squashes the following commits:

51a82f2 [wangfei] fix test failure
0e03be8 [wangfei] address comments
4033bed [scwf] remove CreateTableAsSelect in hivestrategy
5d20010 [wangfei] address comments
125f542 [scwf] factory command in spark sql
2014-12-18 20:24:56 -08:00
Thu Kyaw b68bc6d264 [SPARK-3928][SQL] Support wildcard matches on Parquet files.
Make `parquetFile` accept Hadoop glob patterns in the path.
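
For example, with this change one might read all matching part files at once (hypothetical path):

```scala
// Hypothetical HDFS path; any Hadoop glob pattern should be accepted by parquetFile after this change.
val events = sqlContext.parquetFile("hdfs://namenode:8020/data/events/part-r-*.parquet")
events.registerTempTable("events")
```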

Author: Thu Kyaw <trk007@gmail.com>

Closes #3407 from tkyaw/master and squashes the following commits:

19115ad [Thu Kyaw] Merge https://github.com/apache/spark
ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
2014-12-18 20:08:32 -08:00
Cheng Hao f728e0fe7e [SPARK-2663] [SQL] Support the Grouping Set
Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the virtual column `GROUPING__ID`.

More details on how to use `GROUPING SETS` can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf

The general idea of the implementation is:
1. Replace `ROLLUP` and `CUBE` with `GROUPING SETS`
2. Explode each input row, and then feed the results to `Aggregate`
  * Each grouping set is represented as a bit mask over the `GroupBy Expression List`; for each bit, `1` means the expression is selected, otherwise `0` (the left is the lower bit and the right is the higher bit in the `GroupBy Expression List`)
  * Several projections are constructed according to the grouping sets, and within each projection (`Seq[Expression]`) we replace an expression with `Literal(null)` if it is not selected in the grouping set (based on the bit mask)
  * The output schema of `Explode` is `child.output :+ grouping__id`
  * The GroupBy expressions of `Aggregate` are `GroupBy Expression List :+ grouping__id`
  * The aggregation expressions of `Aggregate` are kept the same

The expression substitutions happen during Logical Plan analysis, so we benefit from Logical Plan optimizations (e.g. expression constant folding, map-side aggregation, etc.). Only an `Explosive` operator is added to the Physical Plan, which explodes the rows according to the pre-set projections.
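
For illustration, a query along these lines (hypothetical table and columns) exercises the new support through `HiveContext`:

```scala
// Hypothetical table sales(city, category, revenue); GROUPING__ID identifies which grouping set produced each row.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("""
  SELECT city, category, SUM(revenue), GROUPING__ID
  FROM sales
  GROUP BY city, category
  GROUPING SETS ((city, category), (city), ())
""").collect().foreach(println)
```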

A known issue to be addressed in a follow-up PR:
* The `ColumnPruning` optimization is not supported yet for the `Explosive` node.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:

fe65fcc [Cheng Hao] Remove the extra space
3547056 [Cheng Hao] Add more doc and Simplify the Expand
a7c869d [Cheng Hao] update code as feedbacks
d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
414b165 [Cheng Hao] revert the unnecessary changes
ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
2014-12-18 18:58:29 -08:00
Cheng Hao 8d0d2a65eb [SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or nul...
```
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
```
Since an empty string (the "headers" value) is initially treated as a String (in lines 2 and 3), the real nested data type (the struct type "headers" in line 1) is ignored, and the "headers" value in line 1 is also taken as StringType, which is not what we expect.
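
With the fix, sampling those records should infer the nested struct type instead of falling back to StringType; a quick way to check (illustrative):

```scala
// The empty-string / empty-object "headers" values no longer mask the struct type seen in line 1.
val records = TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
TestSQLContext.jsonRDD(records).printSchema()
```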

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3708 from chenghao-intel/json and squashes the following commits:

e7a72e9 [Cheng Hao] add more concise unit test
853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
2014-12-17 15:01:59 -08:00
Cheng Lian 6277135376 [SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet
Predicates like `a = NULL` and `a < NULL` can't be pushed down since the Parquet `Lt`, `LtEq`, `Gt`, `GtEq` filters don't accept null values. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.

However, normally this issue doesn't cause an NPE because any value compared to `NULL` results in `NULL`, and Spark SQL automatically optimizes out `NULL` predicates in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as a blocker and I do **NOT** think we need to backport this to branch-1.1.)

This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with a null value to push down `IsNull` and `IsNotNull`. It also adds support for the Parquet `NotEq` filter, for completeness and a (tiny) performance gain; it's also used to push down `IsNotNull`.
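
As a reminder of the semantics involved (hypothetical table `t`, illustrative only):

```scala
// Comparing anything to NULL yields NULL, so this returns no rows regardless of the data;
// SimplifyFilters normally optimizes such predicates away before they ever reach Parquet.
sqlContext.sql("SELECT * FROM t WHERE a = NULL").count()

// IS NULL / IS NOT NULL are the forms that map onto Parquet Eq(null) / NotEq(null) pushdown.
sqlContext.sql("SELECT * FROM t WHERE a IS NOT NULL").count()
```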

Author: Cheng Lian <lian@databricks.com>

Closes #3367 from liancheng/filters-with-null and squashes the following commits:

cc41281 [Cheng Lian] Fixes several styling issues
de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
2014-12-17 12:48:04 -08:00
Cheng Hao 5fdcbdc0c9 [SPARK-4625] [SQL] Add sort by for DSL & SimpleSqlParser
Add `sort by` support for both DSL & SqlParser.

This PR is related to #3386; whichever one is merged first will require the other to be rebased.
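
A quick usage sketch (hypothetical table and column names; the DSL method name is assumed from this patch):

```scala
// Parser side: SORT BY sorts within each partition, unlike ORDER BY, which is a total ordering.
sqlContext.sql("SELECT key, value FROM records SORT BY key")

// DSL side (method name assumed from this patch):
// schemaRDD.sortBy('key.asc)
```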

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3481 from chenghao-intel/sortby and squashes the following commits:

041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
2014-12-17 12:01:57 -08:00
scwf 60698801eb [SPARK-4618][SQL] Make foreign DDL commands options case-insensitive
Use lowercase for the `options` keys to make them case-insensitive; we then use lowercase keys to look up values from the parameters. So the following command works:
```
      create temporary table normal_parquet
      USING org.apache.spark.sql.parquet
      OPTIONS (
        PATH '/xxx/data'
      )
```
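
A minimal sketch of the idea behind a case-insensitive option map (illustrative, not the actual `CaseInsensitiveMap` added here):

```scala
// Lowercases keys on construction and on lookup so that PATH, Path and path all resolve to the same value.
class LowerCaseKeyMap(underlying: Map[String, String]) extends Map[String, String] {
  private val baseMap = underlying.map { case (k, v) => k.toLowerCase -> v }
  override def get(key: String): Option[String] = baseMap.get(key.toLowerCase)
  override def iterator: Iterator[(String, String)] = baseMap.iterator
  override def +[B1 >: String](kv: (String, B1)): Map[String, B1] = baseMap + kv
  override def -(key: String): Map[String, String] = new LowerCaseKeyMap(baseMap - key.toLowerCase)
}
```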

Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3470 from scwf/ddl-ulcase and squashes the following commits:

ae78509 [scwf] address comments
8f4f585 [wangfei] address comments
3c132ef [scwf] minor fix
a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
4f86401 [scwf] adding CaseInsensitiveMap
e244e8d [wangfei] using lower case in json
e0cb017 [wangfei] make options in-casesensitive
2014-12-16 21:26:36 -08:00
Davies Liu ec5c4279ed [SPARK-4866] support StructType as key in MapType
This PR brings support for using StructType (and other hashable types) as a key in MapType.

Author: Davies Liu <davies@databricks.com>

Closes #3714 from davies/fix_struct_in_map and squashes the following commits:

68585d7 [Davies Liu] fix primitive types in MapType
9601534 [Davies Liu] support StructType as key in MapType
2014-12-16 21:23:28 -08:00
Cheng Hao 770d8153a5 [SPARK-4375] [SQL] Add 0 argument support for udf
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3595 from chenghao-intel/udf0 and squashes the following commits:

a858973 [Cheng Hao] Add 0 arguments support for udf
2014-12-16 21:21:11 -08:00
Cheng Lian 3b395e1051 [SPARK-4798][SQL] A new set of Parquet testing API and test suites
This PR provides a set of Parquet testing APIs (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API is added, aiming to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, the old testing code is not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:

- `ParquetQuerySuite`
- `ParquetTestData`

Author: Cheng Lian <lian@databricks.com>

Closes #3644 from liancheng/parquet-tests and squashes the following commits:

800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
2014-12-16 21:16:03 -08:00
Jacky Li fa66ef6c97 [SPARK-4269][SQL] make wait time configurable in BroadcastHashJoin
In BroadcastHashJoin, a hard-coded value (5 minutes) is currently used to wait for the execution and broadcast of the small table.
In my opinion, it should be a configurable value, since a broadcast may exceed 5 minutes in some cases, such as in a busy or congested network environment.
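
With this patch the timeout becomes a SQLConf setting, e.g. (timeout value in seconds assumed):

```scala
// Raise the broadcast wait time from the hard-coded 5 minutes to 10 minutes for this context.
sqlContext.setConf("spark.sql.broadcastTimeout", "600")
```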

Author: Jacky Li <jacky.likun@huawei.com>

Closes #3133 from jackylk/timeout-config and squashes the following commits:

733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
557acd4 [Jacky Li] switch to sqlContext.getConf
81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
2014-12-16 15:34:59 -08:00
tianyi 30f6b85c81 [SPARK-4483][SQL]Optimization about reduce memory costs during the HashOuterJoin
In `HashOuterJoin.scala`, Spark reads data from both sides of the join operation before zipping them together, which is a waste of memory. We now read data from only one side, put it into a hashmap, and then generate the `JoinedRow` with data from the other side one row at a time.
Currently, we only do this optimization for `left outer join` and `right outer join`. `full outer join` will be handled in another issue.
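
Conceptually, the build-one-side / stream-the-other pattern looks like the sketch below (greatly simplified, not the actual operator code):

```scala
// Only the buffered (build) side is materialized into a hash map; the streamed side is
// consumed row by row, padding misses with None, which plays the role of the null row.
def leftOuterJoin[K, L, R](streamed: Iterator[(K, L)],
                           buffered: Iterator[(K, R)]): Iterator[(L, Option[R])] = {
  val hashed: Map[K, Seq[R]] = buffered.toSeq.groupBy(_._1).mapValues(_.map(_._2)).toMap
  streamed.flatMap { case (k, l) =>
    hashed.getOrElse(k, Seq.empty) match {
      case Seq() => Iterator((l, None: Option[R]))
      case rs    => rs.iterator.map(r => (l, Some(r): Option[R]))
    }
  }
}
```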

For a benchmark where:
- table test_csv contains 1 million records
- table dim_csv contains 10 thousand records

SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`

the result is:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```

Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>

Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:

72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
2014-12-16 15:22:29 -08:00
zsxwing 6530243a52 [SPARK-4812][SQL] Fix the initialization issue of 'codegenEnabled'
The problem is that `codegenEnabled` is a `val`, but it uses a `val` `sqlContext` that can be overridden by subclasses. Here is a simple example to show this issue.

```Scala
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
    println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
    if (sqlContext != null) {
      true
    } else {
      false
    }
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar
```

We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.

Author: zsxwing <zsxwing@gmail.com>

Closes #3660 from zsxwing/SPARK-4812 and squashes the following commits:

1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
2014-12-16 14:13:40 -08:00
jerryshao dc8280dcca [SPARK-4847][SQL]Fix "extraStrategies cannot take effect in SQLContext" issue
Author: jerryshao <saisai.shao@intel.com>

Closes #3698 from jerryshao/SPARK-4847 and squashes the following commits:

4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
2014-12-16 14:08:28 -08:00
Sasaki Toru 8091dd62ea [SPARK-4742][SQL] The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
When I write a Parquet file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero-padded, while RDD#saveAsTextFile does zero-pad.
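
For reference, zero padding just means formatting the partition/attempt id with a fixed width, e.g.:

```scala
// "part-r-7.parquet" becomes "part-r-00007.parquet" with a zero-padded, fixed-width id.
val partitionId = 7
val fileName = f"part-r-$partitionId%05d.parquet"
```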

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #3602 from sasakitoa/parquet-zeroPadding and squashes the following commits:

6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding
20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat
2014-12-11 22:54:21 -08:00
Cheng Hao bf40cf89e3 [SPARK-4713] [SQL] SchemaRDD.unpersist() should not raise exception if it is not persisted
Unpersisting an uncached RDD does not raise an exception, for example:
```
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.unpersist(true)
```

But `SchemaRDD` raises an exception if the `SchemaRDD` is not cached. Since `SchemaRDD` is a subclass of `RDD`, it should follow the same behavior.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3572 from chenghao-intel/try_uncache and squashes the following commits:

50a7a89 [Cheng Hao] SchemaRDD.unpersist() should not raise exception if it is not persisted
2014-12-11 22:41:36 -08:00
Jacky Li 944384363d [SQL] remove unnecessary import in spark-sql
Author: Jacky Li <jacky.likun@huawei.com>

Closes #3630 from jackylk/remove and squashes the following commits:

150e7e0 [Jacky Li] remove unnecessary import
2014-12-08 17:27:46 -08:00
Michael Armbrust f5801e813f [SPARK-4753][SQL] Use catalyst for partition pruning in newParquet.
Author: Michael Armbrust <michael@databricks.com>

Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits:

4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
2014-12-04 22:25:21 -08:00
Michael Armbrust 513ef82e85 [SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive
This is a very small fix that catches one specific exception and returns an empty table.  #3441 will address this in a more principled way.

Author: Michael Armbrust <michael@databricks.com>

Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits:

2781d9f [Michael Armbrust] Handle empty lists for newParquet
04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
2014-12-03 14:13:35 -08:00
YanTangZhai 1066427600 [SPARK-4676][SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
```scala
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
```
Then the error is thrown as follows:
```
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
```

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2014-12-02 14:15:12 -08:00
baishuo 69b6fed206 [SPARK-4663][sql]add finally to avoid resource leak
Author: baishuo <vc_java@hotmail.com>

Closes #3526 from baishuo/master-trycatch and squashes the following commits:

d446e14 [baishuo] correct the code style
b36bf96 [baishuo] correct the code style
ae0e447 [baishuo] add finally to avoid resource leak
2014-12-02 12:12:03 -08:00
Reynold Xin b1f8fe316a Indent license header properly for interfaces.scala.
A very small nit update.

Author: Reynold Xin <rxin@databricks.com>

Closes #3552 from rxin/license-header and squashes the following commits:

df8d1a4 [Reynold Xin] Indent license header properly for interfaces.scala.
2014-12-02 11:59:15 -08:00
wangfei 7b79957879 [SQL] Minor fix for doc and comment
Author: wangfei <wangfei1@huawei.com>

Closes #3533 from scwf/sql-doc1 and squashes the following commits:

962910b [wangfei] doc and comment fix
2014-12-01 14:02:02 -08:00
ravipesala bc353819cc [SPARK-4658][SQL] Code documentation issue in DDL of datasource API
Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3516 from ravipesala/ddl_doc and squashes the following commits:

d101fdf [ravipesala] Style issues fixed
d2238cd [ravipesala] Corrected documentation
2014-12-01 13:31:27 -08:00
Jacky Li bafee67eba [SQL] add @group tab in limit() and count()
group tab is missing for scaladoc

Author: Jacky Li <jacky.likun@gmail.com>

Closes #3458 from jackylk/patch-7 and squashes the following commits:

0121a70 [Jacky Li] add @group tab in limit() and count()
2014-12-01 13:12:30 -08:00
Davies Liu 6cf507685e [SPARK-4548] [SPARK-4517] improve performance of python broadcast
Re-implement the Python broadcast using file:

1) Serialize the python object using cPickle and write it to disk.
2) Create a wrapper in the JVM (for the dumped file); it reads data from the file during serialization.
3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to executors.
4) During deserialization, write the data to disk.
5) Pass the path to the Python worker, which reads the data from disk and unpickles it into a python object on first access.

It fixes the performance regression introduced in #2659, has performance similar to 1.1, but supports objects larger than 2G and also improves memory efficiency (only one compressed copy in the driver and each executor).

Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):

name | 1.1 | 1.2 with this patch | improvement
---|---|---|---
python-broadcast-w-bytes | 25.20 | 9.33 | 170.13%
python-broadcast-w-set | 4.13 | 4.50 | -8.35%

Testing with 100 tasks (16 CPUs):

name | 1.1 | 1.2 with this patch | improvement
---|---|---|---
python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
python-broadcast-w-set | 23.29 | 9.59 | 142.80%

Author: Davies Liu <davies@databricks.com>

Closes #3417 from davies/pybroadcast and squashes the following commits:

50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast
2014-11-24 17:17:03 -08:00
Cheng Lian a6d7b61f92 [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on
This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,

1. also bypass RDD elements buffering as buffering is the reason that `MutableRow` backed row objects must be copied, and
2. avoids defensive copies in `Exchange` operator

Author: Cheng Lian <lian@databricks.com>

Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits:

591f2e9 [Cheng Lian] Passes all shuffle suites
0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
ed5df3c [Cheng Lian] Fixes styling changes
f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on
2014-11-24 12:43:45 -08:00
Michael Armbrust 02ec058efe [SPARK-4413][SQL] Parquet support through datasource API
Goals:
 - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
 - Support for folder based partitioning with automatic discovery of available partitions
 - Caching of file metadata

See scaladoc of `ParquetRelation2` for more details.
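
Registering a Parquet table through the data source API (without Hive) then looks roughly like this (hypothetical path):

```scala
// Partitions laid out as subfolders (e.g. key=1/, key=2/) under the path are discovered automatically.
sqlContext.sql("""
  CREATE TEMPORARY TABLE events
  USING org.apache.spark.sql.parquet
  OPTIONS (path '/hypothetical/warehouse/events')
""")
sqlContext.sql("SELECT COUNT(*) FROM events").collect()
```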

Author: Michael Armbrust <michael@databricks.com>

Closes #3269 from marmbrus/newParquet and squashes the following commits:

1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
645768b [Michael Armbrust] Review comments.
abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
e9d2641 [Michael Armbrust] logging / formatting improvements.
2014-11-20 18:31:02 -08:00
Jacky Li ad5f1f3ca2 [SQL] fix function description mistake
Sample code in the description of SchemaRDD.where is not correct

Author: Jacky Li <jacky.likun@gmail.com>

Closes #3344 from jackylk/patch-6 and squashes the following commits:

62cd126 [Jacky Li] [SQL] fix function description mistake
2014-11-20 15:48:36 -08:00
Takuya UESHIN 2c2e7a44db [SPARK-4318][SQL] Fix empty sum distinct.
Executing sum distinct on an empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.
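
For example (hypothetical empty table), the following used to hit that exception and should now complete:

```scala
// Previously threw java.lang.UnsupportedOperationException: empty.reduceLeft
sqlContext.sql("SELECT SUM(DISTINCT value) FROM emptyTable").collect()
```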

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:

8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
a5a57d2 [Takuya UESHIN] Fix empty Average.
22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
2014-11-20 15:41:24 -08:00
Dan McClary b8e6886fb8 [SPARK-4228][SQL] SchemaRDD to JSON
Here's a simple fix for SchemaRDD to JSON.
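
Usage on the Scala side is a one-liner (sketch):

```scala
// Convert each row of a SchemaRDD back into a JSON string.
val jsonStrings: org.apache.spark.rdd.RDD[String] = schemaRDD.toJSON
jsonStrings.take(5).foreach(println)
```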

Author: Dan McClary <dan.mcclary@gmail.com>

Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:

d714e1d [Dan McClary] fixed PEP 8 error
cac2879 [Dan McClary] move pyspark comment and doctest to correct location
f9471d3 [Dan McClary] added pyspark doc and doctest
6598cee [Dan McClary] adding complex type queries
1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
4a651f0 [Dan McClary] cleaned PEP and Scala style failures.  Moved tests to JsonSuite
47ceff6 [Dan McClary] cleaned up scala style issues
2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
1b11980 [Dan McClary] Map and UserDefinedTypes partially done
11d2016 [Dan McClary] formatting and unicode deserialization default fixed
6af72d1 [Dan McClary] deleted extaneous comment
4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
149dafd [Dan McClary] wrapped scala toJSON in sql.py
5e5eb1b [Dan McClary] switched to Jackson for JSON processing
6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
1d171aa [Dan McClary] upated missing brace on if statement
319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
424f130 [Dan McClary] tests pass, ready for pull and PR
626a5b1 [Dan McClary] added toJSON to SchemaRDD
f7d166a [Dan McClary] added toJSON method
5d34e37 [Dan McClary] merge resolved
d6d19e9 [Dan McClary] pr example
2014-11-20 13:44:19 -08:00
Cheng Lian abf29187f0 [SPARK-3938][SQL] Names in-memory columnar RDD with corresponding table name
This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.

Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:

```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```

The Web UI only shows "In-memory table first".

Author: Cheng Lian <lian@databricks.com>

Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits:

071907f [Cheng Lian] Fixes tests
12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
2014-11-20 13:12:24 -08:00
Cheng Lian 423baea953 [SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with literals on the left hand side
For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.
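
The essence of the fix is that when the literal sits on the left, the generated comparison has to be mirrored; a toy sketch of that flip (illustrative only, not the actual ParquetFilters code):

```scala
// A Catalyst predicate `lit < col` must become a "col > lit" filter, not "col < lit".
sealed trait ColumnFilter
case class LtFilter(column: String, value: Int) extends ColumnFilter // col < value
case class GtFilter(column: String, value: Int) extends ColumnFilter // col > value

def lessThan(left: Either[Int, String], right: Either[Int, String]): Option[ColumnFilter] =
  (left, right) match {
    case (Right(col), Left(lit)) => Some(LtFilter(col, lit)) // col < lit
    case (Left(lit), Right(col)) => Some(GtFilter(col, lit)) // lit < col  ==>  col > lit (the fixed case)
    case _                       => None                     // two literals or two columns: not handled here
  }
```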

(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)

Author: Cheng Lian <lian@databricks.com>

Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation
2014-11-18 17:41:54 -08:00
Davies Liu 4a377aff2d [SPARK-3721] [PySpark] broadcast objects larger than 2G
This patch will bring support for broadcasting objects larger than 2G.

pickle, zlib, FrameSerializer and Array[Byte] all cannot support objects larger than 2G, so this patch introduces LargeObjectSerializer to serialize broadcast objects; the object is serialized and compressed into small chunks. It also changes the type Broadcast[Array[Byte]] into Broadcast[Array[Array[Byte]]].

Testing broadcast of objects larger than 2G is slow and memory-hungry, so this is tested manually; it could be added to SparkPerf.

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2659 from davies/huge and squashes the following commits:

7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
2014-11-18 16:17:51 -08:00
Cheng Lian 36b0956a3e [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (L213-L228))]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filter, and complicate the code base a lot.

The basic idea of this PR is that, we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.
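
In other words, filter generation itself doubles as the pushdown check; roughly this shape (toy types, simplified sketch):

```scala
// Toy stand-ins for Catalyst expressions and Parquet filters, just to show the Option-based shape.
sealed trait Expr
case class GreaterThan(column: String, value: Int) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Unsupported(description: String) extends Expr

sealed trait PFilter
case class PGt(column: String, value: Int) extends PFilter
case class PAnd(left: PFilter, right: PFilter) extends PFilter

// Some(filter) means "pushed down"; None means "leave the predicate to Spark's own Filter operator".
def createFilter(e: Expr): Option[PFilter] = e match {
  case GreaterThan(c, v) => Some(PGt(c, v))
  case And(l, r)         => for (lf <- createFilter(l); rf <- createFilter(r)) yield PAnd(lf, rf)
  case _                 => None
}
```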

Author: Cheng Lian <lian@databricks.com>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
2014-11-17 16:55:12 -08:00
Cheng Lian 5ce7dae859 [SQL] Makes conjunction pushdown more aggressive for in-memory table
This is inspired by the [Parquet record filter generation code](64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala (L387-L400)).

Author: Cheng Lian <lian@databricks.com>

Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits:

78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive
2014-11-17 15:33:13 -08:00
Michael Armbrust 64c6b9bad5 [SPARK-4410][SQL] Add support for external sort
Adds a new operator that uses Spark's `ExternalSort` class.  It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance.

Author: Michael Armbrust <michael@databricks.com>

Closes #3268 from marmbrus/externalSort and squashes the following commits:

48b9726 [Michael Armbrust] comments
b98799d [Michael Armbrust] Add test
afd7562 [Michael Armbrust] Add support for external sort.
2014-11-16 21:55:57 -08:00
Jim Carroll 37482ce5a7 [SPARK-4412][SQL] Fix Spark's control of Parquet logging.
The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect.

ParquetRelation.scala attempts to override the parquet logger, but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log, there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by this static initializer.

The "fix" would be to force the static initializer to get called in parquet.Log as part of enableForwardLogging.

Author: Jim Carroll <jim@dontcallme.com>

Closes #3271 from jimfcarroll/parquet-logging and squashes the following commits:

37bdff7 [Jim Carroll] Fix Spark's control of Parquet logging.
2014-11-14 15:33:21 -08:00
Yash Datta 63ca3af66f [SPARK-4365][SQL] Remove unnecessary filter call on records returned from parquet library
Since the Parquet library has been updated, we no longer need to filter the records returned from the library for null records, as the library now skips those:

From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:

```java
public boolean nextKeyValue() throws IOException, InterruptedException {
  boolean recordFound = false;
  while (!recordFound) {
    // no more records left
    if (current >= total) { return false; }
    try {
      checkRead();
      currentValue = recordReader.read();
      current++;
      if (recordReader.shouldSkipCurrentRecord()) {
        // this record is being filtered via the filter2 package
        if (DEBUG) LOG.debug("skipping record");
        continue;
      }
      if (currentValue == null) {
        // only happens with FilteredRecordReader at end of block
        current = totalCountLoadedSoFar;
        if (DEBUG) LOG.debug("filtered record reader reached end of block");
        continue;
      }

      recordFound = true;
      if (DEBUG) LOG.debug("read value: " + currentValue);
    } catch (RuntimeException e) {
      throw new ParquetDecodingException(
          format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
    }
  }
  return true;
}
```

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #3229 from saucam/remove_filter and squashes the following commits:

8909ae9 [Yash Datta] SPARK-4365: Remove unnecessary filter call on records returned from parquet library
2014-11-14 15:16:40 -08:00
Jim Carroll f76b968370 [SPARK-4386] Improve performance when writing Parquet files.
If you profile the writing of a Parquet file, the single worst time consuming call inside of org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually in the scala.collection.AbstractSequence.size call. This is because the size call actually ends up COUNTING the elements in a scala.collection.LinearSeqOptimized.length ("optimized?").

This doesn't need to be done: `size` is currently called repeatedly wherever it is needed, rather than being called once at the top of the method and stored in a `val`.
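
The fix amounts to hoisting the repeated call into a local `val`, in the spirit of (illustrative, hypothetical names):

```scala
// On a LinearSeq, .size walks the whole list, so compute it once instead of on every iteration.
val fields: List[Any] = List(1, "a", 3.0) // stands in for the row being written
val fieldCount = fields.size
var i = 0
while (i < fieldCount) {
  // writeValue(fields(i)) ... (hypothetical write call)
  i += 1
}
```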

Author: Jim Carroll <jim@dontcallme.com>

Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits:

30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.
2014-11-14 15:11:53 -08:00
Michael Armbrust 4b4b50c9e5 [SQL] Don't shuffle code generated rows
When sort-based shuffle and code gen are on, we were trying to ship the code-generated rows during a shuffle. This doesn't work because the classes don't exist on the other side. Instead, we now copy into a generic row before shipping.

Author: Michael Armbrust <michael@databricks.com>

Closes #3263 from marmbrus/aggCodeGen and squashes the following commits:

f6ba8cf [Michael Armbrust] fix and test
2014-11-14 15:03:23 -08:00
Michael Armbrust e47c387639 [SPARK-4391][SQL] Configure parquet filters using SQLConf
This is more uniform with the rest of SQL configuration and allows it to be turned on and off without restarting the SparkContext.  In this PR I also turn off filter pushdown by default due to a number of outstanding issues (in particular SPARK-4258).  When those are fixed we should turn it back on by default.
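
With the setting in SQLConf, pushdown can be toggled per context, e.g. (key name assumed):

```scala
// Turn Parquet filter pushdown back on for this SQLContext only (it is off by default in this patch).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
```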

Author: Michael Armbrust <michael@databricks.com>

Closes #3258 from marmbrus/parquetFilters and squashes the following commits:

5655bfe [Michael Armbrust] Remove extra line.
15e9a98 [Michael Armbrust] Enable filters for tests
75afd39 [Michael Armbrust] Fix comments
78fa02d [Michael Armbrust] off by default
e7f9e16 [Michael Armbrust] First draft of correctly configuring parquet filter pushdown
2014-11-14 14:59:35 -08:00
Michael Armbrust 77e845ca77 [SPARK-4394][SQL] Data Sources API Improvements
This PR adds two features to the data sources API:
 - Support for pushing down `IN` filters
 - The ability for relations to optionally provide information about their `sizeInBytes`.

Author: Michael Armbrust <michael@databricks.com>

Closes #3260 from marmbrus/sourcesImprovements and squashes the following commits:

9a5e171 [Michael Armbrust] Use method instead of configuration directly
99c0e6b [Michael Armbrust] Add support for sizeInBytes.
416f167 [Michael Armbrust] Support for IN in data sources API.
2a04ab3 [Michael Armbrust] Simplify implementation of InSet.
2014-11-14 12:00:08 -08:00
Cheng Hao c764d0ac1c [SPARK-4274] [SQL] Fix NPE in printing the details of the query plan
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3139 from chenghao-intel/comparison_test and squashes the following commits:

f5d7146 [Cheng Hao] avoid exception in printing the codegen enabled
2014-11-10 17:46:05 -08:00
Daoyuan Wang a1fc059b69 [SPARK-4149][SQL] ISO 8601 support for json date time strings
This implements the feature davies mentioned in https://github.com/apache/spark/pull/2901#discussion-diff-19313312

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3012 from adrian-wang/iso8601 and squashes the following commits:

50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support
2014-11-10 17:26:03 -08:00
Xiangrui Meng d793d80c80 [SQL] remove a decimal case branch that has no effect at runtime
it generates warnings at compile time marmbrus

Author: Xiangrui Meng <meng@databricks.com>

Closes #3192 from mengxr/dtc-decimal and squashes the following commits:

955e9fb [Xiangrui Meng] remove a decimal case branch that has no effect
2014-11-10 17:20:52 -08:00
Sean Owen f8e5732307 SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619

I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test`. I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version=1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.

So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away.

This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but, does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x it seems.
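
The run-time reflection trick amounts to making the constructor accessible before invoking it, roughly (generic illustration, not the Spark code):

```scala
// Look up a declared constructor and force it to be callable at run time.
// As noted above, a SecurityManager could veto setAccessible.
val ctor = Class.forName("java.lang.StringBuilder").getDeclaredConstructor(classOf[String])
ctor.setAccessible(true)
val instance = ctor.newInstance("hello")
```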

Author: Sean Owen <sowen@cloudera.com>

Closes #3048 from srowen/SPARK-1209 and squashes the following commits:

0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions
466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
2014-11-09 22:11:20 -08:00
Kousuke Saruta 14c54f1876 [SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators
The following description is quoted from JIRA:

When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error:
scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters.
To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB):

    create table sparkbug (
    id int,
    event string
    ) stored as parquet;

Insert some sample data:

    insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
    insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;

Launch a spark shell and create a HiveContext to the metastore where the table above is located.

    import org.apache.spark.sql._
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext
    val hc = new HiveContext(sc)
    hc.setConf("spark.sql.shuffle.partitions", "10")
    hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
    hc.setConf("spark.sql.parquet.compression.codec", "snappy")
    import hc._
    hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'")

A scala.MatchError will appear in the output.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3083 from sarutak/SPARK-4213 and squashes the following commits:

4ab6e56 [Kousuke Saruta] WIP
b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213
9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings
2014-11-07 11:56:40 -08:00
Xiangrui Meng 3d2b5bc5bb [SPARK-4262][SQL] add .schemaRDD to JavaSchemaRDD
marmbrus

Author: Xiangrui Meng <meng@databricks.com>

Closes #3125 from mengxr/SPARK-4262 and squashes the following commits:

307695e [Xiangrui Meng] add .schemaRDD to JavaSchemaRDD
2014-11-05 19:56:16 -08:00
Davies Liu e4f42631a6 [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default.
This PR simplifies the serializer: always use a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.

Author: Davies Liu <davies@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>

Closes #2920 from davies/fix_autobatch and squashes the following commits:

e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
2014-11-03 23:56:14 -08:00