ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Nong Li	21112e8a14	[SPARK-12992] [SQL] Update parquet reader to support more types when decoding to ColumnarBatch. This patch implements support for more types when doing the vectorized decode. There are a few more types remaining but they should be very straightforward after this. This code has a few copy and paste pieces but they are difficult to eliminate due to performance considerations. Specifically, this patch adds support for: - String, Long, Byte types - Dictionary encoding for those types. Author: Nong Li <nong@databricks.com> Closes #10908 from nongli/spark-12992.	2016-02-02 16:33:21 -08:00
Wenchen Fan	672032d0ab	[SPARK-13020][SQL][TEST] fix random generator for map type When we generate a map, we first randomly pick a length, then create a seq of key-value pairs with the expected length, and finally call `toMap`. However, `toMap` removes all duplicated keys, which can make the actual map size much smaller than expected. This PR fixes this problem by putting the keys in a set first, to guarantee we have enough keys to build a map with the expected length. Author: Wenchen Fan <wenchen@databricks.com> Closes #10930 from cloud-fan/random-generator.	2016-02-03 08:26:35 +08:00
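The difference between the two approaches can be sketched in a few lines of Scala. This is only an illustration under stated assumptions, not the actual test-generator code: `RandomMapSketch`, `naiveMap`, `fixedMap` and the bounded integer key range are all hypothetical.

```scala
import scala.collection.mutable
import scala.util.Random

// Hypothetical sketch; names and the Int key/value types are assumptions.
object RandomMapSketch {
  // Naive approach: duplicate keys collapse in `toMap`,
  // so the result may hold fewer than `length` entries.
  def naiveMap(rand: Random, length: Int, keyRange: Int): Map[Int, Int] =
    Seq.fill(length)(rand.nextInt(keyRange) -> rand.nextInt()).toMap

  // Fixed approach: collect `length` distinct keys in a set first,
  // then pair each key with a random value.
  def fixedMap(rand: Random, length: Int, keyRange: Int): Map[Int, Int] = {
    require(length <= keyRange, "not enough distinct keys available")
    val keys = mutable.LinkedHashSet.empty[Int]
    while (keys.size < length) keys += rand.nextInt(keyRange)
    keys.toSeq.map(k => k -> rand.nextInt()).toMap
  }
}
```

With a small key range, `naiveMap` will usually come back short, while `fixedMap` always returns exactly `length` entries.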
Davies Liu	6de6a97728	[SPARK-13150] [SQL] disable two flaky tests Author: Davies Liu <davies@databricks.com> Closes #11037 from davies/disable_flaky.	2016-02-02 16:24:31 -08:00
Kevin (Sangwoo) Kim	b377b03531	[DOCS] Update StructType.scala The example will throw error like <console>:20: error: not found: value StructType Need to add this line: import org.apache.spark.sql.types._ Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com> Closes #10141 from swkimme/patch-1.	2016-02-02 13:24:21 -08:00
Davies Liu	be5dd881f1	[SPARK-12913] [SQL] Improve performance of stat functions As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294, a declarative aggregate function benefits from codegen and could be much faster than an imperative one. Author: Davies Liu <davies@databricks.com> Closes #10960 from davies/stddev.	2016-02-02 11:50:14 -08:00
Reynold Xin	7f6e3ec79b	[SPARK-13138][SQL] Add "logical" package prefix for ddl.scala ddl.scala is defined in the execution package, and yet its reference of "UnaryNode" and "Command" are logical. This was fairly confusing when I was trying to understand the ddl code. Author: Reynold Xin <rxin@databricks.com> Closes #11021 from rxin/SPARK-13138.	2016-02-02 11:29:20 -08:00
Daoyuan Wang	358300c795	[SPARK-13056][SQL] map column would throw NPE if value is null Jira: https://issues.apache.org/jira/browse/SPARK-13056 Create a map like { "a": "somestring", "b": null} Query like SELECT col["b"] FROM t1; NPE would be thrown. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #10964 from adrian-wang/npewriter.	2016-02-02 11:09:40 -08:00
hyukjinkwon	b93830126c	[SPARK-13114][SQL] Add a test for tokens more than the fields in schema https://issues.apache.org/jira/browse/SPARK-13114 This PR adds a test for tokens more than the fields in schema. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11020 from HyukjinKwon/SPARK-13114.	2016-02-02 10:41:06 -08:00
Michael Armbrust	29d92181d0	[SPARK-13094][SQL] Add encoders for seq/array of primitives Author: Michael Armbrust <michael@databricks.com> Closes #11014 from marmbrus/seqEncoders.	2016-02-02 10:15:40 -08:00
Michael Armbrust	12a20c144f	[SPARK-10820][SQL] Support for the continuous execution of structured queries This is a follow up to `9aadcffabd` that extends Spark SQL to allow users to _repeatedly_ optimize and execute structured queries. A `ContinuousQuery` can be expressed using SQL, DataFrames or Datasets. The purpose of this PR is only to add some initial infrastructure which will be extended in subsequent PRs. ## User-facing API - `sqlContext.streamFrom` and `df.streamTo` return builder objects that are analogous to the `read/write` interfaces already available to executing queries in a batch-oriented fashion. - `ContinuousQuery` provides an interface for interacting with a query that is currently executing in the background. ## Internal Interfaces - `StreamExecution` - executes streaming queries in micro-batches The following are currently internal, but public APIs will be provided in a future release. - `Source` - an interface for providers of continually arriving data. A source must have a notion of an `Offset` that monotonically tracks what data has arrived. For fault tolerance, a source must be able to replay data given a start offset. - `Sink` - an interface that accepts the results of a continuously executing query. Also responsible for tracking the offset that should be resumed from in the case of a failure. ## Testing - `MemoryStream` and `MemorySink` - simple implementations of source and sink that keep all data in memory and have methods for simulating durability failures - `StreamTest` - a framework for performing actions and checking invariants on a continuous query Author: Michael Armbrust <michael@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Josh Rosen <rosenville@gmail.com> Closes #11006 from marmbrus/structured-streaming.	2016-02-02 10:13:54 -08:00
Michael Armbrust	22ba21348b	[SPARK-13087][SQL] Fix group by function for sort based aggregation It is not valid to call `toAttribute` on a `NamedExpression` unless we know for sure that the child produced that `NamedExpression`. The current code worked fine when the grouping expressions were simple, but when they were a derived value this blew up at execution time. Author: Michael Armbrust <michael@databricks.com> Closes #11013 from marmbrus/groupByFunction-master.	2016-02-02 16:48:59 +08:00
Reynold Xin	0fff5c6e63	[SPARK-13130][SQL] Make codegen variable names easier to read 1. Use lower case 2. Change long prefixes to something shorter (in this case I am changing only one: TungstenAggregate -> agg). Author: Reynold Xin <rxin@databricks.com> Closes #11017 from rxin/SPARK-13130.	2016-02-01 23:08:11 -08:00
Reynold Xin	be7a2fc071	[SPARK-13078][SQL] API and test cases for internal catalog This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper). I took a look at Hive's internal metastore interface/implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality. Author: Reynold Xin <rxin@databricks.com> Closes #10982 from rxin/SPARK-13078.	2016-02-01 14:11:52 -08:00
Jacek Laskowski	a2973fed30	Fix for [SPARK-12854][SQL] Implement complex types support in ColumnarBatch Fixes build for Scala 2.11. Author: Jacek Laskowski <jacek@japila.pl> Closes #10946 from jaceklaskowski/SPARK-12854-fix.	2016-02-01 13:57:48 -08:00
Nong Li	064b029c6a	[SPARK-13043][SQL] Implement remaining catalyst types in ColumnarBatch. This includes: float, boolean, short, decimal and calendar interval. Decimal is mapped to long or byte array depending on the size and calendar interval is mapped to a struct of int and long. The only remaining type is map. The schema mapping is straightforward but we might want to revisit how we deal with this in the rest of the execution engine. Author: Nong Li <nong@databricks.com> Closes #10961 from nongli/spark-13043.	2016-02-01 13:56:14 -08:00
gatorsmile	8f26eb5ef6	[SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferences JIRA: https://issues.apache.org/jira/browse/SPARK-12705 Scope: This PR is a general fix for sorting reference resolution when the child's `outputSet` does not have the order-by attributes (called missing attributes): - UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`. - We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains it. General Reference Resolution Rules: - Jump over the nodes with the following types: `Distinct`, `Filter`, `RepartitionByExpression`. Do not need to add missing attributes. The reason is their `outputSet` is decided by their `inputSet`, which is the `outputSet` of their children. - Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added into group-by expressions since it will change the query result. Thus, in RDBMS, it is not allowed. - Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contain the missing attributes but the aggregate expressions do not have them, just add them into the aggregate expressions. This can resolve the analysisExceptions thrown by the three TPC-DS queries. - `Project` and `Window` are special. We just need to add the missing attributes to their `projectList`. Implementation: 1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes. 2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to the node if their `inputSet` contains the attributes. 3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node. Risk: Low. This rule will be triggered iff ```!s.resolved && child.resolved``` is true. Thus, very few cases are affected. 
Author: gatorsmile <gatorsmile@gmail.com> Closes #10678 from gatorsmile/sortWindows.	2016-02-01 11:57:13 -08:00
gatorsmile	33c8a490f7	[SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressions JIRA: https://issues.apache.org/jira/browse/SPARK-12989 In the rule `ExtractWindowExpressions`, we simply replace alias by the corresponding attribute. However, this will cause an issue exposed by the following case: ```scala val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num") .withColumn("Data", struct("A", "B", "C")) .drop("A") .drop("B") .drop("C") val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc) data.select($"*", max("num").over(winSpec) as "max").explain(true) ``` In this case, both `Data.A` and `Data.B` are `alias` in `WindowSpecDefinition`. If we replace these alias expression by their alias names, we are unable to know what they are since they will not be put in `missingExpr` too. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10963 from gatorsmile/seletStarAfterColDrop.	2016-02-01 11:22:02 -08:00
Wenchen Fan	c1da4d421a	[SPARK-13093] [SQL] improve null check in nullSafeCodeGen for unary, binary and ternary expression The current implementation is sub-optimal: * If an expression is always nullable, e.g. `Unhex`, we can still remove null check for children if they are not nullable. * If an expression has some non-nullable children, we can still remove null check for these children and keep null check for others. This PR improves this by making the null check elimination more fine-grained. Author: Wenchen Fan <wenchen@databricks.com> Closes #10987 from cloud-fan/null-check.	2016-01-31 22:43:03 -08:00
Herman van Hovell	5a8b978fab	[SPARK-13049] Add First/last with ignore nulls to functions.scala This PR adds the ability to specify the ```ignoreNulls``` option to the functions dsl, e.g: ```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other"))``` This PR is somewhere between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport to 1.6. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10957 from hvanhovell/SPARK-13049.	2016-01-31 13:56:13 -08:00
Liang-Chi Hsieh	0e6d92d042	[SPARK-12689][SQL] Migrate DDL parsing to the newly absorbed parser JIRA: https://issues.apache.org/jira/browse/SPARK-12689 DDLParser processes three commands: createTable, describeTable and refreshTable. This patch migrates the three commands to newly absorbed parser. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10723 from viirya/migrate-ddl-describe.	2016-01-30 23:05:29 -08:00
Cheng Lian	a1303de0a0	[SPARK-13070][SQL] Better error message when Parquet schema merging fails Make sure we throw better error messages when Parquet schema merging fails. Author: Cheng Lian <lian@databricks.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10979 from viirya/schema-merging-failure-message.	2016-01-30 23:02:49 -08:00
wangyang	de28371998	[SPARK-13100][SQL] improving the performance of stringToDate method in DateTimeUtils.scala In jdk1.7 TimeZone.getTimeZone() is synchronized, so use an instance variable to hold a GMT TimeZone object instead of instantiating it every time. Author: wangyang <wangyang@haizhi.com> Closes #10994 from wangyang1992/datetimeUtil.	2016-01-30 15:20:57 -08:00
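The caching idea can be sketched as follows. This is a minimal illustration, not the actual `DateTimeUtils` code: `GmtDateSketch` and `daysSinceEpoch` are hypothetical names, and the conversion itself is only there to give the cached zone something to do.

```scala
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hypothetical sketch; object and method names are assumptions.
object GmtDateSketch {
  // Looked up once and reused, instead of calling
  // TimeZone.getTimeZone("GMT") on every conversion.
  private val Gmt: TimeZone = TimeZone.getTimeZone("GMT")
  private val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Days between 1970-01-01 and the given calendar date, both in GMT.
  def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
    val calendar = new GregorianCalendar(Gmt)
    calendar.clear()                   // reset time-of-day fields to zero
    calendar.set(year, month - 1, day) // Calendar months are 0-based
    (calendar.getTimeInMillis / MillisPerDay).toInt
  }
}
```

Each call still creates a calendar, but the synchronized time zone lookup happens only once, when the object is initialized.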
Josh Rosen	289373b28c	[SPARK-6363][BUILD] Make Scala 2.11 the default Scala version This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds). The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance). After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break. Author: Josh Rosen <joshrosen@databricks.com> Closes #10608 from JoshRosen/SPARK-6363.	2016-01-30 00:20:28 -08:00
Wenchen Fan	dab246f7e4	[SPARK-13098] [SQL] remove GenericInternalRowWithSchema This class is only used for serialization of Python DataFrame. However, we don't require internal row there, so `GenericRowWithSchema` can also do the job. Author: Wenchen Fan <wenchen@databricks.com> Closes #10992 from cloud-fan/python.	2016-01-29 23:37:51 -08:00
Davies Liu	e6a02c66d5	[SPARK-12914] [SQL] generate aggregation with grouping keys This PR adds support for grouping keys for generated TungstenAggregate. Spilling and performance improvements for BytesToBytesMap will be done by a follow-up PR. Author: Davies Liu <davies@databricks.com> Closes #10855 from davies/gen_keys.	2016-01-29 20:16:11 -08:00
Andrew Or	12252d1da9	[SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g. ``` sc.textFile(..., 4).coalesce(2).count() ``` we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`. For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark. This was caused by #10835. Author: Andrew Or <andrew@databricks.com> Closes #10973 from andrewor14/fix-input-metrics-coalesce.	2016-01-29 18:03:08 -08:00
Reynold Xin	2cbc412821	[SPARK-13076][SQL] Rename ClientInterface -> HiveClient And ClientWrapper -> HiveClientImpl. I have some followup pull requests to introduce a new internal catalog, and I think this new naming reflects better the functionality of the two classes. Author: Reynold Xin <rxin@databricks.com> Closes #10981 from rxin/SPARK-13076.	2016-01-29 16:57:34 -08:00
Andrew Or	e38b0baa38	[SPARK-13055] SQLHistoryListener throws ClassCastException This is an existing issue uncovered recently by #10835. The reason for the exception was because the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets the `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, then it proceeds to cast the Int to a Long, which fails. The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`. Author: Andrew Or <andrew@databricks.com> Closes #10971 from andrewor14/fix-sql-history.	2016-01-29 13:45:03 -08:00
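The failure mode and the fix can be sketched in plain Scala. This is an assumption-laden illustration, not the `SQLHistoryListener` code: `AccumUpdate`, its `isSqlMetric` flag (standing in for the internal metadata marker) and both methods are hypothetical.

```scala
// Hypothetical sketch; all names here are assumptions.
object SqlMetricFilterSketch {
  // Stand-in for an accumulator update; `isSqlMetric` plays the
  // role of the internal metadata that marks SQL metrics.
  final case class AccumUpdate(name: String, value: Any, isSqlMetric: Boolean)

  // Casting every value to Long throws ClassCastException
  // as soon as a boxed Int shows up.
  def sumAll(updates: Seq[AccumUpdate]): Long =
    updates.map(_.value.asInstanceOf[Long]).sum

  // Only values marked as SQL metrics are cast and processed.
  def sumSqlMetrics(updates: Seq[AccumUpdate]): Long =
    updates.filter(_.isSqlMetric).map(_.value.asInstanceOf[Long]).sum
}
```

Filtering on the marker before casting keeps non-SQL accumulators, whatever their value type, out of the SQL metric path.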
gatorsmile	5f686cc8b7	[SPARK-12656] [SQL] Implement Intersect with Left-semi Join Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566 Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10630 from gatorsmile/IntersectBySemiJoin.	2016-01-29 11:22:12 -08:00
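The equivalence the commit relies on (intersect = distinct over a left-semi join) can be sketched on plain Scala collections. This is not Catalyst code; `IntersectSketch` and its methods are hypothetical and ignore details such as null handling.

```scala
// Hypothetical sketch on in-memory sequences; names are assumptions.
object IntersectSketch {
  // Left-semi join on equality: keep each left row that has
  // at least one matching row on the right.
  def leftSemiJoin[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val rightKeys = right.toSet
    left.filter(rightKeys.contains)
  }

  // INTERSECT has set semantics, so duplicates are removed afterwards.
  def intersect[A](left: Seq[A], right: Seq[A]): Seq[A] =
    leftSemiJoin(left, right).distinct
}
```

For example, intersecting `Seq(1, 2, 2, 3)` with `Seq(2, 3, 3, 4)` first keeps `2, 2, 3` from the left side and then deduplicates to `2, 3`.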
Wenchen Fan	c5f745ede0	[SPARK-13072] [SQL] simplify and improve murmur3 hash expression codegen simplify(remove several unnecessary local variables) the generated code of hash expression, and avoid null check if possible. generated code comparison for `hash(int, double, string, array<string>)`: before: ``` public UnsafeRow apply(InternalRow i) { /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */ int value1 = 42; /* input[0, int] */ int value3 = i.getInt(0); if (!false) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); } /* input[1, double] */ double value5 = i.getDouble(1); if (!false) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1); } /* input[2, string] */ boolean isNull6 = i.isNullAt(2); UTF8String value7 = isNull6 ? null : (i.getUTF8String(2)); if (!isNull6) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1); } /* input[3, array<int>] */ boolean isNull8 = i.isNullAt(3); ArrayData value9 = isNull8 ? null : (i.getArray(3)); if (!isNull8) { int result10 = value1; for (int index11 = 0; index11 < value9.numElements(); index11++) { if (!value9.isNullAt(index11)) { final int element12 = value9.getInt(index11); result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10); } } value1 = result10; } } ``` after: ``` public UnsafeRow apply(InternalRow i) { /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */ int value1 = 42; /* input[0, int] */ int value3 = i.getInt(0); value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); /* input[1, double] */ double value5 = i.getDouble(1); value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1); /* input[2, string] */ boolean isNull6 = i.isNullAt(2); UTF8String value7 = isNull6 ? null : (i.getUTF8String(2)); if (!isNull6) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1); } /* input[3, array<int>] */ boolean isNull8 = i.isNullAt(3); ArrayData value9 = isNull8 ? null : (i.getArray(3)); if (!isNull8) { for (int index10 = 0; index10 < value9.numElements(); index10++) { final int element11 = value9.getInt(index10); value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1); } } rowWriter14.write(0, value1); return result12; } ``` Author: Wenchen Fan <wenchen@databricks.com> Closes #10974 from cloud-fan/codegen.	2016-01-29 10:24:23 -08:00
Davies Liu	55561e7693	[SPARK-13031][SQL] cleanup codegen and improve test coverage 1. enable whole stage codegen during tests even if there is only one operator that supports it. 2. split doProduce() into two APIs: upstream() and doProduce() 3. generate prefix for fresh names of each operator 4. pass UnsafeRow to parent directly (avoid getters and creating UnsafeRow again) 5. fix bugs and tests. This PR re-opens #10944 and fixes the bug. Author: Davies Liu <davies@databricks.com> Closes #10977 from davies/gen_refactor.	2016-01-29 01:59:59 -08:00
Wenchen Fan	721ced28b5	[SPARK-13067] [SQL] workaround for a weird scala reflection problem A simple workaround to avoid getting parameter types when converting a logical plan to json. Author: Wenchen Fan <wenchen@databricks.com> Closes #10970 from cloud-fan/reflection.	2016-01-28 22:43:03 -08:00
Liang-Chi Hsieh	66449b8dcd	[SPARK-12968][SQL] Implement command to set current database JIRA: https://issues.apache.org/jira/browse/SPARK-12968 Implement command to set current database. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10916 from viirya/ddl-use-database.	2016-01-28 22:20:52 -08:00
Davies Liu	b9dfdcc63b	Revert "[SPARK-13031] [SQL] cleanup codegen and improve test coverage" This reverts commit `cc18a71992`.	2016-01-28 17:01:12 -08:00
Liang-Chi Hsieh	4637fc08a3	[SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet JIRA: https://issues.apache.org/jira/browse/SPARK-11955 Currently we simply skip pushdowning filters in parquet if we enable schema merging. However, we can actually mark particular fields in merging schema for safely pushdowning filters in parquet. Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #9940 from viirya/safe-pushdown-parquet-filters.	2016-01-28 16:25:21 -08:00
Brandon Bradley	3a40c0e575	[SPARK-12749][SQL] add json option to parse floating-point types as DecimalType I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success. Added test for non-complex types. Should I add a test for complex types? Author: Brandon Bradley <bradleytastic@gmail.com> Closes #10936 from blbradley/spark-12749.	2016-01-28 15:25:57 -08:00
Davies Liu	cc18a71992	[SPARK-13031] [SQL] cleanup codegen and improve test coverage 1. enable whole stage codegen during tests even if there is only one operator that supports it. 2. split doProduce() into two APIs: upstream() and doProduce() 3. generate prefix for fresh names of each operator 4. pass UnsafeRow to parent directly (avoid getters and creating UnsafeRow again) 5. fix bugs and tests. Author: Davies Liu <davies@databricks.com> Closes #10944 from davies/gen_refactor.	2016-01-28 13:51:55 -08:00
Tejas Patil	676803963f	[SPARK-12926][SQL] SQLContext to display warning message when non-sql configs are being set Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work. eg. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done. Author: Tejas Patil <tejasp@fb.com> Closes #10849 from tejasapatil/SPARK-12926.	2016-01-28 13:45:28 -08:00
Cheng Lian	415d0a859b	[SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`. Author: Cheng Lian <lian@databricks.com> Closes #10968 from liancheng/cms-specialized.	2016-01-28 12:26:03 -08:00
Nong Li	4a09123212	[SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.Row These two classes became identical as the implementation progressed. Author: Nong Li <nong@databricks.com> Closes #10952 from nongli/spark-13045.	2016-01-27 15:35:31 -08:00
Herman van Hovell	ef96cd3c52	[SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser commands to new Parser This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands, this PR respects these commands (and passes them on to Hive). This PR and https://github.com/apache/spark/pull/10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst. The PR is marked WIP as long as it doesn't pass all tests. cc rxin viirya winningsix (this touches https://github.com/apache/spark/pull/10144) Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10905 from hvanhovell/SPARK-12866.	2016-01-27 13:45:00 -08:00
Wenchen Fan	680afabe78	[SPARK-12938][SQL] DataFrame API for Bloom filter This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs. This PR also adds 2 specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10937 from cloud-fan/bloom-filter.	2016-01-27 13:29:09 -08:00
Andrew Or	87abcf7df9	[SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts: SPARK-12895: Implement TaskMetrics using accumulators. TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver. SPARK-12896: Send only accumulator updates to the driver. Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620. While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here. Note: This was once part of #10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master. Author: Andrew Or <andrew@databricks.com> Closes #10835 from andrewor14/task-metrics-use-accums.	2016-01-27 11:15:48 -08:00
Jason Lee	edd473751b	[SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works. Author: Jason Lee <cjlee@us.ibm.com> Closes #8969 from jasoncl/SPARK-10847.	2016-01-27 09:55:10 -08:00
Cheng Lian	58f5d8c1da	[SPARK-12728][SQL] Integrates SQL generation with native view This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical. In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we fail to map the plan to SQL, we fall back to the original native view approach. One important issue this PR fixes is that we can now use a CTE when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, HiveQL parser doesn't allow a CTE appearing as a subquery. Namely, something like this is disallowed: ```sql SELECT n FROM ( WITH w AS (SELECT 1 AS n) SELECT * FROM w ) v ``` This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string). Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.	2016-01-26 20:30:13 -08:00
Cheng Lian	ce38a35b76	[SPARK-12935][SQL] DataFrame API for Count-Min Sketch This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs. Author: Cheng Lian <lian@databricks.com> Closes #10911 from liancheng/cms-df-api.	2016-01-26 20:12:34 -08:00
Nong Li	555127387a	[SPARK-12854][SQL] Implement complex types support in ColumnarBatch This patch adds support for complex types for ColumnarBatch. ColumnarBatch supports structs and arrays. There is a simple mapping between the richer catalyst types to these two. Strings are treated as an array of bytes. ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs just contain nullability. Arrays contain offsets and lengths into the child array. This structure is able to handle arbitrary nesting. It has the key property that we maintain columnar throughout and that primitive types are only stored in the leaf nodes and contiguous across rows. For example, if the schema is ``` array<array<int>> ``` There are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively. As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v) vs appendLong(v)). These APIs are necessary when the batch contains variable length elements. The vectors are not fixed length and will grow as necessary. This should make the usage a lot simpler for the writer. Author: Nong Li <nong@databricks.com> Closes #10820 from nongli/spark-12854.	2016-01-26 17:34:01 -08:00
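The three-column layout for `array<array<int>>` can be sketched with plain buffers. This is an illustration of the offsets-and-lengths idea only, not the `ColumnarBatch` implementation: the class name, `appendRow`, `getRow` and `leafData` are hypothetical, and nullability is left out.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of a column holding array<array<int>>; names are assumptions.
final class NestedIntArrayColumnSketch {
  // Leaf column: every int of every row, stored consecutively.
  private val leaf = ArrayBuffer.empty[Int]
  // Inner array column: offset and length into the leaf column.
  private val innerOffsets = ArrayBuffer.empty[Int]
  private val innerLengths = ArrayBuffer.empty[Int]
  // Outer array column: offset and length into the inner array column.
  private val outerOffsets = ArrayBuffer.empty[Int]
  private val outerLengths = ArrayBuffer.empty[Int]

  // Append-style write: buffers grow as variable-length rows arrive.
  def appendRow(row: Seq[Seq[Int]]): Unit = {
    outerOffsets += innerOffsets.size
    outerLengths += row.size
    row.foreach { inner =>
      innerOffsets += leaf.size
      innerLengths += inner.size
      leaf ++= inner
    }
  }

  // Rebuild one row by following offsets down to the leaf data.
  def getRow(rowId: Int): Seq[Seq[Int]] = {
    val start = outerOffsets(rowId)
    (start until start + outerLengths(rowId)).map { i =>
      val offset = innerOffsets(i)
      leaf.slice(offset, offset + innerLengths(i)).toList
    }
  }

  def leafData: Seq[Int] = leaf.toList
}
```

Appending by value rather than writing at a row id is what lets the writer ignore how much child storage earlier rows consumed.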
Cheng Lian	83507fea9f	[SQL] Minor Scaladoc format fix Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag. Author: Cheng Lian <lian@databricks.com> Closes #10926 from liancheng/agg-doc-fix.	2016-01-26 14:29:29 -08:00
Sameer Agarwal	08c781ca67	[SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL. Author: Sameer Agarwal <sameer@databricks.com> Closes #10826 from sameeragarwal/skip-hive-metadata.	2016-01-26 07:50:37 -08:00
Sean Owen	649e9d0f5b	[SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable. CC rxin pwendell for API change; tdas since it also touches streaming. Author: Sean Owen <sowen@cloudera.com> Closes #10413 from srowen/SPARK-3369.	2016-01-26 11:55:28 +00:00
Reynold Xin	d54cfed5a6	[SQL][MINOR] A few minor tweaks to CSV reader. This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc. Author: Reynold Xin <rxin@databricks.com> Closes #10919 from rxin/csv-minor.	2016-01-26 00:51:08 -08:00
Wenchen Fan	be375fcbd2	[SPARK-12879] [SQL] improve the unsafe row writing framework As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more documentation to it and make it easier to use. This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip updating the row size. Then we can apply this technique to more places easily. A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR: old version ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz unsafe projection: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- single long 2616.04 102.61 1.00 X single nullable long 3032.54 88.52 0.86 X primitive types 9121.05 29.43 0.29 X nullable primitive types 12410.60 21.63 0.21 X ``` new version ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz unsafe projection: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- single long 1533.34 175.07 1.00 X single nullable long 2306.73 116.37 0.66 X primitive types 8403.93 31.94 0.18 X nullable primitive types 12448.39 21.56 0.12 X ``` For a single non-nullable long (the best case), we get about a 1.7x speedup. Even if it's nullable, we still get a 1.3x speedup. For other cases the boost is smaller, as the saved operations make up only a small proportion of the whole process. The benchmark code is included in this PR. Author: Wenchen Fan <wenchen@databricks.com> Closes #10809 from cloud-fan/unsafe-projection.	2016-01-25 16:23:59 -08:00
gatorsmile	9348431da2	[SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of partitioning columns. For example, ``` df.write .format(source) .partitionBy("i") .bucketBy(8, "i", "k") .saveAsTable("bucketed_table") ``` However, in the above case, adding column `i` into `bucketBy` is useless. It is just wasting extra CPU when reading or writing bucket tables. Thus, like Hive, we can issue an exception and let users do the change. Also added a test case for checking if the information of `sortBy` and `bucketBy` columns are correctly saved in the metastore table. Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10891 from gatorsmile/commonKeysInPartitionByBucketBy.	2016-01-25 13:38:09 -08:00
Yin Huai	00026fa991	[SPARK-12901][SQL][HOT-FIX] Fix scala 2.11 compilation.	2016-01-25 12:59:11 -08:00
Davies Liu	7d877c3439	[SPARK-12902] [SQL] visualization for generated operators This PR brings back visualization for generated operators. They look like: ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png) ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png) Note: SQL metrics are not supported right now because they are very slow; they will be supported once we have batch mode. Author: Davies Liu <davies@databricks.com> Closes #10828 from davies/viz_codegen.	2016-01-25 12:44:20 -08:00
Andy Grove	d8e480521e	[SPARK-12932][JAVA API] improved error message for java type inference failure Author: Andy Grove <andygrove73@gmail.com> Closes #10865 from andygrove/SPARK-12932.	2016-01-25 09:22:10 +00:00
hyukjinkwon	3adebfc9a3	[SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case class and same format). https://issues.apache.org/jira/browse/SPARK-12901 This PR refactors the options in JSON and CSV datasources. In more details, 1. `JSONOptions` uses the same format as `CSVOptions`. 2. Not case classes. 3. `CSVRelation` that does not have to be serializable (it was `with Serializable` but I removed) Author: hyukjinkwon <gurwls223@gmail.com> Closes #10895 from HyukjinKwon/SPARK-12901.	2016-01-25 00:57:56 -08:00
Cheng Lian	3327fd2817	[SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`. Author: Cheng Lian <lian@databricks.com> Closes #10886 from liancheng/spark-12624.	2016-01-24 19:40:34 -08:00
Josh Rosen	f4004601b0	[SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive). This patch attempts to improve the isolation of these tests in order to address this issue. Author: Josh Rosen <joshrosen@databricks.com> Closes #10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.	2016-01-24 11:29:27 -08:00
Reynold Xin	423783a08b	[SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switch to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved DecimalPrecision rule into its own file due to the growing size. Author: Reynold Xin <rxin@databricks.com> Closes #10882 from rxin/SPARK-12904-1.	2016-01-23 12:13:05 -08:00
hyukjinkwon	5af5a02160	[SPARK-12872][SQL] Support to specify the option for compression codec for JSON datasource https://issues.apache.org/jira/browse/SPARK-12872 This PR allows the JSON datasource to compress output via an option instead of manually setting Hadoop configurations. Resolving the codec by name works similarly to https://github.com/apache/spark/pull/10805. As `CSVCompressionCodecs` can be shared with other datasources, it became a separate shared class, `CompressionCodecs`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10858 from HyukjinKwon/SPARK-12872.	2016-01-22 23:53:12 -08:00
gatorsmile	e13c147e74	[SPARK-12959][SQL] Writing Bucketed Data with Disabled Bucketing in SQLConf When users turn off bucketing in SQLConf, we should issue some messages to tell users these operations will be converted to normal way. Also added a test case for this scenario and fixed the helper function. Do you think this PR is helpful when using bucket tables? cloud-fan Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10870 from gatorsmile/bucketTableWritingTestcases.	2016-01-22 01:03:41 -08:00
Liang-Chi Hsieh	55c7dd031b	[SPARK-12747][SQL] Use correct type name for Postgres JDBC's real array https://issues.apache.org/jira/browse/SPARK-12747 Postgres JDBC driver uses "FLOAT4" or "FLOAT8" not "real". Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10695 from viirya/fix-postgres-jdbc.	2016-01-21 18:55:28 -08:00
Yin Huai	d60f8d74ac	[SPARK-8968] [SQL] [HOT-FIX] Fix scala 2.11 build.	2016-01-20 17:48:18 -08:00
wangfei	015c8efb37	[SPARK-8968][SQL] external sort by the partition columns when dynamic partitioning to optimize the memory overhead Currently the hash-based writer for dynamic partitioning performs poorly on big data and causes many small files and high GC. With this patch we do an external sort first, so that we only need to keep one writer open at a time. before this patch: ![gc](https://cloud.githubusercontent.com/assets/7018048/9149788/edc48c6e-3dec-11e5-828c-9995b56e4d65.PNG) after this patch: ![gc-optimize-externalsort](https://cloud.githubusercontent.com/assets/7018048/9149794/60f80c9c-3ded-11e5-8a56-7ae18ddc7a2f.png) Author: wangfei <wangfei_hello@126.com> Author: scwf <wangfei1@huawei.com> Closes #7336 from scwf/dynamic-optimize-basedon-apachespark.	2016-01-20 17:11:52 -08:00
Davies Liu	b362239df5	[SPARK-12797] [SQL] Generated TungstenAggregate (without grouping keys) As discussed in #10786, the generated TungstenAggregate does not support imperative functions. For a query ``` sqlContext.range(10).filter("id > 1").groupBy().count() ``` The generated code will look like: ``` /* 032 */ if (!initAgg0) { /* 033 */ initAgg0 = true; /* 034 */ /* 035 */ // initialize aggregation buffer /* 037 */ long bufValue2 = 0L; /* 038 */ /* 039 */ /* 040 */ // initialize Range /* 041 */ if (!range_initRange5) { /* 042 */ range_initRange5 = true; ... /* 071 */ } /* 072 */ /* 073 */ while (!range_overflow8 && range_number7 < range_partitionEnd6) { /* 074 */ long range_value9 = range_number7; /* 075 */ range_number7 += 1L; /* 076 */ if (range_number7 < range_value9 ^ 1L < 0) { /* 077 */ range_overflow8 = true; /* 078 */ } /* 079 */ /* 085 */ boolean primitive11 = false; /* 086 */ primitive11 = range_value9 > 1L; /* 087 */ if (!false && primitive11) { /* 092 */ // do aggregate and update aggregation buffer /* 099 */ long primitive17 = -1L; /* 100 */ primitive17 = bufValue2 + 1L; /* 101 */ bufValue2 = primitive17; /* 105 */ } /* 107 */ } /* 109 */ /* 110 */ // output the result /* 112 */ bufferHolder25.reset(); /* 114 */ rowWriter26.initialize(bufferHolder25, 1); /* 118 */ rowWriter26.write(0, bufValue2); /* 120 */ result24.pointTo(bufferHolder25.buffer, bufferHolder25.totalSize()); /* 121 */ currentRow = result24; /* 122 */ return; /* 124 */ } /* 125 */ ``` cc nongli Author: Davies Liu <davies@databricks.com> Closes #10840 from davies/gen_agg.	2016-01-20 15:24:01 -08:00
Herman van Hovell	1017327930	[SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. This PR changes this behavior: a decimal literal is now converted into an exact ```BigDecimal```. The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double. This PR replaces the ```BigDecimal``` literal by a ```Double``` literal, because the ```BigDecimal``` is the default now. You can use the double literal by appending a 'D' to the value, for instance: ```3.141527D``` cc davies rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10796 from hvanhovell/SPARK-12848.	2016-01-20 15:13:01 -08:00
Wenchen Fan	f3934a8d65	[SPARK-12888][SQL] benchmark the new hash expression Benchmark it on 4 different schemas, the result: ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For simple: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 31.47 266.54 1.00 X codegen version 64.52 130.01 0.49 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For normal: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 4068.11 0.26 1.00 X codegen version 1175.92 0.89 3.46 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For array: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 9276.70 0.06 1.00 X codegen version 14762.23 0.04 0.63 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For map: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 58869.79 0.01 1.00 X codegen version 9285.36 0.06 6.34 X ``` Author: Wenchen Fan <wenchen@databricks.com> Closes #10816 from cloud-fan/hash-benchmark.	2016-01-20 15:08:27 -08:00
gatorsmile	8f90c15187	[SPARK-12616][SQL] Making Logical Operator `Union` Support Arbitrary Number of Children The existing `Union` logical operator only supports two children. Thus, adding a new logical operator `Unions` which can have arbitrary number of children to replace the existing one. `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single `Unions`. Note that this problem doesn't exist in physical plan, because the physical `Unions` already supports arbitrary number of children. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10577 from gatorsmile/unionAllMultiChildren.	2016-01-20 14:59:30 -08:00
Rajesh Balamohan	ab4a6bfd11	[SPARK-12898] Consider having dummyCallSite for HiveTableScan Currently, HiveTableScan runs with getCallSite which is really expensive and shows up when scanning through large table with partitions (e.g TPC-DS) which slows down the overall runtime of the job. It would be good to consider having dummyCallSite in HiveTableScan. Author: Rajesh Balamohan <rbalamohan@apache.org> Closes #10825 from rajeshbalamohan/SPARK-12898.	2016-01-20 11:30:03 -08:00
Rajesh Balamohan	e75e340a40	[SPARK-12925][SQL] Improve HiveInspectors.unwrap for StringObjectIns… Text is in UTF-8 and converting it via "UTF8String.fromString" incurs decoding and encoding, which turns out to be expensive and redundant. Profiler snapshot details is attached in the JIRA (ref:https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png) Author: Rajesh Balamohan <rbalamohan@apache.org> Closes #10848 from rajeshbalamohan/SPARK-12925.	2016-01-20 11:20:26 -08:00
Davies Liu	8e4f894e98	[SPARK-12881] [SQL] subexpress elimination in mutable projection Author: Davies Liu <davies@databricks.com> Closes #10814 from davies/mutable_subexpr.	2016-01-20 10:02:40 -08:00
Reynold Xin	753b194511	[SPARK-12912][SQL] Add a test suite for EliminateSubQueries Also updated documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than analyzer. Author: Reynold Xin <rxin@databricks.com> Closes #10837 from rxin/optimizer-analyzer-comment.	2016-01-20 00:00:28 -08:00
hyukjinkwon	6844d36aea	[SPARK-12871][SQL] Support to specify the option for compression codec. https://issues.apache.org/jira/browse/SPARK-12871 This PR adds an option for specifying the compression codec: the option `codec`, with `compression` as an alias, as filed in [SPARK-12668](https://issues.apache.org/jira/browse/SPARK-12668). Note that I did not add configurations for Hadoop 1.x, as this `CsvRelation` uses the Hadoop 2.x API and I guess Hadoop 1.x support is going to be dropped. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10805 from HyukjinKwon/SPARK-12420.	2016-01-19 20:45:52 -08:00
Reynold Xin	3e84ef0a54	[SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen The three optimization cases are: 1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch. 2. If a branch's condition is a false or null literal, remove that branch. 3. If only the else branch is left, remove the CaseWhen and use the value from the else branch. Author: Reynold Xin <rxin@databricks.com> Closes #10827 from rxin/SPARK-12770.	2016-01-19 16:14:41 -08:00
Jakob Odersky	c78e2080e0	[SPARK-12816][SQL] De-alias type when generating schemas Call `dealias` on local types to fix schema generation for abstract type members, such as ```scala type KeyValue = (Int, String) ``` Add simple test Author: Jakob Odersky <jodersky@gmail.com> Closes #10749 from jodersky/aliased-schema.	2016-01-19 12:31:03 -08:00
Imran Rashid	4dbd316122	[SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings See https://issues.apache.org/jira/browse/SPARK-12560 This isn't causing any problems currently because the tests for string predicate pushdown are currently disabled. I ran into this while trying to turn them back on with a different version of parquet. Figure it was good to fix now in any case. Author: Imran Rashid <irashid@cloudera.com> Closes #10510 from squito/SPARK-12560.	2016-01-19 12:24:21 -08:00
gatorsmile	b72e01e821	[SPARK-12867][SQL] Nullability of Intersect can be stricter JIRA: https://issues.apache.org/jira/browse/SPARK-12867 When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make nullability of `intersect` stricter. liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10812 from gatorsmile/nullabilityIntersect.	2016-01-19 11:35:58 -08:00
Andrew Or	b122c861cd	[SPARK-12887] Do not expose var's in TaskMetrics This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. TaskMetrics has a bunch of var's, some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug. Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them. Parent PR: #10717 Author: Andrew Or <andrew@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Author: andrewor14 <andrew@databricks.com> Closes #10815 from andrewor14/get-or-create-metrics.	2016-01-19 10:58:51 -08:00
Wenchen Fan	e14817b528	[SPARK-12870][SQL] better format bucket id in file name For a normal parquet file without buckets, the file name ends with a jobUUID, which may be all digits and be mistakenly regarded as a bucket id. This PR improves the format of the bucket id in the file name by using a different separator, `_`, so that the regex is more robust. Author: Wenchen Fan <wenchen@databricks.com> Closes #10799 from cloud-fan/fix-bucket.	2016-01-19 10:44:51 -08:00
proflin	c00744e60f	[SQL][MINOR] Fix one little mismatched comment according to the codes in interface.scala Author: proflin <proflin.me@gmail.com> Closes #10824 from proflin/master.	2016-01-19 00:15:43 -08:00
hyukjinkwon	453dae5671	[SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas and R https://issues.apache.org/jira/browse/SPARK-12668 The Spark CSV datasource is being merged (filed in [SPARK-12420](https://issues.apache.org/jira/browse/SPARK-12420)). This is a quick PR that simply adds aliases for several CSV options so that they match Pandas and R: - delimiter -> sep - charset -> encoding Author: hyukjinkwon <gurwls223@gmail.com> Closes #10800 from HyukjinKwon/SPARK-12668.	2016-01-18 21:42:07 -08:00
gatorsmile	74ba84b64c	[HOT][BUILD] Changed the import order This PR is to fix the master's build break. The following tests failed due to the import order issues in the master. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49651/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49652/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49653/consoleFull Author: gatorsmile <gatorsmile@gmail.com> Closes #10823 from gatorsmile/importOrder.	2016-01-18 19:40:10 -08:00
Davies Liu	323d51f1da	[SPARK-12700] [SQL] embed condition into SMJ and BroadcastHashJoin Currently SortMergeJoin and BroadcastHashJoin do not support a condition, so they need a following Filter for that. The result projection that generates UnsafeRow can be very expensive if the join produces many rows that are then mostly filtered out by the condition. This PR brings support for conditions to SortMergeJoin and BroadcastHashJoin, just like other outer joins do. This could improve the performance of Q72 by 7x (from 120s to 16.5s). Author: Davies Liu <davies@databricks.com> Closes #10653 from davies/filter_join.	2016-01-18 17:29:54 -08:00
Reynold Xin	39ac56fc60	[SPARK-12889][SQL] Rename ParserDialect -> ParserInterface. Based on discussions in #10801, I'm submitting a pull request to rename ParserDialect to ParserInterface. Author: Reynold Xin <rxin@databricks.com> Closes #10817 from rxin/SPARK-12889.	2016-01-18 17:10:32 -08:00
Wenchen Fan	404190221a	[SPARK-12882][SQL] simplify bucket tests and add more comments Right now, the bucket tests are kind of hard to understand. This PR simplifies them and adds more comments. Author: Wenchen Fan <wenchen@databricks.com> Closes #10813 from cloud-fan/bucket-comment.	2016-01-18 15:10:04 -08:00
Wenchen Fan	4f11e3f2aa	[SPARK-12841][SQL] fix cast in filter In SPARK-10743 we wrap cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like `filter`, the `UnresolvedAlias` can't be resolved and actually we don't need a better alias for this case. This PR moves the cast wrapping logic to `Column.named` so that we only do it when we need an alias name. Author: Wenchen Fan <wenchen@databricks.com> Closes #10781 from cloud-fan/bug.	2016-01-18 14:15:27 -08:00
Reynold Xin	38c3c0e31a	[SPARK-12855][SQL] Remove parser dialect developer API This pull request removes the public developer parser API for external parsers. Given everything a parser depends on (e.g. logical plans and expressions) are internal and not stable, external parsers will break with every release of Spark. It is a bad idea to create the illusion that Spark actually supports pluggable parsers. In addition, this also reduces incentives for 3rd party projects to contribute parse improvements back to Spark. Author: Reynold Xin <rxin@databricks.com> Closes #10801 from rxin/SPARK-12855.	2016-01-18 13:55:42 -08:00
Reynold Xin	44fcf992aa	[SPARK-12873][SQL] Add more comment in HiveTypeCoercion for type widening I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo. I also simplified WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code, and the new code. The logic is super simple. Author: Reynold Xin <rxin@databricks.com> Closes #10802 from rxin/SPARK-12873.	2016-01-18 11:08:44 -08:00
Dilip Biswal	db9a860589	[SPARK-12558][FOLLOW-UP] AnalysisException when multiple functions applied in GROUP BY clause Addresses the comments from Yin. https://github.com/apache/spark/pull/10520 Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10758 from dilipbiswal/spark-12558-followup.	2016-01-18 10:28:01 -08:00
Wenchen Fan	cede7b2a11	[SPARK-12860] [SQL] speed up safe projection for primitive types The idea is simple, use `SpecificMutableRow` instead of `GenericMutableRow` as result row for safe projection. A simple benchmark shows about 1.5x speed up for primitive types, code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-safeprojectionbenchmark-scala Author: Wenchen Fan <wenchen@databricks.com> Closes #10790 from cloud-fan/safe-projection.	2016-01-17 09:11:43 -08:00
Davies Liu	3c0d2365d5	[SPARK-12796] [SQL] Whole stage codegen This is the initial work for whole stage codegen. It supports Projection/Filter/Range; we will continue to work on this to support more physical operators. A micro benchmark shows that a query with range, filter and projection could be 3X faster than before. It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen will be inserted into it, for example, the following plan ``` Limit 10 +- Project [(id#5L + 1) AS (id + 1)#6L] +- Filter ((id#5L & 1) = 1) +- Range 0, 1, 4, 10, [id#5L] ``` will be translated into ``` Limit 10 +- WholeStageCodegen +- Project [(id#1L + 1) AS (id + 1)#2L] +- Filter ((id#1L & 1) = 1) +- Range 0, 1, 4, 10, [id#1L] ``` Here is the call graph to generate Java source for A and B (A supports codegen, but B does not): ``` * WholeStageCodegen Plan A FakeInput Plan B * ========================================================================= * * -> execute() * \| * doExecute() --------> produce() * \| * doProduce() -------> produce() * \| * doProduce() ---> execute() * \| * consume() * doConsume() ------------\| * \| * doConsume() <----- consume() ``` A SparkPlan that supports codegen needs to implement doProduce() and doConsume(): ``` def doProduce(ctx: CodegenContext): (RDD[InternalRow], String) def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String ``` Author: Davies Liu <davies@databricks.com> Closes #10735 from davies/whole2.	2016-01-16 10:29:27 -08:00
Wenchen Fan	2f7d0b68a2	[SPARK-12856] [SQL] speed up hashCode of unsafe array We used to iterate over the bytes to calculate hashCode, but now we have `Murmur3_x86_32.hashUnsafeBytes`, which doesn't require the bytes to be word aligned, so we should use that instead. A simple benchmark shows it's about 3X faster, benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala Author: Wenchen Fan <wenchen@databricks.com> Closes #10784 from cloud-fan/array-hashcode.	2016-01-16 00:38:17 -08:00
Davies Liu	242efb7546	[SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions) into code generated classes This is a refactor to support codegen for aggregation and broadcast join. Author: Davies Liu <davies@databricks.com> Closes #10777 from davies/rename2.	2016-01-15 19:07:42 -08:00
Nong Li	9039333c0a	[SPARK-12644][SQL] Update parquet reader to be vectorized. This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch. There are a few particulars in the Parquet encodings that make this much more efficient. In particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are also well suited for this. This is a work in progress and does not affect the current execution. In subsequent patches, we will support more encodings and types before enabling this. Simple benchmarks indicate this can decode single ints more than 3x faster. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10593 from nongli/spark-12644.	2016-01-15 17:40:26 -08:00
Wenchen Fan	3b5ccb12b8	[SPARK-12649][SQL] support reading bucketed table This PR adds the support to read bucketed tables, and correctly populate `outputPartitioning`, so that we can avoid shuffle for some cases. TODO(follow-up PRs): * bucket pruning * avoid shuffle for bucketed table join when use any super-set of the bucketing key. (we should re-visit it after https://issues.apache.org/jira/browse/SPARK-12704 is fixed) * recognize hive bucketed table Author: Wenchen Fan <wenchen@databricks.com> Closes #10604 from cloud-fan/bucket-read.	2016-01-15 17:20:01 -08:00
Yin Huai	f6ddbb360a	[SPARK-12833][HOT-FIX] Reset the locale after we set it. Author: Yin Huai <yhuai@databricks.com> Closes #10778 from yhuai/resetLocale.	2016-01-15 16:03:05 -08:00
Herman van Hovell	7cd7f22025	[SPARK-12575][SQL] Grammar parity with existing SQL parser In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base. Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out: - The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR removes this keyword. - The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is not supported anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this. - Hive has a charset name char set literal combination it supports, for instance the following expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive will only allow charset names to start with an underscore. This is quite annoying in spark because as soon as you use a tuple names will start with an underscore. In this PR we remove this feature from the parser. It would be quite easy to implement such a feature as an Expression later on. - Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed. cc rxin viirya marmbrus yhuai cloud-fan Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10745 from hvanhovell/SPARK-12575-2.	2016-01-15 15:19:10 -08:00
Wenchen Fan	3f1c58d60b	[SQL][MINOR] BoundReference do not need to be NamedExpression We made it a `NamedExpression` to work around some hacky cases a long time ago, and now it seems safe to remove it. Author: Wenchen Fan <wenchen@databricks.com> Closes #10765 from cloud-fan/minor.	2016-01-15 14:20:22 -08:00
Julien Baley	0bb73554a9	Fix typo disvoered => discovered Author: Julien Baley <julien.baley@gmail.com> Closes #10773 from julienbaley/patch-1.	2016-01-15 13:53:20 -08:00
Yin Huai	513266c042	[SPARK-12833][HOT-FIX] Fix scala 2.11 compilation. Seems `5f83c6991c` breaks scala 2.11 compilation. Author: Yin Huai <yhuai@databricks.com> Closes #10774 from yhuai/fixScala211Compile.	2016-01-15 13:17:29 -08:00
Hossein	5f83c6991c	[SPARK-12833][SQL] Initial import of spark-csv CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL. This is a first PR to bring the functionality to spark 2.0 master. We will complete the items outlined in the design document (see JIRA attachment) in follow-up pull requests. Author: Hossein <hossein@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #10766 from rxin/csv.	2016-01-15 11:46:46 -08:00
Davies Liu	c5e7076da7	[MINOR] [SQL] GeneratedExpressionCode -> ExprCode GeneratedExpressionCode is too long Author: Davies Liu <davies@databricks.com> Closes #10767 from davies/renaming.	2016-01-15 08:26:20 -08:00
Reynold Xin	fe7246fea6	[SPARK-12830] Java style: disallow trailing whitespaces. Author: Reynold Xin <rxin@databricks.com> Closes #10764 from rxin/SPARK-12830.	2016-01-14 23:33:45 -08:00
Michael Armbrust	cc7af86afd	[SPARK-12813][SQL] Eliminate serialization for back to back operations The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations. In order to achieve this I also made the following simplifications: - Operators no longer hold encoders; instead they have only the expressions that they need. The benefits here are twofold: the expressions are visible to transformations, so they go through the normal resolution/binding process, and now that they are visible we can change them on a case by case basis. - Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the compiler was an unnecessary complication. We still leverage the scala compiler in the companion factory when constructing a new operator, but after this the types are discarded. Deferred to a follow up PR: - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error though in the case of mismatches for an `as` operation. - Eliminate serializations in more cases by adding more cases to `EliminateSerialization` Author: Michael Armbrust <michael@databricks.com> Closes #10747 from marmbrus/encoderExpressions.	2016-01-14 17:44:56 -08:00
Reynold Xin	902667fd27	[SPARK-12771][SQL] Simplify CaseWhen code generation The generated code for CaseWhen uses a control variable "got" to make sure we do not evaluate more branches once a branch is true. Changing that to generate just simple "if / else" would be slightly more efficient. This closes #10737. Author: Reynold Xin <rxin@databricks.com> Closes #10755 from rxin/SPARK-12771.	2016-01-14 10:09:03 -08:00
Wenchen Fan	962e9bcf94	[SPARK-12756][SQL] use hash expression in Exchange This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and bucketed data source, which enables us to only shuffle one side when joining a bucketed table and a normal one. This PR also fixes the tests that are broken by the new hash behaviour in shuffle. Author: Wenchen Fan <wenchen@databricks.com> Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.	2016-01-13 22:43:28 -08:00
Reynold Xin	cbbcd8e425	[SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field. Prior to this pull request, each even position in "branches" represents the condition for each branch, and each odd position represents the value for each branch. Their use has been pretty confusing, with a lot of sliding windows or grouped(2) calls. Author: Reynold Xin <rxin@databricks.com> Closes #10734 from rxin/simplify-case.	2016-01-13 12:44:35 -08:00
Wenchen Fan	c2ea79f96a	[SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row https://issues.apache.org/jira/browse/SPARK-12642 Author: Wenchen Fan <wenchen@databricks.com> Closes #10694 from cloud-fan/hash-expr.	2016-01-13 12:29:02 -08:00
Liang-Chi Hsieh	63eee86cc6	[SPARK-9297] [SQL] Add covar_pop and covar_samp JIRA: https://issues.apache.org/jira/browse/SPARK-9297 Add two aggregation functions: covar_pop and covar_samp. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10029 from viirya/covar-funcs.	2016-01-13 10:26:55 -08:00
Kousuke Saruta	cb7b864a24	[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",") Fix the style violation (space before , and :). This PR is a followup for #10643 and rework of #10685 . Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10732 from sarutak/SPARK-12692-followup-sql.	2016-01-12 22:25:20 -08:00
Dilip Biswal	dc7b3870fc	[SPARK-12558][SQL] AnalysisException when multiple functions applied in GROUP BY clause cloud-fan Can you please take a look ? In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10520 from dilipbiswal/spark-12558.	2016-01-12 21:41:46 -08:00
Reynold Xin	b3b9ad23cf	[SPARK-12788][SQL] Simplify BooleanEquality by using casts. Author: Reynold Xin <rxin@databricks.com> Closes #10730 from rxin/SPARK-12788.	2016-01-12 18:45:55 -08:00
Nong Li	9247084962	[SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution. There are many potential benefits of having an efficient in memory columnar format as an alternative to UnsafeRow. This patch introduces ColumnarBatch/ColumnVector which starts this effort. The remaining implementation can be done as follow up patches. As stated in the JIRA, there are useful external components that operate on memory in a simple columnar format. ColumnarBatch would serve that purpose and could serve as a zero-serialization/zero-copy exchange for this use case. This patch supports running the underlying data either on heap or off heap. On heap runs a bit faster but we would need offheap for zero-copy exchanges. Currently, this mode is hidden behind one interface (ColumnVector). This differs from Parquet or the existing columnar cache because this is not intended to be used as a storage format. The focus is entirely on CPU efficiency as we expect to only have 1 of these batches in memory per task. The layout of the values is just dense arrays of the value type. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10628 from nongli/spark-12635.	2016-01-12 18:21:04 -08:00
Cheng Lian	8ed5f12d2b	[SPARK-12724] SQL generation support for persisted data source tables This PR implements SQL generation support for persisted data source tables. A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`. When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation. Author: Cheng Lian <lian@databricks.com> Closes #10712 from liancheng/spark-12724-datasources-sql-gen.	2016-01-12 14:19:53 -08:00
Reynold Xin	0d543b98f3	Revert "[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")" This reverts commit `8cfa218f4f`.	2016-01-12 12:56:52 -08:00
Reynold Xin	0ed430e315	[SPARK-12768][SQL] Remove CaseKeyWhen expression This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer. Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination. Author: Reynold Xin <rxin@databricks.com> Closes #10722 from rxin/SPARK-12768.	2016-01-12 11:13:08 -08:00
Robert Kruszewski	508592b1ba	[SPARK-9843][SQL] Make catalyst optimizer pass pluggable at runtime Let me know whether you'd like to see it in other place Author: Robert Kruszewski <robertk@palantir.com> Closes #10210 from robert3005/feature/pluggable-optimizer.	2016-01-12 11:09:28 -08:00
Reynold Xin	1d88879530	[SPARK-12762][SQL] Add unit test for SimplifyConditionals optimization rule This pull request does a few small things: 1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here. 2. Added unit test for SimplifyConditionals. 3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite Author: Reynold Xin <rxin@databricks.com> Closes #10716 from rxin/SPARK-12762.	2016-01-12 10:58:57 -08:00
Kousuke Saruta	8cfa218f4f	[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":") Fix the style violation (space before , and :). This PR is a followup for #10643. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10718 from sarutak/SPARK-12692-followup-sql.	2016-01-12 00:51:00 -08:00
Yin Huai	aaa2c3b628	[SPARK-11823] Ignores HiveThriftBinaryServerSuite's test jdbc cancel https://issues.apache.org/jira/browse/SPARK-11823 This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test. Author: Yin Huai <yhuai@databricks.com> Closes #10715 from yhuai/SPARK-11823-ignore.	2016-01-11 19:59:15 -08:00
Cheng Lian	36d493509d	[SPARK-12498][SQL][MINOR] BooleanSimplification simplification Scala syntax allows binary case classes to be used as infix operators in pattern matching. This PR makes use of this syntax sugar to make `BooleanSimplification` more readable. Author: Cheng Lian <lian@databricks.com> Closes #10445 from liancheng/boolean-simplification-simplification.	2016-01-11 18:42:26 -08:00
wangfei	473907adf6	[SPARK-12742][SQL] org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists exception ``` [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.LogicalPlanToSQLSuite * ABORTED * (325 milliseconds) [info] org.apache.spark.sql.AnalysisException: Table `t1` already exists.; [info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296) [info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33) [info] at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23) [info] at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [info] at java.lang.Thread.run(Thread.java:745) ``` /cc liancheng Author: wangfei <wangfei_hello@126.com> Closes #10682 from scwf/fix-test.	2016-01-11 18:18:44 -08:00
Herman van Hovell	fe9eb0b0ce	[SPARK-12576][SQL] Enable expression parsing in CatalystQl The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)``` We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10649 from hvanhovell/SPARK-12576.	2016-01-11 16:29:37 -08:00
Anatoliy Plastinin	9559ac5f74	[SPARK-12744][SQL] Change parsing JSON integers to timestamps to treat integers as number of seconds JIRA: https://issues.apache.org/jira/browse/SPARK-12744 This PR makes parsing JSON integers to timestamps consistent with casting behavior. Author: Anatoliy Plastinin <anatoliy.plastinin@gmail.com> Closes #10687 from antlypls/fix-json-timestamp-parsing.	2016-01-11 10:28:57 -08:00
Wenchen Fan	f253feff62	[SPARK-12539][FOLLOW-UP] always sort in partitioning writer address comments in #10498 , especially https://github.com/apache/spark/pull/10498#discussion_r49021259 Author: Wenchen Fan <wenchen@databricks.com> This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@databricks.com> Closes #10638 from cloud-fan/bucket-write.	2016-01-11 00:44:33 -08:00
Marcelo Vanzin	6439a82503	[SPARK-3873][BUILD] Enable import ordering error checking. Turn import ordering violations into build errors, plus a few adjustments to account for how the checker behaves. I'm a little on the fence about whether the existing code is right, but it's easier to appease the checker than to discuss what's the more correct order here. Plus a few fixes to imports that cropped in since my recent cleanups. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10612 from vanzin/SPARK-3873-enable.	2016-01-10 20:04:50 -08:00
Reynold Xin	b23c4521f5	[SPARK-12340] Fix overflow in various take functions. This is a follow-up for the original patch #10562. Author: Reynold Xin <rxin@databricks.com> Closes #10670 from rxin/SPARK-12340.	2016-01-09 11:21:58 -08:00
Liang-Chi Hsieh	95cd5d95ce	[SPARK-12577] [SQL] Better support of parentheses in partition by and order by clause of window function's over clause JIRA: https://issues.apache.org/jira/browse/SPARK-12577 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10620 from viirya/fix-parentheses.	2016-01-08 21:48:06 -08:00
Cheng Lian	d9447cac74	[SPARK-12593][SQL] Converts resolved logical plan back to SQL This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is to canonicalize Spark SQL native view support. The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized. The current version is still in WIP status, and is quite limited. Known limitations include: 1. The logical plan must be analyzed but not optimized The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary. 1. The logical plan must be created using HiveQL query string Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan ``` Filter (a#1 < 10) +- MetastoreRelation default, src, None ``` need to be canonicalized into the following form before SQL generation: ``` Project [a#1, b#2, c#3] +- Filter (a#1 < 10) +- MetastoreRelation default, src, None ``` Otherwise, the SQL generation process will have to handle a large number of special cases. 1. Only a fraction of expressions and basic logical plan operators are supported in this PR Currently, 95.7% (1720 out of 1798) query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings. Known unsupported components are: - Expressions - Part of math expressions - Part of string expressions (buggy?) - Null expressions - Calendar interval literal - Part of date time expressions - Complex type creators - Special `NOT` expressions, e.g. `NOT LIKE` and `NOT IN` - Logical plan operators/patterns - Cube, rollup, and grouping set - Script transformation - Generator - Distinct aggregation patterns that fit `DistinctAggregationRewriter` analysis rule - Window functions Support for window functions, generators, and cubes etc. will be added in follow-up PRs. This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner: * For all select queries, we try to convert it back to SQL * If the query plan is convertible, we parse the generated SQL into a new logical plan * Run the new logical plan instead of the original one If the query plan is inconvertible, the test case simply falls back to the original logic. TODO - [x] Fix failed test cases - [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.) - [x] Comments and documentation Author: Cheng Lian <lian@databricks.com> Closes #10541 from liancheng/sql-generation.	2016-01-08 14:08:13 -08:00
Liang-Chi Hsieh	cfe1ba56e4	[SPARK-12687] [SQL] Support from clause surrounded by `()`. JIRA: https://issues.apache.org/jira/browse/SPARK-12687 Some queries such as `(select 1 as a) union (select 2 as a)` can't work. This patch fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10660 from viirya/fix-union.	2016-01-08 09:50:41 -08:00
Sean Owen	b9c8353378	[SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs. Author: Sean Owen <sowen@cloudera.com> Closes #10570 from srowen/SPARK-12618.	2016-01-08 17:47:44 +00:00
Reynold Xin	726bd3c4ec	Fix indentation for the previous patch.	2016-01-07 21:15:43 -08:00
Kevin Yu	5028a001d5	[SPARK-12317][SQL] Support units (m,k,g) in SQLConf This PR continues from the previously closed PR 10314. In this PR, SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE accepts memory string conventions as input. For example, the user can now specify 10g for SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE in SQLConf file. marmbrus srowen : Can you help review these code changes? Thanks. Author: Kevin Yu <qyu@us.ibm.com> Closes #10629 from kevinyu98/spark-12317.	2016-01-07 21:13:17 -08:00
Kazuaki Ishizaki	34dbc8af21	[SPARK-12580][SQL] Remove string concatenations from usage and extended in @ExpressionDescription Use multi-line string literals for ExpressionDescription with ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` The policy is here, as described at https://github.com/apache/spark/pull/10488 Let's use multi-line string literals. If we have to have a line with more than 100 characters, let's use ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` to just bypass the line number requirement. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10524 from kiszk/SPARK-12580.	2016-01-07 13:56:34 -08:00
Jacek Laskowski	07b314a57a	[MINOR] Fix for BUILD FAILURE for Scala 2.11 It was introduced in `917d3fc069` /cc cloud-fan rxin Author: Jacek Laskowski <jacek@japila.pl> Closes #10636 from jaceklaskowski/fix-for-build-failure-2.11.	2016-01-07 10:39:46 -08:00
Sameer Agarwal	f194d9911a	[SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits https://issues.apache.org/jira/browse/SPARK-12662 cc yhuai Author: Sameer Agarwal <sameer@databricks.com> Closes #10626 from sameeragarwal/randomsplit.	2016-01-07 10:37:15 -08:00
Davies Liu	fd1dcfaf26	[SPARK-12542][SQL] support except/intersect in HiveQl Parse the SQL query with except/intersect in FROM clause for HiveQl. Author: Davies Liu <davies@databricks.com> Closes #10622 from davies/intersect.	2016-01-06 23:46:12 -08:00
Davies Liu	6a1c864ab6	[SPARK-12295] [SQL] external spilling for window functions This PR manages the memory used by window functions (buffered rows), and also enables external spilling. After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G. Author: Davies Liu <davies@databricks.com> Closes #10605 from davies/unsafe_window.	2016-01-06 23:21:52 -08:00
Nong Li	a74d743cc7	[SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do this. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10589 from nongli/spark-12640.	2016-01-06 19:20:43 -08:00
Wenchen Fan	917d3fc069	[SPARK-12539][SQL] support writing bucketed table This PR adds bucket write support to Spark SQL. Users can specify bucketing columns, numBuckets and sorting columns with or without partition columns. For example: ``` df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales") ``` When bucketing is used, we will calculate bucket id for each record, and group the records by bucket id. For each group, we will create a file with bucket id in its name, and write data into it. For each bucket file, if sorting columns are specified, the data will be sorted before write. Note that there may be multiple files for one bucket, as the data is distributed. Currently we store the bucket metadata at hive metastore in a non-hive-compatible way. We use different bucketing hash function compared to hive, so we can't be compatible anyway. Limitations: * Can't write bucketed data without hive metastore. * Can't insert bucketed data into existing hive tables. Author: Wenchen Fan <wenchen@databricks.com> Closes #10498 from cloud-fan/bucket-write.	2016-01-06 16:58:10 -08:00
Davies Liu	6f7ba6409a	[SPARK-12681] [SQL] split IdentifiersParser.g into two files To avoid having a huge Java source (over 64K loc) that can't be compiled. cc hvanhovell Author: Davies Liu <davies@databricks.com> Closes #10624 from davies/split_ident.	2016-01-06 15:54:00 -08:00
Herman van Hovell	ea489f14f1	[SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made: The ANTLR Parser & Supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean-up the ```ASTNode``` class, and to improve the error handling. The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project: - ```CatalystQl```: This implements Query and Expression parsing functionality. - ```SparkQl```: This is a subclass of ```CatalystQl``` and provides SQL/Core only functionality such as Explain and Describe. - ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10583 from hvanhovell/SPARK-12575.	2016-01-06 11:16:53 -08:00
Yash Datta	9061e777fd	[SPARK-11878][SQL] Eliminate distribute by in case group by is present with exactly the same grouping expressi For queries like : select <> from table group by a distribute by a we can eliminate distribute by ; since group by will anyways do a hash partitioning Also applicable when user uses Dataframe API Author: Yash Datta <Yash.Datta@guavus.com> Closes #9858 from saucam/eliminatedistribute.	2016-01-06 10:37:53 -08:00
QiangCai	5d871ea43e	[SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync I have closed pull request https://github.com/apache/spark/pull/10487. And I create this pull request to resolve the problem. spark jira https://issues.apache.org/jira/browse/SPARK-12340 Author: QiangCai <david.caiq@gmail.com> Closes #10562 from QiangCai/bugfix.	2016-01-06 18:13:07 +09:00
Liang-Chi Hsieh	b2467b3810	[SPARK-12578][SQL] Distinct should not be silently ignored when used in an aggregate function with OVER clause JIRA: https://issues.apache.org/jira/browse/SPARK-12578 Slightly update to Hive parser. We should keep the distinct keyword when used in an aggregate function with OVER clause. So the CheckAnalysis will detect it and throw exception later. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10557 from viirya/keep-distinct-hivesql.	2016-01-06 00:40:14 -08:00
Marcelo Vanzin	b3ba1be3b7	[SPARK-3873][TESTS] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10582 from vanzin/SPARK-3873-tests.	2016-01-05 19:07:39 -08:00
sureshthalamati	0d42292f6a	[SPARK-12504][SQL] Masking credentials in the sql plan explain output for JDBC data sources. This fix masks JDBC credentials in the explain output. URL patterns to specify credentials seem to vary between different databases. Added a new method to dialect to mask the credentials according to the database specific URL pattern. While adding tests I noticed explain output includes array variable for partitions ([Lorg.apache.spark.Partition;3ff74546,). Modified the code to include the first and last partition information. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #10452 from sureshthalamati/mask_jdbc_credentials_spark-12504.	2016-01-05 17:48:05 -08:00
Marcelo Vanzin	df8bd97520	[SPARK-3873][SQL] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10573 from vanzin/SPARK-3873-sql.	2016-01-05 16:48:59 -08:00
Nong	c26d174265	[SPARK-12636] [SQL] Update UnsafeRowParquetRecordReader to support reading files directly. As noted in the code, this change is to make this component easier to test in isolation. Author: Nong <nongli@gmail.com> Closes #10581 from nongli/spark-12636.	2016-01-05 13:47:24 -08:00

2844 commits