ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yin Huai	9c0654d36c	Revert "[SPARK-11544][SQL] sqlContext doesn't use PathFilter" This reverts commit `54db797025`.	2015-11-18 18:41:40 -08:00
Nong Li	6d0848b53b	[SPARK-11787][SQL] Improve Parquet scan performance when using flat schemas. This patch adds an alternate to the Parquet RecordReader from the parquet-mr project that is much faster for flat schemas. Instead of using the general converter mechanism from parquet-mr, this directly uses the lower level APIs from parquet-columnar and a customer RecordReader that directly assembles into UnsafeRows. This is optionally disabled and only used for supported schemas. Using the tpcds store sales table and doing a sum of increasingly more columns, the results are: For 1 Column: Before: 11.3M rows/second After: 18.2M rows/second For 2 Columns: Before: 7.2M rows/second After: 11.2M rows/second For 5 Columns: Before: 2.9M rows/second After: 4.5M rows/second Author: Nong Li <nong@databricks.com> Closes #9774 from nongli/parquet.	2015-11-18 18:38:45 -08:00
Reynold Xin	e61367b9f9	[SPARK-11833][SQL] Add Java tests for Kryo/Java Dataset encoders Also added some nicer error messages for incompatible types (private types and primitive types) for Kryo/Java encoder. Author: Reynold Xin <rxin@databricks.com> Closes #9823 from rxin/SPARK-11833.	2015-11-18 18:34:36 -08:00
Reynold Xin	5df08949f5	[SPARK-11810][SQL] Java-based encoder for opaque types in Datasets. This patch refactors the existing Kryo encoder expressions and adds support for Java serialization. Author: Reynold Xin <rxin@databricks.com> Closes #9802 from rxin/SPARK-11810.	2015-11-18 15:42:07 -08:00
Dilip Biswal	54db797025	[SPARK-11544][SQL] sqlContext doesn't use PathFilter Apply the user supplied pathfilter while retrieving the files from fs. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9652 from dilipbiswal/spark-11544.	2015-11-18 14:05:18 -08:00
JihongMa	09ad9533d5	[SPARK-11720][SQL][ML] Handle edge cases when count = 0 or 1 for Stats function return Double.NaN for mean/average when count == 0 for all numeric types that is converted to Double, Decimal type continue to return null. Author: JihongMa <linlin200605@gmail.com> Closes #9705 from JihongMA/SPARK-11720.	2015-11-18 13:03:37 -08:00
Davies Liu	94624eacb0	[SPARK-11739][SQL] clear the instantiated SQLContext Currently, if the first SQLContext is not removed after stopping SparkContext, a SQLContext could set there forever. This patch make this more robust. Author: Davies Liu <davies@databricks.com> Closes #9706 from davies/clear_context.	2015-11-18 11:53:28 -08:00
Wenchen Fan	dbf428c87a	[SPARK-11795][SQL] combine grouping attributes into a single NamedExpression we use `ExpressionEncoder.tuple` to build the result encoder, which assumes the input encoder should point to a struct type field if it’s non-flat. However, our keyEncoder always point to a flat field/fields: `groupingAttributes`, we should combine them into a single `NamedExpression`. Author: Wenchen Fan <wenchen@databricks.com> Closes #9792 from cloud-fan/agg.	2015-11-18 10:33:17 -08:00
Wenchen Fan	33b8373334	[SPARK-11725][SQL] correctly handle null inputs for UDF If user use primitive parameters in UDF, there is no way for him to do the null-check for primitive inputs, so we are assuming the primitive input is null-propagatable for this case and return null if the input is null. Author: Wenchen Fan <wenchen@databricks.com> Closes #9770 from cloud-fan/udf.	2015-11-18 10:23:12 -08:00
Wenchen Fan	cffb899c43	[SPARK-11803][SQL] fix Dataset self-join When we resolve the join operator, we may change the output of right side if self-join is detected. So in `Dataset.joinWith`, we should resolve the join operator first, and then get the left output and right output from it, instead of using `left.output` and `right.output` directly. Author: Wenchen Fan <wenchen@databricks.com> Closes #9806 from cloud-fan/self-join.	2015-11-18 10:15:50 -08:00
Reynold Xin	5e2b44474c	[SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803. Author: Reynold Xin <rxin@databricks.com> Closes #9789 from rxin/SPARK-11802.	2015-11-18 00:09:29 -08:00
Reynold Xin	91f4b6f2db	[SPARK-11797][SQL] collect, first, and take should use encoders for serialization They were previously using Spark's default serializer for serialization. Author: Reynold Xin <rxin@databricks.com> Closes #9787 from rxin/SPARK-11797.	2015-11-17 21:40:58 -08:00
Reynold Xin	ed8d1531f9	[SPARK-11793][SQL] Dataset should set the resolved encoders internally for maps. I also wrote a test case -- but unfortunately the test case is not working due to SPARK-11795. Author: Reynold Xin <rxin@databricks.com> Closes #9784 from rxin/SPARK-11503.	2015-11-17 19:02:44 -08:00
mayuanwen	e8833dd12c	[SPARK-11679][SQL] Invoking method " apply(fields: java.util.List[StructField])" in "StructType" gets ClassCastException In the previous method, fields.toArray will cast java.util.List[StructField] into Array[Object] which can not cast into Array[StructField], thus when invoking this method will throw "java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.sql.types.StructField;" I directly cast java.util.List[StructField] into Array[StructField] in this patch. Author: mayuanwen <mayuanwen@qiyi.com> Closes #9649 from jackieMaKing/Spark-11679.	2015-11-17 11:15:46 -08:00
Kevin Yu	e01865af0d	[SPARK-11447][SQL] change NullType to StringType during binaryComparison between NullType and StringType During executing PromoteStrings rule, if one side of binaryComparison is StringType and the other side is not StringType, the current code will promote(cast) the StringType to DoubleType, and if the StringType doesn't contain the numbers, it will get null value. So if it is doing <=> (NULL-safe equal) with Null, it will not filter anything, caused the problem reported by this jira. I proposal to the changes through this PR, can you review my code changes ? This problem only happen for <=>, other operators works fine. scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null)))) filteredDF: org.apache.spark.sql.DataFrame = [column: string] scala> filteredDF.show +------+ \|column\| +------+ +------+ scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null)))) filteredDF: org.apache.spark.sql.DataFrame = [column: string] scala> filteredDF.show +------+ \|column\| +------+ +------+ scala> df.registerTempTable("DF") scala> sqlContext.sql("select * from DF where 'column' = NULL") res27: org.apache.spark.sql.DataFrame = [column: string] scala> res27.show +------+ \|column\| +------+ +------+ Author: Kevin Yu <qyu@us.ibm.com> Closes #9720 from kevinyu98/working_on_spark-11447.	2015-11-16 22:54:29 -08:00
hyukjinkwon	75d2020731	[SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY As discussed https://github.com/apache/spark/pull/9660 https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte array and used a common function for writing metadata for Parquet. For the test for fixed-length byte array, I have tested and checked the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools). Author: hyukjinkwon <gurwls223@gmail.com> Closes #9754 from HyukjinKwon/SPARK-11694-followup.	2015-11-17 14:35:00 +08:00
Reynold Xin	fbad920dbf	[SPARK-11768][SPARK-9196][SQL] Support now function in SQL (alias for current_timestamp). This patch adds an alias for current_timestamp (now function). Also fixes SPARK-9196 to re-enable the test case for current_timestamp. Author: Reynold Xin <rxin@databricks.com> Closes #9753 from rxin/SPARK-11768.	2015-11-16 20:47:46 -08:00
Wenchen Fan	fd14936be7	[SPARK-11625][SQL] add java test for typed aggregate Author: Wenchen Fan <wenchen@databricks.com> Closes #9591 from cloud-fan/agg-test.	2015-11-16 15:32:49 -08:00
Bartlomiej Alberski	31296628ac	[SPARK-11553][SQL] Primitive Row accessors should not convert null to default value Invocation of getters for type extending AnyVal returns default value (if field value is null) instead of throwing NPE. Please check comments for SPARK-11553 issue for more details. Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com> Closes #9642 from alberskib/bugfix/SPARK-11553.	2015-11-16 15:14:38 -08:00
Zee Chen	985b38dd2f	[SPARK-11390][SQL] Query plan with/without filterPushdown indistinguishable …ishable Propagate pushed filters to PhyicalRDD in DataSourceStrategy.apply Author: Zee Chen <zeechen@us.ibm.com> Closes #9679 from zeocio/spark-11390.	2015-11-16 14:21:28 -08:00
hyukjinkwon	e388b39d10	[SPARK-11692][SQL] Support for Parquet logical types, JSON and BSON (embedded types) Parquet supports some JSON and BSON datatypes. They are represented as binary for BSON and string (UTF-8) for JSON internally. I searched a bit and found Apache drill also supports both in this way, [link](https://drill.apache.org/docs/parquet-format/). Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #9658 from HyukjinKwon/SPARK-11692.	2015-11-16 21:59:33 +08:00
hyukjinkwon	7f8eb3bf6e	[SPARK-11044][SQL] Parquet writer version fixed as version1 https://issues.apache.org/jira/browse/SPARK-11044 Spark writes a parquet file only with writer version1 ignoring the writer version given by user. So, in this PR, it keeps the writer version if given or sets version1 as default. Author: hyukjinkwon <gurwls223@gmail.com> Author: HyukjinKwon <gurwls223@gmail.com> Closes #9060 from HyukjinKwon/SPARK-11044.	2015-11-16 21:30:10 +08:00
Reynold Xin	42de5253f3	[SPARK-11745][SQL] Enable more JSON parsing options This patch adds the following options to the JSON data source, for dealing with non-standard JSON files: * `allowComments` (default `false`): ignores Java/C++ style comment in JSON records * `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names * `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes * `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012) To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options. Also updated documentation to explain these options. Scala ![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png) Python ![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png) Author: Reynold Xin <rxin@databricks.com> Closes #9724 from rxin/SPARK-11745.	2015-11-16 00:06:14 -08:00
Yin Huai	3e2e1873b2	[SPARK-11738] [SQL] Making ArrayType orderable https://issues.apache.org/jira/browse/SPARK-11738 Author: Yin Huai <yhuai@databricks.com> Closes #9718 from yhuai/makingArrayOrderable.	2015-11-15 13:59:59 -08:00
Reynold Xin	d22fc10887	[SPARK-11734][SQL] Rename TungstenProject -> Project, TungstenSort -> Sort I didn't remove the old Sort operator, since we still use it in randomized tests. I moved it into test module and renamed it ReferenceSort. Author: Reynold Xin <rxin@databricks.com> Closes #9700 from rxin/SPARK-11734.	2015-11-15 10:33:53 -08:00
Yin Huai	d83c2f9f0b	[SPARK-11736][SQL] Add monotonically_increasing_id to function registry. https://issues.apache.org/jira/browse/SPARK-11736 Author: Yin Huai <yhuai@databricks.com> Closes #9703 from yhuai/MonotonicallyIncreasingID.	2015-11-14 21:04:18 -08:00
hyukjinkwon	139c15b624	[SPARK-11694][SQL] Parquet logical types are not being tested properly All the physical types are properly tested at `ParquetIOSuite` but logical type mapping is not being tested. Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #9660 from HyukjinKwon/SPARK-11694.	2015-11-14 18:36:01 +08:00
Wenchen Fan	23b8188f75	[SPARK-11654][SQL][FOLLOW-UP] fix some mistakes and clean up * rename `AppendColumn` to `AppendColumns` to be consistent with the physical plan name. * clean up stale comments. * always pass in resolved encoder to `TypedColumn.withInputType`(test added) * enable a mistakenly disabled java test. Author: Wenchen Fan <wenchen@databricks.com> Closes #9688 from cloud-fan/follow.	2015-11-13 11:13:09 -08:00
Yin Huai	7b5d9051cf	[SPARK-11678][SQL] Partition discovery should stop at the root path of the table. https://issues.apache.org/jira/browse/SPARK-11678 The change of this PR is to pass root paths of table to the partition discovery logic. So, the process of partition discovery stops at those root paths instead of going all the way to the root path of the file system. Author: Yin Huai <yhuai@databricks.com> Closes #9651 from yhuai/SPARK-11678.	2015-11-13 18:36:56 +08:00
Michael Armbrust	41bbd23004	[SPARK-11654][SQL] add reduce to GroupedDataset This PR adds a new method, `reduce`, to `GroupedDataset`, which allows similar operations to `reduceByKey` on a traditional `PairRDD`. ```scala val ds = Seq("abc", "xyz", "hello").toDS() ds.groupBy(_.length).reduce(_ + _).collect() // not actually commutative :P res0: Array(3 -> "abcxyz", 5 -> "hello") ``` While implementing this method and its test cases several more deficiencies were found in our encoder handling. Specifically, in order to support positional resolution, named resolution and tuple composition, it is important to keep the unresolved encoder around and to use it when constructing new `Datasets` with the same object type but different output attributes. We now divide the encoder lifecycle into three phases (that mirror the lifecycle of standard expressions) and have checks at various boundaries: - Unresoved Encoders: all users facing encoders (those constructed by implicits, static methods, or tuple composition) are unresolved, meaning they have only `UnresolvedAttributes` for named fields and `BoundReferences` for fields accessed by ordinal. - Resolved Encoders: internal to a `[Grouped]Dataset` the encoder is resolved, meaning all input has been resolved to a specific `AttributeReference`. Any encoders that are placed into a logical plan for use in object construction should be resolved. - BoundEncoder: Are constructed by physical plans, right before actual conversion from row -> object is performed. It is left to future work to add explicit checks for resolution and provide good error messages when it fails. We might also consider enforcing the above constraints in the type system (i.e. `fromRow` only exists on a `ResolvedEncoder`), but we should probably wait before spending too much time on this. Author: Michael Armbrust <michael@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #9673 from marmbrus/pr/9628.	2015-11-12 17:20:30 -08:00
JihongMa	d292f74831	[SPARK-11420] Updating Stddev support via Imperative Aggregate switched stddev support from DeclarativeAggregate to ImperativeAggregate. Author: JihongMa <linlin200605@gmail.com> Closes #9380 from JihongMA/SPARK-11420.	2015-11-12 13:47:34 -08:00
hyukjinkwon	f5a9526fec	[SPARK-10113][SQL] Explicit error message for unsigned Parquet logical types Parquet supports some unsigned datatypes. However, Since Spark does not support unsigned datatypes, it needs to emit an exception with a clear message rather then with the one saying illegal datatype. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9646 from HyukjinKwon/SPARK-10113.	2015-11-12 12:29:50 -08:00
Reynold Xin	30e7433643	[SPARK-11673][SQL] Remove the normal Project physical operator (and keep TungstenProject) Also make full outer join being able to produce UnsafeRows. Author: Reynold Xin <rxin@databricks.com> Closes #9643 from rxin/SPARK-11673.	2015-11-12 08:14:08 -08:00
Yin Huai	14cf753704	[SPARK-11661][SQL] Still pushdown filters returned by unhandledFilters. https://issues.apache.org/jira/browse/SPARK-11661 Author: Yin Huai <yhuai@databricks.com> Closes #9634 from yhuai/unhandledFilters.	2015-11-12 16:47:00 +08:00
Daoyuan Wang	39b1e36fbc	[SPARK-11396] [SQL] add native implementation of datetime function to_unix_timestamp `to_unix_timestamp` is the deterministic version of `unix_timestamp`, as it accepts at least one parameters. Since the behavior here is quite similar to `unix_timestamp`, I think the dataframe API is not necessary here. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #9347 from adrian-wang/to_unix_timestamp.	2015-11-11 20:36:21 -08:00
Reynold Xin	e49e723392	[SPARK-11675][SQL] Remove shuffle hash joins. Author: Reynold Xin <rxin@databricks.com> Closes #9645 from rxin/SPARK-11675.	2015-11-11 19:32:52 -08:00
Andrew Ray	b8ff6888e7	[SPARK-8992][SQL] Add pivot to dataframe api This adds a pivot method to the dataframe api. Following the lead of cube and rollup this adds a Pivot operator that is translated into an Aggregate by the analyzer. Currently the syntax is like: ~~courseSales.pivot(Seq($"year"), $"course", Seq("dotNET", "Java"), sum($"earnings"))~~ ~~Would we be interested in the following syntax also/alternatively? and~~ courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings")) //or courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings")) Later we can add it to `SQLParser`, but as Hive doesn't support it we cant add it there, right? ~~Also what would be the suggested Java friendly method signature for this?~~ Author: Andrew Ray <ray.andrew@gmail.com> Closes #7841 from aray/sql-pivot.	2015-11-11 16:23:24 -08:00
Reynold Xin	a9a6b80c71	[SPARK-11645][SQL] Remove OpenHashSet for the old aggregate. Author: Reynold Xin <rxin@databricks.com> Closes #9621 from rxin/SPARK-11645.	2015-11-11 12:48:51 -08:00
Reynold Xin	df97df2b39	[SPARK-11644][SQL] Remove the option to turn off unsafe and codegen. Author: Reynold Xin <rxin@databricks.com> Closes #9618 from rxin/SPARK-11644.	2015-11-11 12:47:02 -08:00
Josh Rosen	529a1d3380	[SPARK-6152] Use shaded ASM5 to support closure cleaning of Java 8 compiled classes This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8. In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml. http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes. I also added a new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code which was compiled targeting Java 8. Author: Josh Rosen <joshrosen@databricks.com> Closes #9512 from JoshRosen/SPARK-6152.	2015-11-11 11:16:39 -08:00
Wenchen Fan	ec2b807212	[SPARK-11564][SQL][FOLLOW-UP] clean up java tuple encoder We need to support custom classes like java beans and combine them into tuple, and it's very hard to do it with the TypeTag-based approach. We should keep only the compose-based way to create tuple encoder. This PR also move `Encoder` to `org.apache.spark.sql` Author: Wenchen Fan <wenchen@databricks.com> Closes #9567 from cloud-fan/java.	2015-11-11 10:52:23 -08:00
Wenchen Fan	9c57bc0efc	[SPARK-11656][SQL] support typed aggregate in project list insert `aEncoder` like we do in `agg` Author: Wenchen Fan <wenchen@databricks.com> Closes #9630 from cloud-fan/select.	2015-11-11 10:21:53 -08:00
Wenchen Fan	c964fc1015	[SQL][MINOR] rename present to finish in Aggregator Author: Wenchen Fan <wenchen@databricks.com> Closes #9617 from cloud-fan/tmp.	2015-11-11 10:19:09 -08:00
Michael Armbrust	724cf7a38c	[SPARK-11616][SQL] Improve toString for Dataset Author: Michael Armbrust <michael@databricks.com> Closes #9586 from marmbrus/dataset-toString.	2015-11-10 14:30:19 -08:00
Nong Li	87aedc48c0	[SPARK-10371][SQL] Implement subexpr elimination for UnsafeProjections This patch adds the building blocks for codegening subexpr elimination and implements it end to end for UnsafeProjection. The building blocks can be used to do the same thing for other operators. It introduces some utilities to compute common sub expressions. Expressions can be added to this data structure. The expr and its children will be recursively matched against existing expressions (ones previously added) and grouped into common groups. This is built using the existing `semanticEquals`. It does not understand things like commutative or associative expressions. This can be done as future work. After building this data structure, the codegen process takes advantage of it by: 1. Generating a helper function in the generated class that computes the common subexpression. This is done for all common subexpressions that have at least two occurrences and the expression tree is sufficiently complex. 2. When generating the apply() function, if the helper function exists, call that instead of regenerating the expression tree. Repeated calls to the helper function shortcircuit the evaluation logic. Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #9480 from nongli/spark-10371.	2015-11-10 11:28:53 -08:00
Wenchen Fan	53600854c2	[SPARK-11590][SQL] use native json_tuple in lateral view Author: Wenchen Fan <wenchen@databricks.com> Closes #9562 from cloud-fan/json-tuple.	2015-11-10 11:21:31 -08:00
Wenchen Fan	dfcfcbcc04	[SPARK-11578][SQL][FOLLOW-UP] complete the user facing api for typed aggregation Currently the user facing api for typed aggregation has some limitations: * the customized typed aggregation must be the first of aggregation list * the customized typed aggregation can only use long as buffer type * the customized typed aggregation can only use flat type as result type This PR tries to remove these limitations. Author: Wenchen Fan <wenchen@databricks.com> Closes #9599 from cloud-fan/agg.	2015-11-10 11:14:25 -08:00
Yin Huai	e0701c7560	[SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s https://issues.apache.org/jira/browse/SPARK-9830 This PR contains the following main changes. * Removing `AggregateExpression1`. * Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`. * Removing planner rule used to plan `Aggregate`. * Linking `MultipleDistinctRewriter` to analyzer. * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`. * Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`. * Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved). Author: Yin Huai <yhuai@databricks.com> Closes #9556 from yhuai/removeAgg1.	2015-11-10 11:06:29 -08:00
Davies Liu	521b3cae11	[SPARK-11598] [SQL] enable tests for ShuffledHashOuterJoin Author: Davies Liu <davies@databricks.com> Closes #9573 from davies/join_condition.	2015-11-09 23:28:32 -08:00
Michael Armbrust	9c740a9ddf	[SPARK-11578][SQL] User API for Typed Aggregation This PR adds a new interface for user-defined aggregations, that can be used in `DataFrame` and `Dataset` operations to take all of the elements of a group and reduce them to a single value. For example, the following aggregator extracts an `int` from a specific class and adds them up: ```scala case class Data(i: Int) val customSummer = new Aggregator[Data, Int, Int] { def prepare(d: Data) = d.i def reduce(l: Int, r: Int) = l + r def present(r: Int) = r }.toColumn() val ds: Dataset[Data] = ... val aggregated = ds.select(customSummer) ``` By using helper functions, users can make a generic `Aggregator` that works on any input type: ```scala /** An `Aggregator` that adds up any numeric type returned by the given function. / class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable { val numeric = implicitly[Numeric[N]] override def zero: N = numeric.zero override def reduce(b: N, a: I): N = numeric.plus(b, f(a)) override def present(reduction: N): N = reduction } def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new SumOf(f).toColumn ``` These aggregators can then be used alongside other built-in SQL aggregations. ```scala val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS() ds .groupBy(_._1) .agg( sum(_._2), // The aggregator defined above. expr("sum(_2)").as[Int], // A built-in dynatically typed aggregation. count("")) // A built-in statically typed aggregation. .collect() res0: ("a", 30, 30, 2L), ("b", 3, 3, 2L), ("c", 1, 1, 1L) ``` The current implementation focuses on integrating this into the typed API, but currently only supports running aggregations that return a single long value as explained in `TypedAggregateExpression`. This will be improved in a followup PR. Author: Michael Armbrust <michael@databricks.com> Closes #9555 from marmbrus/dataset-useragg.	2015-11-09 16:11:00 -08:00
hyukjinkwon	9565c246ea	[SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code Actually this was resolved by https://github.com/apache/spark/pull/8275. But I found the JIRA issue for this is not marked as resolved since the PR above was made for another issue but the PR above resolved both. I commented that this is resolved by the PR above; however, I opened this PR as I would like to just add a little bit of corrections. In the previous PR, I refactored the test by not reducing just collecting filters; however, this would not test properly `And` filter (which is not given to the tests). I unintentionally changed this from the original way (before being refactored). In this PR, I just followed the original way to collect filters by reducing. I would like to close this if this PR is inappropriate and somebody would like this deal with it in the separate PR related with this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9554 from HyukjinKwon/SPARK-9557.	2015-11-09 15:20:50 -08:00
Wenchen Fan	fcb57e9c73	[SPARK-11564][SQL][FOLLOW-UP] improve java api for GroupedDataset created `MapGroupFunction`, `FlatMapGroupFunction`, `CoGroupFunction` Author: Wenchen Fan <wenchen@databricks.com> Closes #9564 from cloud-fan/map.	2015-11-09 15:16:47 -08:00
Wenchen Fan	d8b50f7029	[SPARK-11453][SQL] append data to partitioned table will messes up the result The reason is that: 1. For partitioned hive table, we will move the partitioned columns after data columns. (e.g. `<a: Int, b: Int>` partition by `a` will become `<b: Int, a: Int>`) 2. When append data to table, we use position to figure out how to match input columns to table's columns. So when we append data to partitioned table, we will match wrong columns between input and table. A solution is reordering the input columns before match by position, like what we did for [`InsertIntoHadoopFsRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L101-L105) Author: Wenchen Fan <wenchen@databricks.com> Closes #9408 from cloud-fan/append.	2015-11-08 21:01:53 -08:00
Reynold Xin	97b7080cf2	[SPARK-11564][SQL] Dataset Java API audit A few changes: 1. Removed fold, since it can be confusing for distributed collections. 2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction) 3. Added more documentation and test cases. The other thing I'm considering doing is to have a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function. Author: Reynold Xin <rxin@databricks.com> Closes #9531 from rxin/SPARK-11564.	2015-11-08 20:57:09 -08:00
Wenchen Fan	b2d195e137	[SPARK-11554][SQL] add map/flatMap to GroupedDataset Author: Wenchen Fan <wenchen@databricks.com> Closes #9521 from cloud-fan/map.	2015-11-08 12:59:35 -08:00
Herman van Hovell	30c8ba71a7	[SPARK-11451][SQL] Support single distinct count on multiple columns. This PR adds support for multiple column in a single count distinct aggregate to the new aggregation path. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9409 from hvanhovell/SPARK-11451.	2015-11-08 11:06:10 -08:00
Wenchen Fan	7e9a9e603a	[SPARK-11269][SQL] Java API support & test cases for Dataset This simply brings https://github.com/apache/spark/pull/9358 up-to-date. Author: Wenchen Fan <wenchen@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #9528 from rxin/dataset-java.	2015-11-06 15:37:07 -08:00
Reynold Xin	3a652f691b	[SPARK-11561][SQL] Rename text data source's column name to value. Author: Reynold Xin <rxin@databricks.com> Closes #9527 from rxin/SPARK-11561.	2015-11-06 14:47:41 -08:00
Herman van Hovell	f328fedafd	[SPARK-11450] [SQL] Add Unsafe Row processing to Expand This PR enables the Expand operator to process and produce Unsafe Rows. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9414 from hvanhovell/SPARK-11450.	2015-11-06 12:21:53 -08:00
Imran Rashid	49f1a82037	[SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.	2015-11-06 20:06:24 +00:00
Yin Huai	8211aab079	[SPARK-9858][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up) https://issues.apache.org/jira/browse/SPARK-9858 This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses JoshRosen's comments. Author: Yin Huai <yhuai@databricks.com> Closes #9453 from yhuai/numReducer-followUp.	2015-11-06 11:13:51 -08:00
Cheng Lian	c048929c6a	[SPARK-10978][SQL][FOLLOW-UP] More comprehensive tests for PR #9399 This PR adds test cases that test various column pruning and filter push-down cases. Author: Cheng Lian <lian@databricks.com> Closes #9468 from liancheng/spark-10978.follow-up.	2015-11-06 11:11:36 -08:00
Liang-Chi Hsieh	574141a298	[SPARK-9162] [SQL] Implement code generation for ScalaUDF JIRA: https://issues.apache.org/jira/browse/SPARK-9162 Currently ScalaUDF extends CodegenFallback and doesn't provide code generation implementation. This path implements code generation for ScalaUDF. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9270 from viirya/scalaudf-codegen.	2015-11-06 10:52:04 -08:00
Michael Armbrust	363a476c3f	[SPARK-11528] [SQL] Typed aggregations for Datasets This PR adds the ability to do typed SQL aggregations. We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR. ```scala val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS() ds.groupBy(_._1).agg(sum("_2").as[Int]).collect() res0: Array(("a", 30), ("b", 3), ("c", 1)) ``` Author: Michael Armbrust <michael@databricks.com> Closes #9499 from marmbrus/dataset-agg.	2015-11-05 21:42:32 -08:00
Reynold Xin	6091e91fca	Revert "[SPARK-11469][SQL] Allow users to define nondeterministic udfs." This reverts commit `9cf56c96b7`.	2015-11-05 17:10:35 -08:00
Wenchen Fan	d9e30c59ce	[SPARK-10656][SQL] completely support special chars in DataFrame the main problem is: we interpret column name with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`. In these 2 cases, the column name is the final name already, we don't need extra process to interpret it. The solution is simple, use `queryExecution.analyzed.output` to get resolved column directly, instead of using `DataFrame.resolve`. close https://github.com/apache/spark/pull/8811 Author: Wenchen Fan <wenchen@databricks.com> Closes #9462 from cloud-fan/special-chars.	2015-11-05 14:53:16 -08:00
Davies Liu	81498dd5c8	[SPARK-11425] [SPARK-11486] Improve hybrid aggregation After aggregation, the dataset could be smaller than inputs, so it's better to do hash based aggregation for all inputs, then using sort based aggregation to merge them. Author: Davies Liu <davies@databricks.com> Closes #9383 from davies/fix_switch.	2015-11-04 21:30:21 -08:00
Reynold Xin	b6e0a5ae6f	[SPARK-11510][SQL] Remove SQL aggregation tests for higher order statistics We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one. Author: Reynold Xin <rxin@databricks.com> Closes #9475 from rxin/SPARK-11510.	2015-11-04 16:49:25 -08:00
Reynold Xin	abf5e4285d	[SPARK-11504][SQL] API audit for distributeBy and localSort 1. Renamed localSort -> sortWithinPartitions to avoid ambiguity in "local" 2. distributeBy -> repartition to match the existing repartition. Author: Reynold Xin <rxin@databricks.com> Closes #9470 from rxin/SPARK-11504.	2015-11-04 12:33:47 -08:00
Liang-Chi Hsieh	de289bf279	[SPARK-10304][SQL] Following up checking valid dir structure for partition discovery This patch follows up #8840. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9459 from viirya/detect_invalid_part_dir_following.	2015-11-04 10:56:32 -08:00
Reynold Xin	3bd6f5d2ae	[SPARK-11490][SQL] variance should alias var_samp instead of var_pop. stddev is an alias for stddev_samp. variance should be consistent with stddev. Also took the chance to remove internal Stddev and Variance, and only kept StddevSamp/StddevPop and VarianceSamp/VariancePop. Author: Reynold Xin <rxin@databricks.com> Closes #9449 from rxin/SPARK-11490.	2015-11-04 09:34:52 -08:00
Wenchen Fan	2692bdb7db	[SPARK-11455][SQL] fix case sensitivity of partition by depend on `caseSensitive` to do column name equality check, instead of just `==` Author: Wenchen Fan <wenchen@databricks.com> Closes #9410 from cloud-fan/partition.	2015-11-03 20:25:58 -08:00
Nong	e352de0db2	[SPARK-11329] [SQL] Cleanup from spark-11329 fix. Author: Nong <nong@cloudera.com> Closes #9442 from nongli/spark-11483.	2015-11-03 16:44:37 -08:00
Reynold Xin	5051262d4c	[SPARK-11489][SQL] Only include common first order statistics in GroupedData We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions. That is to say, after this change, we won't support ```scala df.groupBy("key").kurtosis("colA", "colB") ``` However, we will still support ```scala df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB"))) ``` Author: Reynold Xin <rxin@databricks.com> Closes #9446 from rxin/SPARK-11489.	2015-11-03 16:27:56 -08:00
Wenchen Fan	f6fcb4874c	[SPARK-11477] [SQL] support create Dataset from RDD Author: Wenchen Fan <wenchen@databricks.com> Closes #9434 from cloud-fan/rdd2ds and squashes the following commits: 0892d72 [Wenchen Fan] support create Dataset from RDD	2015-11-04 00:15:50 +01:00
Cheng Lian	ebf8b0b48d	[SPARK-10978][SQL] Allow data sources to eliminate filters This PR adds a new method `unhandledFilters` to `BaseRelation`. Data sources which implement this method properly may avoid the overhead of defensive filtering done by Spark SQL. Author: Cheng Lian <lian@databricks.com> Closes #9399 from liancheng/spark-10978.unhandled-filters.	2015-11-03 10:07:45 -08:00
Liang-Chi Hsieh	d6035d97c9	[SPARK-10304] [SQL] Partition discovery should throw an exception if the dir structure is invalid JIRA: https://issues.apache.org/jira/browse/SPARK-10304 This patch detects if the structure of partition directories is not valid. The test cases are from #8547. Thanks zhzhan. cc liancheng Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8840 from viirya/detect_invalid_part_dir.	2015-11-03 07:41:50 -08:00
Daoyuan Wang	d188a67762	[SPARK-10533][SQL] handle scientific notation in sqlParser https://issues.apache.org/jira/browse/SPARK-10533 val df = sqlContext.createDataFrame(Seq(("a",1.0),("b",2.0),("c",3.0))) df.filter("_2 < 2.0e1").show Scientific notation didn't work. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #9085 from adrian-wang/scinotation.	2015-11-03 22:30:23 +08:00
Michael Armbrust	b86f2cab67	[SPARK-11404] [SQL] Support for groupBy using column expressions This PR adds a new method `groupBy(cols: Column*)` to `Dataset` that allows users to group using column expressions instead of a lambda function. Since the return type of these expressions is not known at compile time, we just set the key type as a generic `Row`. If the user would like to work the key in a type-safe way, they can call `grouped.asKey[Type]`, which is also added in this PR. ```scala val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS() val grouped = ds.groupBy($"_1").asKey[String] val agged = grouped.mapGroups { case (g, iter) => Iterator((g, iter.map(_._2).sum)) } agged.collect() res0: Array(("a", 30), ("b", 3), ("c", 1)) ``` Author: Michael Armbrust <michael@databricks.com> Closes #9359 from marmbrus/columnGroupBy and squashes the following commits: bbcb03b [Michael Armbrust] Update DatasetSuite.scala 8fd2908 [Michael Armbrust] Update DatasetSuite.scala 0b0e2f8 [Michael Armbrust] [SPARK-11404] [SQL] Support for groupBy using column expressions	2015-11-03 13:02:17 +01:00
Wenchen Fan	425ff03f5a	[SPARK-11436] [SQL] rebind right encoder when join 2 datasets When we join 2 datasets, we will combine 2 encoders into a tupled one, and use it as the encoder for the jioned dataset. Assume both of the 2 encoders are flat, their `constructExpression`s both reference to the first element of input row. However, when we combine 2 encoders, the schema of input row changed, now the right encoder should reference to second element of input row. So we should rebind right encoder to let it know the new schema of input row before combine it. Author: Wenchen Fan <wenchen@databricks.com> Closes #9391 from cloud-fan/join and squashes the following commits: 846d3ab [Wenchen Fan] rebind right encoder when join 2 datasets	2015-11-03 12:47:39 +01:00
Yin Huai	d728d5c986	[SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins https://issues.apache.org/jira/browse/SPARK-9858 https://issues.apache.org/jira/browse/SPARK-9859 https://issues.apache.org/jira/browse/SPARK-9861 Author: Yin Huai <yhuai@databricks.com> Closes #9276 from yhuai/numReducer.	2015-11-03 00:12:49 -08:00
Yin Huai	9cf56c96b7	[SPARK-11469][SQL] Allow users to define nondeterministic udfs. This is the first task (https://issues.apache.org/jira/browse/SPARK-11469) of https://issues.apache.org/jira/browse/SPARK-11438 Author: Yin Huai <yhuai@databricks.com> Closes #9393 from yhuai/udfNondeterministic.	2015-11-02 21:18:38 -08:00
Nong Li	9cb5c731da	[SPARK-11329][SQL] Support star expansion for structs. 1. Supporting expanding structs in Projections. i.e. "SELECT s." where s is a struct type. This is fixed by allowing the expand function to handle structs in addition to tables. 2. Supporting expanding inside aggregate functions of structs. "SELECT max(struct(col1, structCol.*))" This requires recursively expanding the expressions. In this case, it it the aggregate expression "max(...)" and we need to recursively expand its children inputs. Author: Nong Li <nongli@gmail.com> Closes #9343 from nongli/spark-11329.	2015-11-02 20:32:08 -08:00
Nong Li	2cef1bb0b5	[SPARK-5354][SQL] Cached tables should preserve partitioning and ord… …ering. For cached tables, we can just maintain the partitioning and ordering from the source relation. Author: Nong Li <nongli@gmail.com> Closes #9404 from nongli/spark-5354.	2015-11-02 19:18:45 -08:00
Nong Li	046e32ed84	[SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY. DISTRIBUTE BY allows the user to hash partition the data by specified exprs. It also allows for optioning sorting within each resulting partition. There is no required relationship between the exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other). This patch adds to APIs to DataFrames which can be used together to provide this functionality: 1. distributeBy() which partitions the data frame into a specified number of partitions using the partitioning exprs. 2. localSort() which sorts each partition using the provided sorting exprs. To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...) Author: Nong Li <nongli@gmail.com> Closes #9364 from nongli/spark-11410.	2015-11-01 14:34:06 -08:00
Cheng Lian	aa494a9c2e	[SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources produce UnsafeRow This PR fixes two issues: 1. `PhysicalRDD.outputsUnsafeRows` is always `false` Thus a `ConvertToUnsafe` operator is often required even if the underlying data source relation does output `UnsafeRow`. 1. Internal/external row conversion for `HadoopFsRelation` is kinda messy Currently we're using `HadoopFsRelation.needConversion` and [dirty type erasure hacks][1] to indicate whether the relation outputs external row or internal row and apply external-to-internal conversion when necessary. Basically, all builtin `HadoopFsRelation` data sources, i.e. Parquet, JSON, ORC, and Text output `InternalRow`, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`. This PR adds a `private[sql]` interface method `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts `Row`s to `UnsafeRow`s (which are also `InternalRow`s). All builtin `HadoopFsRelation` data sources override this method and directly output `UnsafeRow`s. In this way, now `HadoopFsRelation` always produces `UnsafeRow`s. Thus `PhysicalRDD.outputsUnsafeRows` can be properly set by checking whether the underlying data source is a `HadoopFsRelation`. A remaining question is that, can we assume that all non-builtin `HadoopFsRelation` data sources output external rows? At least all well known ones do so. However it's possible that some users implemented their own `HadoopFsRelation` data sources that leverages `InternalRow` and thus all those unstable internal data representations. If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and cleanup some more conversion code (like [here][2] and [here][3]). This PR supersedes #9125. Follow-ups: 1. Makes JSON and ORC data sources output `UnsafeRow` directly 1. Makes `HiveTableScan` output `UnsafeRow` directly This is related to 1 since ORC data source shares the same `Writable` unwrapping code with `HiveTableScan`. [1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353 [2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335 [3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669 Author: Cheng Lian <lian@databricks.com> Closes #9305 from liancheng/spark-11345.unsafe-hadoop-fs-relation.	2015-10-31 21:16:09 -07:00
Jeff Zhang	97b3c8fb47	[SPARK-11226][SQL] Empty line in json file should be skipped Currently the empty line in json file will be parsed into Row with all null field values. But in json, "{}" represents a json object, empty line is supposed to be skipped. Make a trivial change for this. Author: Jeff Zhang <zjffdu@apache.org> Closes #9211 from zjffdu/SPARK-11226.	2015-10-31 11:10:37 +00:00
Yin Huai	3c471885dc	[SPARK-11434][SPARK-11103][SQL] Fix test ": Filter applied on merged Parquet schema with new column fails" https://issues.apache.org/jira/browse/SPARK-11434 Author: Yin Huai <yhuai@databricks.com> Closes #9387 from yhuai/SPARK-11434.	2015-10-30 20:05:07 -07:00
Wenchen Fan	14d08b9908	[SPARK-11393] [SQL] CoGroupedIterator should respect the fact that GroupedIterator.hasNext is not idempotent When we cogroup 2 `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we will consume right data and keep the left data unchanged. Then we call `hasNext` which will call `left.hasNext`. This will make `GroupedIterator` generate an extra group as the previous one has not been comsumed yet. Author: Wenchen Fan <wenchen@databricks.com> Closes #9346 from cloud-fan/cogroup and squashes the following commits: 9be67c8 [Wenchen Fan] SPARK-11393	2015-10-30 12:17:51 +01:00
hyukjinkwon	59db9e9c38	[SPARK-11103][SQL] Filter applied on Merged Parquet shema with new column fail When enabling mergedSchema and predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema. This is related with Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, it just simply disables predicate push down when using merged schema in this PR. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9327 from HyukjinKwon/SPARK-11103.	2015-10-30 18:17:35 +08:00
Davies Liu	56419cf11f	[SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative memory management This PR introduce a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is needed anymore, so removed. Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but those could benefit from this (trigger others' spilling). The PrepareRDD may be not needed anymore, could be removed in follow up PR. The following script will fail with OOM before this PR, finished in 150 seconds with 2G heap (also works in 1.5 branch, with similar duration). ```python sqlContext.setConf("spark.sql.shuffle.partitions", "1") df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s") df2 = df.select(df.id.alias('id2'), df.s.alias('s2')) j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2") j.explain() print j.count() ``` For thread-safety, here what I'm got: 1) Without calling spill(), the operators should only be used by single thread, no safety problems. 2) spill() could be triggered in two cases, triggered by itself, or by other operators. we can check trigger == this in spill(), so it's still in the same thread, so safety problems. 3) if it's triggered by other operators (right now cache will not trigger spill()), we only spill the data into disk when it's in scanning stage (building is finished), so the in-memory sorter or memory pages are read-only, we only need to synchronize the iterator and change it. 4) During scanning, the iterator will only use one record in one page, we can't free this page, because the downstream is currently using it (used by UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page, and dump all others into disk. In UnsafeExternalSorter, we keep the page that is used by current record (having the same baseObject), free it when loading the next record. In ShuffleExternalSorter, the spill() will not trigger during scanning. 5) In order to avoid deadlock, we didn't call acquireMemory during spill (so we reused the pointer array in InMemorySorter). Author: Davies Liu <davies@databricks.com> Closes #9241 from davies/force_spill.	2015-10-29 23:38:06 -07:00
Wenchen Fan	96cf87f66d	[SPARK-11301] [SQL] fix case sensitivity for filter on partitioned columns Author: Wenchen Fan <wenchen@databricks.com> Closes #9271 from cloud-fan/filter.	2015-10-29 16:36:52 -07:00
sethah	a01cbf5daa	[SPARK-10641][SQL] Add Skewness and Kurtosis Support Implementing skewness and kurtosis support based on following algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics Author: sethah <seth.hendrickson16@gmail.com> Closes #9003 from sethah/SPARK-10641.	2015-10-29 11:58:39 -07:00
Wenchen Fan	f79ebf2a9e	[SPARK-11370] [SQL] fix a bug in GroupedIterator and create unit test for it Before this PR, user has to consume the iterator of one group before process next group, or we will get into infinite loops. Author: Wenchen Fan <wenchen@databricks.com> Closes #9330 from cloud-fan/group.	2015-10-29 11:49:45 +01:00
Wenchen Fan	075ce4914f	[SPARK-11313][SQL] implement cogroup on DataSets (support 2 datasets) A simpler version of https://github.com/apache/spark/pull/9279, only support 2 datasets. Author: Wenchen Fan <wenchen@databricks.com> Closes #9324 from cloud-fan/cogroup2.	2015-10-28 13:58:52 +01:00
Cheng Hao	d9c6039897	[SPARK-10484] [SQL] Optimize the cartesian join with broadcast join for some cases In some cases, we can broadcast the smaller relation in cartesian join, which improve the performance significantly. Author: Cheng Hao <hao.cheng@intel.com> Closes #8652 from chenghao-intel/cartesian.	2015-10-27 20:26:38 -07:00
Michael Armbrust	5a5f65905a	[SPARK-11347] [SQL] Support for joinWith in Datasets This PR adds a new operation `joinWith` to a `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true. ```scala case class ClassData(a: String, b: Int) val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS() val ds2 = Seq(("a", 1), ("b", 2)).toDS() > ds1.joinWith(ds2, $"_1" === $"a").collect() res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2))) ``` This operation is similar to the relation `join` function with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`. This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common. ## Required Changes to Encoders In the process of working on this patch, several deficiencies to the way that we were handling encoders were discovered. Specifically, it turned out to be very difficult to `rebind` the non-expression based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples). As a result the following changes were made. - `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types. Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple. - All internal operations on `Dataset`s now require an `ExpressionEncoder`. If the users tries to pass a non-`ExpressionEncoder` in, an error will be thrown. We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the users code from the required remapping. This will give us a nice balance where we don't force user encoders to understand attribute references and binding, but still allow our native encoder to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step. - Additionally, the semantics for different types of objects are now better defined. As stated in the `ExpressionEncoder` scaladoc: - Classes will have their sub fields extracted by name using `UnresolvedAttribute` expressions and `UnresolvedExtractValue` expressions. - Tuples will have their subfields extracted by position using `BoundReference` expressions. - Primitives will have their values extracted from the first ordinal with a schema that defaults to the name `value`. - Finally, the binding lifecycle for `Encoders` has now been unified across the codebase. Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`. This process replaces an unresolved expressions with concrete `AttributeReference` expressions. Binding then happens on demand, when an encoder is going to be used to construct an object. This closely mirrors the lifecycle for standard expressions when executing normal SQL or `DataFrame` queries. Author: Michael Armbrust <michael@databricks.com> Closes #9300 from marmbrus/datasets-tuples.	2015-10-27 13:28:52 -07:00
Yanbo Liang	360ed832f5	[SPARK-11303][SQL] filter should not be pushed down into sample When sampling and then filtering DataFrame, the SQL Optimizer will push down filter into sample and produce wrong result. This is due to the sampler is calculated based on the original scope rather than the scope after filtering. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9294 from yanboliang/spark-11303.	2015-10-27 11:28:59 +01:00
Stephen De Gennaro	82464fb2e0	[SPARK-10947] [SQL] With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings Currently, when a schema is inferred from a JSON file using sqlContext.read.json, the primitive object types are inferred as string, long, boolean, etc. However, if the inferred type is too specific (JSON obviously does not enforce types itself), this can cause issues with merging dataframe schemas. This pull request adds the option "primitivesAsString" to the JSON DataFrameReader which when true (defaults to false if not set) will infer all primitives as strings. Below is an example usage of this new functionality. ``` val jsonDf = sqlContext.read.option("primitivesAsString", "true").json(sampleJsonFile) scala> jsonDf.printSchema() root \|-- bigInteger: string (nullable = true) \|-- boolean: string (nullable = true) \|-- double: string (nullable = true) \|-- integer: string (nullable = true) \|-- long: string (nullable = true) \|-- null: string (nullable = true) \|-- string: string (nullable = true) ``` Author: Stephen De Gennaro <stepheng@realitymine.com> Closes #9249 from stephend-realitymine/stephend-primitives.	2015-10-26 19:55:10 -07:00
Nong Li	d4c397a64a	[SPARK-11325] [SQL] Alias 'alias' in Scala's DataFrame API Author: Nong Li <nongli@gmail.com> Closes #9286 from nongli/spark-11325.	2015-10-26 18:27:02 -07:00
Frank Rosner	b60aab8a95	[SPARK-11258] Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory https://issues.apache.org/jira/browse/SPARK-11258 I was not able to locate an existing unit test for this function so I wrote one. Author: Frank Rosner <frank@fam-rosner.de> Closes #9222 from FRosner/master.	2015-10-26 15:46:59 -07:00
Wenchen Fan	07ced43424	[SPARK-11253] [SQL] reset all accumulators in physical operators before execute an action With this change, our query execution listener can get the metrics correctly. The UI still looks good after this change. <img width="257" alt="screen shot 2015-10-23 at 11 25 14 am" src="https://cloud.githubusercontent.com/assets/3182036/10683834/d516f37e-7978-11e5-8118-343ed40eb824.png"> <img width="494" alt="screen shot 2015-10-23 at 11 25 01 am" src="https://cloud.githubusercontent.com/assets/3182036/10683837/e1fa60da-7978-11e5-8ec8-178b88f27764.png"> Author: Wenchen Fan <wenchen@databricks.com> Closes #9215 from cloud-fan/metric.	2015-10-25 22:47:39 -07:00
Josh Rosen	85e654c5ec	[SPARK-10984] Simplify MemoryManager class structure This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes: - MemoryManager - StaticMemoryManager - ExecutorMemoryManager - TaskMemoryManager - ShuffleMemoryManager This is fairly confusing. To simplify things, this patch consolidates several of these classes: - ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager. - TaskMemoryManager is moved into Spark Core. Key changes and tasks: - [x] Merge ExecutorMemoryManager into MemoryManager. - [x] Move pooling logic into Allocator. - [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`. - [x] Refactor the existing Tungsten TaskMemoryManager interactions so Tungsten code use only this and not both this and ShuffleMemoryManager. - [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager. - [x] Merge ShuffleMemoryManager into MemoryManager. - [x] Move code - [x] ~~Simplify 1/n calculation.~~ Will defer to followup, since this needs more work.* - [x] Port ShuffleMemoryManagerSuite tests. - [x] Move classes from `unsafe` package to `memory` package. - [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction. - [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation - [x] AbstractBytesToBytesMapSuite - [x] UnsafeExternalSorterSuite - [x] UnsafeFixedWidthAggregationMapSuite - [x] UnsafeKVExternalSorterSuite Compatiblity notes: - This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DevloperAPI` (likely for legacy reasons): this class now cannot be used outside of a task. Author: Josh Rosen <joshrosen@databricks.com> Closes #9127 from JoshRosen/SPARK-10984.	2015-10-25 21:19:52 -07:00
Reynold Xin	e1a897b657	[SPARK-11274] [SQL] Text data source support for Spark SQL. This adds API for reading and writing text files, similar to SparkContext.textFile and RDD.saveAsTextFile. ``` SQLContext.read.text("/path/to/something.txt") DataFrame.write.text("/path/to/write.txt") ``` Using the new Dataset API, this also supports ``` val ds: Dataset[String] = SQLContext.read.text("/path/to/something.txt").as[String] ``` Author: Reynold Xin <rxin@databricks.com> Closes #9240 from rxin/SPARK-11274.	2015-10-23 13:04:06 -07:00
Reynold Xin	cdea0174e3	[SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package Author: Reynold Xin <rxin@databricks.com> Closes #9239 from rxin/types-private.	2015-10-23 00:00:21 -07:00
Michael Armbrust	53e83a3a77	[SPARK-11116][SQL] First Draft of Dataset API This PR adds a new experimental API to Spark, tentitively named Datasets. A `Dataset` is a strongly-typed collection of objects that can be transformed in parallel using functional or relational operations. Example usage is as follows: ### Functional ```scala > val ds: Dataset[Int] = Seq(1, 2, 3).toDS() > ds.filter(_ % 1 == 0).collect() res1: Array[Int] = Array(1, 2, 3) ``` ### Relational ```scala scala> ds.toDF().show() +-----+ \|value\| +-----+ \| 1\| \| 2\| \| 3\| +-----+ > ds.select(expr("value + 1").as[Int]).collect() res11: Array[Int] = Array(2, 3, 4) ``` ## Comparison to RDDs A `Dataset` differs from an `RDD` in the following ways: - The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be used to serialize the object into a binary format. Encoders are also capable of mapping the schema of a given object to the Spark SQL type system. In contrast, RDDs rely on runtime reflection based serialization. - Internally, a `Dataset` is represented by a Catalyst logical plan and the data is stored in the encoded form. This representation allows for additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing to an object. A `Dataset` can be converted to an `RDD` by calling the `.rdd` method. ## Comparison to DataFrames A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into specific Dataset by calling `df.as[ElementType]`. Similarly you can transform a strongly-typed `Dataset` to a generic DataFrame by calling `ds.toDF()`. ## Implementation Status and TODOs This is a rough cut at the least controversial parts of the API. The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API. The following is being deferred to future PRs: - Joins and Aggregations (prototype here `f11f91e6f0`) - Support for Java Additionally, the responsibility for binding an encoder to a given schema is currently done in a fairly ad-hoc fashion. This is an internal detail, and what we are doing today works for the cases we care about. However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding as we do in DataFrames). ## COMPATIBILITY NOTE Long term we plan to make `DataFrame` extend `Dataset[Row]`. However, making this change to che class hierarchy would break the function signatures for the existing function operations (map, flatMap, etc). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6. Author: Michael Armbrust <michael@databricks.com> Closes #9190 from marmbrus/dataset-infra.	2015-10-22 15:20:17 -07:00
Josh Rosen	f6d06adf05	[SPARK-10708] Consolidate sort shuffle implementations There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together. Author: Josh Rosen <joshrosen@databricks.com> Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.	2015-10-22 09:46:30 -07:00
Yanbo Liang	40a10d7675	[SPARK-9392][SQL] Dataframe drop should work on unresolved columns Dataframe drop should work on unresolved columns Author: Yanbo Liang <ybliang8@gmail.com> Closes #8821 from yanboliang/spark-9392.	2015-10-21 17:50:33 -07:00
Davies Liu	f8c6bec657	[SPARK-11197][SQL] run SQL on files directly This PR introduce a new feature to run SQL directly on files without create a table, for example: ``` select id from json.`path/to/json/files` as j ``` Author: Davies Liu <davies@databricks.com> Closes #9173 from davies/source.	2015-10-21 13:38:30 -07:00
Wenchen Fan	7c74ebca05	[SPARK-10743][SQL] keep the name of expression if possible when do cast Author: Wenchen Fan <cloud0fan@163.com> Closes #8859 from cloud-fan/cast.	2015-10-21 13:22:35 -07:00
Cheng Lian	89e6db6150	[SPARK-11153][SQL] Disables Parquet filter push-down for string and binary columns Due to PARQUET-251, `BINARY` columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, namely: - `StringType` - `BinaryType` - `DecimalType` (But Spark SQL doesn't support pushing down filters involving `DecimalType` columns for now.) To avoid wrong query results, we should disable filter push-down for columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8. Author: Cheng Lian <lian@databricks.com> Closes #9152 from liancheng/spark-11153.workaround-parquet-251. (cherry picked from commit `0887e5e878`) Signed-off-by: Cheng Lian <lian@databricks.com>	2015-10-21 09:02:59 +08:00
Davies Liu	67d468f8d9	[SPARK-11111] [SQL] fast null-safe join Currently, we use CartesianProduct for join with null-safe-equal condition. ``` scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain == Physical Plan == TungstenProject [i#2,j#3,i#7,j#8] Filter (i#2 <=> i#7) CartesianProduct LocalTableScan [i#2,j#3], [[1,1]] LocalTableScan [i#7,j#8], [[1,1]] ``` Actually, we can have an equal-join condition as `coalesce(i, default) = coalesce(b.i, default)`, then an partitioned join algorithm could be used. After this PR, the plan will become: ``` >>> sqlContext.sql("select * from a join b ON a.id <=> b.id").explain() TungstenProject [id#0L,id#1L] Filter (id#0L <=> id#1L) SortMergeJoin [coalesce(id#0L,0)], [coalesce(id#1L,0)] TungstenSort [coalesce(id#0L,0) ASC], false, 0 TungstenExchange hashpartitioning(coalesce(id#0L,0),200) ConvertToUnsafe Scan PhysicalRDD[id#0L] TungstenSort [coalesce(id#1L,0) ASC], false, 0 TungstenExchange hashpartitioning(coalesce(id#1L,0),200) ConvertToUnsafe Scan PhysicalRDD[id#1L] ``` Author: Davies Liu <davies@databricks.com> Closes #9120 from davies/null_safe.	2015-10-20 13:40:24 -07:00
Rishabh Bhardwaj	5966817941	[SPARK-11180][SQL] Support BooleanType in DataFrame.na.fill Added support for boolean types in fill and replace methods Author: Rishabh Bhardwaj <rbnext29@gmail.com> Closes #9166 from rishabhbhardwaj/master.	2015-10-19 14:38:58 -07:00
Wenchen Fan	7893cd95db	[SPARK-11119] [SQL] cleanup for unsafe array and map The purpose of this PR is to keep the unsafe format detail only inside the unsafe class itself, so when we use them(like use unsafe array in unsafe map, use unsafe array and map in columnar cache), we don't need to understand the format before use them. change list: * unsafe array's 4-bytes numElements header is now required(was optional), and become a part of unsafe array format. * w.r.t the previous changing, the `sizeInBytes` of unsafe array now counts the 4-bytes header. * unsafe map's format was `[numElements] [key array numBytes] [key array content(without numElements header)] [value array content(without numElements header)]` before, which is a little hacky as it makes unsafe array's header optional. I think saving 4 bytes is not a big deal, so the format is now: `[key array numBytes] [unsafe key array] [unsafe value array]`. * w.r.t the previous changing, the `sizeInBytes` of unsafe map now counts both map's header and array's header. Author: Wenchen Fan <wenchen@databricks.com> Closes #9131 from cloud-fan/unsafe.	2015-10-19 11:02:26 -07:00
zsxwing	beb8bc1ea5	[SPARK-11126][SQL] Fix the potential flaky test The unit test added in #9132 is flaky. This is a follow up PR to add `listenerBus.waitUntilEmpty` to fix it. Author: zsxwing <zsxwing@gmail.com> Closes #9163 from zsxwing/SPARK-11126-follow-up.	2015-10-19 00:06:51 -07:00
zsxwing	94c8fef296	[SPARK-11126][SQL] Fix a memory leak in SQLListener._stageIdToStageMetrics SQLListener adds all stage infos to `_stageIdToStageMetrics`, but only removes stage infos belonging to SQL executions. This PR fixed it by ignoring stages that don't belong to SQL executions. Reported by Terry Hoo in https://www.mail-archive.com/userspark.apache.org/msg38810.html Author: zsxwing <zsxwing@gmail.com> Closes #9132 from zsxwing/SPARK-11126.	2015-10-18 13:51:45 -07:00
tedyu	3895b2113a	[SPARK-11172] Close JsonParser/Generator in test Author: tedyu <yuzhihong@gmail.com> Closes #9157 from tedyu/master.	2015-10-18 02:12:56 -07:00
Koert Kuipers	57f83e36d6	[SPARK-10185] [SQL] Feat sql comma separated paths Make sure comma-separated paths get processed correcly in ResolvedDataSource for a HadoopFsRelationProvider Author: Koert Kuipers <koert@tresata.com> Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.	2015-10-17 14:56:24 -07:00
Josh Rosen	eb0b4d6e2d	[SPARK-11135] [SQL] Exchange incorrectly skips sorts when existing ordering is non-empty subset of required ordering In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. This bug was introduced in https://github.com/apache/spark/pull/7458, so it affects 1.5.0+. This patch fixes the bug and significantly improves the unit test coverage of Exchange's sort-planning logic. Author: Josh Rosen <joshrosen@databricks.com> Closes #9140 from JoshRosen/SPARK-11135.	2015-10-15 17:36:55 -07:00
Wenchen Fan	6a2359ff1f	[SPARK-10412] [SQL] report memory usage for tungsten sql physical operator https://issues.apache.org/jira/browse/SPARK-10412 some screenshots: ### aggregate: ![screen shot 2015-10-12 at 2 23 11 pm](https://cloud.githubusercontent.com/assets/3182036/10439534/618320a4-70ef-11e5-94d8-62ea7f2d1531.png) ### join ![screen shot 2015-10-12 at 2 23 29 pm](https://cloud.githubusercontent.com/assets/3182036/10439537/6724797c-70ef-11e5-8f75-0cf5cbd42048.png) Author: Wenchen Fan <wenchen@databricks.com> Author: Wenchen Fan <cloud0fan@163.com> Closes #8931 from cloud-fan/viz.	2015-10-15 14:50:58 -07:00
Andrew Or	3b364ff0a4	[SPARK-11078] Ensure spilling tests actually spill #9084 uncovered that many tests that test spilling don't actually spill. This is a follow-up patch to fix that to ensure our unit tests actually catch potential bugs in spilling. The size of this patch is inflated by the refactoring of `ExternalSorterSuite`, which had a lot of duplicate code and logic. Author: Andrew Or <andrew@databricks.com> Closes #9124 from andrewor14/spilling-tests.	2015-10-15 14:50:01 -07:00
Josh Rosen	4ace4f8a9c	[SPARK-11017] [SQL] Support ImperativeAggregates in TungstenAggregate This patch extends TungstenAggregate to support ImperativeAggregate functions. The existing TungstenAggregate operator only supported DeclarativeAggregate functions, which are defined in terms of Catalyst expressions and can be evaluated via generated projections. ImperativeAggregate functions, on the other hand, are evaluated by calling their `initialize`, `update`, `merge`, and `eval` methods. The basic strategy here is similar to how SortBasedAggregate evaluates both types of aggregate functions: use a generated projection to evaluate the expression-based declarative aggregates with dummy placeholder expressions inserted in place of the imperative aggregate function output, then invoke the imperative aggregate functions and target them against the aggregation buffer. The bulk of the diff here consists of code that was copied and adapted from SortBasedAggregate, with some key changes to handle TungstenAggregate's sort fallback path. Author: Josh Rosen <joshrosen@databricks.com> Closes #9038 from JoshRosen/support-interpreted-in-tungsten-agg-final.	2015-10-14 17:27:50 -07:00
Cheng Hao	1baaf2b9bd	[SPARK-10829] [SQL] Filter combine partition key and attribute doesn't work in DataSource scan ```scala withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") { withTempPath { dir => val path = s"${dir.getCanonicalPath}/part=1" (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path) // If the "part = 1" filter gets pushed down, this query will throw an exception since // "part" is not a valid column in the actual Parquet file checkAnswer( sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"), (2 to 3).map(i => Row(i, i.toString, 1))) } } ``` We expect the result to be: ``` 2,1 3,1 ``` But got ``` 1,1 2,1 3,1 ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #8916 from chenghao-intel/partition_filter.	2015-10-14 16:29:32 -07:00
Wenchen Fan	56d7da14ab	[SPARK-10104] [SQL] Consolidate different forms of table identifiers Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to represent table identifiers. We should only have one form and TableIdentifier is the best one because it provides methods to get table name, database name, return unquoted string, and return quoted string. Author: Wenchen Fan <wenchen@databricks.com> Author: Wenchen Fan <cloud0fan@163.com> Closes #8453 from cloud-fan/table-name.	2015-10-14 16:05:37 -07:00
Wenchen Fan	9a430a027f	[SPARK-11068] [SQL] [FOLLOW-UP] move execution listener to util Author: Wenchen Fan <wenchen@databricks.com> Closes #9119 from cloud-fan/callback.	2015-10-14 15:08:13 -07:00
Wenchen Fan	15ff85b316	[SPARK-11068] [SQL] add callback to query execution With this feature, we can track the query plan, time cost, exception during query execution for spark users. Author: Wenchen Fan <cloud0fan@163.com> Closes #9078 from cloud-fan/callback.	2015-10-13 17:59:32 -07:00
Wenchen Fan	e170c22160	[SPARK-11032] [SQL] correctly handle having We should not stop resolving having when the having condtion is resolved, or something like `count(1)` will crash. Author: Wenchen Fan <cloud0fan@163.com> Closes #9105 from cloud-fan/having.	2015-10-13 17:11:22 -07:00
Andrew Or	b3ffac5178	[SPARK-10983] Unified memory manager This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced: - spark.memory.fraction (default 0.75): fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. - spark.memory.storageFraction (default 0.5): size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region. - spark.memory.useLegacyMode (default false): whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility. For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000. Author: Andrew Or <andrew@databricks.com> Closes #9084 from andrewor14/unified-memory-manager.	2015-10-13 13:49:59 -07:00
Davies Liu	d0cc79ccd0	[SPARK-11030] [SQL] share the SQLTab across sessions The SQLTab will be shared by multiple sessions. If we create multiple independent SQLContexts (not using newSession()), will still see multiple SQLTabs in the Spark UI. Author: Davies Liu <davies@databricks.com> Closes #9048 from davies/sqlui.	2015-10-13 09:57:53 -07:00
Davies Liu	c4da5345a0	[SPARK-10990] [SPARK-11018] [SQL] improve unrolling of complex types This PR improve the unrolling and read of complex types in columnar cache: 1) Using UnsafeProjection to do serialization of complex types, so they will not be serialized three times (two for actualSize) 2) Copy the bytes from UnsafeRow/UnsafeArrayData to ByteBuffer directly, avoiding the immediate byte[] 3) Using the underlying array in ByteBuffer to create UTF8String/UnsafeRow/UnsafeArrayData without copy. Combine these optimizations, we can reduce the unrolling time from 25s to 21s (20% less), reduce the scanning time from 3.5s to 2.5s (28% less). ``` df = sqlContext.read.parquet(path) t = time.time() df.cache() df.count() print 'unrolling', time.time() - t for i in range(10): t = time.time() print df.select("*")._jdf.queryExecution().toRdd().count() print time.time() - t ``` The schema is ``` root \|-- a: struct (nullable = true) \| \|-- b: long (nullable = true) \| \|-- c: string (nullable = true) \|-- d: array (nullable = true) \| \|-- element: long (containsNull = true) \|-- e: map (nullable = true) \| \|-- key: long \| \|-- value: string (valueContainsNull = true) ``` Now the columnar cache depends on that UnsafeProjection support all the data types (including UDT), this PR also fix that. Author: Davies Liu <davies@databricks.com> Closes #9016 from davies/complex2.	2015-10-12 21:12:59 -07:00
Yin Huai	8a354bef55	[SPARK-11042] [SQL] Add a mechanism to ban creating multiple root SQLContexts/HiveContexts in a JVM https://issues.apache.org/jira/browse/SPARK-11042 Author: Yin Huai <yhuai@databricks.com> Closes #9058 from yhuai/SPARK-11042.	2015-10-12 13:50:34 -07:00
Cheng Lian	64b1d00e1a	[SPARK-11007] [SQL] Adds dictionary aware Parquet decimal converters For Parquet decimal columns that are encoded using plain-dictionary encoding, we can make the upper level converter aware of the dictionary, so that we can pre-instantiate all the decimals to avoid duplicated instantiation. Note that plain-dictionary encoding isn't available for `FIXED_LEN_BYTE_ARRAY` for Parquet writer version `PARQUET_1_0`. So currently only decimals written as `INT32` and `INT64` can benefit from this optimization. Author: Cheng Lian <lian@databricks.com> Closes #9040 from liancheng/spark-11007.decimal-converter-dict-support.	2015-10-12 10:17:19 -07:00
Rick Hillegas	12b7191d20	[SPARK-10855] [SQL] Add a JDBC dialect for Apache Derby marmbrus rxin This patch adds a JdbcDialect class, which customizes the datatype mappings for Derby backends. The patch also adds unit tests for the new dialect, corresponding to the existing tests for other JDBC dialects. JDBCSuite runs cleanly for me with this patch. So does JDBCWriteSuite, although it produces noise as described here: https://issues.apache.org/jira/browse/SPARK-10890 This patch is my original work, which I license to the ASF. I am a Derby contributor, so my ICLA is on file under SVN id "rhillegas": http://people.apache.org/committer-index.html Touches the following files: --------------------------------- org.apache.spark.sql.jdbc.JdbcDialects Adds a DerbyDialect. --------------------------------- org.apache.spark.sql.jdbc.JDBCSuite Adds unit tests for the new DerbyDialect. Author: Rick Hillegas <rhilleg@us.ibm.com> Closes #8982 from rick-ibm/b_10855.	2015-10-09 13:36:51 -07:00
Andrew Or	67fbecbf32	[SPARK-10956] Common MemoryManager interface for storage and execution This patch introduces a `MemoryManager` that is the central arbiter of how much memory to grant to storage and execution. This patch is primarily concerned only with refactoring while preserving the existing behavior as much as possible. This is the first step away from the existing rigid separation of storage and execution memory, which has several major drawbacks discussed on the [issue](https://issues.apache.org/jira/browse/SPARK-10956). It is the precursor of a series of patches that will attempt to address those drawbacks. Author: Andrew Or <andrew@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Author: andrewor14 <andrew@databricks.com> Closes #9000 from andrewor14/memory-manager.	2015-10-08 21:44:59 -07:00
Davies Liu	3390b400d0	[SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL This PR improve the sessions management by replacing the thread-local based to one SQLContext per session approach, introduce separated temporary tables and UDFs/UDAFs for each session. A new session of SQLContext could be created by: 1) create an new SQLContext 2) call newSession() on existing SQLContext For HiveContext, in order to reduce the cost for each session, the classloader and Hive client are shared across multiple sessions (created by newSession). CacheManager is also shared by multiple sessions, so cache a table multiple times in different sessions will not cause multiple copies of in-memory cache. Added jars are still shared by all the sessions, because SparkContext does not support sessions. cc marmbrus yhuai rxin Author: Davies Liu <davies@databricks.com> Closes #8909 from davies/sessions.	2015-10-08 17:34:24 -07:00
Reynold Xin	84ea287178	[SPARK-10914] UnsafeRow serialization breaks when two machines have different Oops size. UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM). To reproduce, launch Spark using MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" And then run the following scala> sql("select 1 xx").collect() Author: Reynold Xin <rxin@databricks.com> Closes #9030 from rxin/SPARK-10914.	2015-10-08 17:25:14 -07:00
Cheng Lian	02149ff08e	[SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format This PR refactors Parquet write path to follow parquet-format spec. It's a successor of PR #7679, but with less non-essential changes. Major changes include: 1. Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport` - Writes Parquet data using standard layout defined in parquet-format Specifically, we are now writing ... - ... arrays and maps in standard 3-level structure with proper annotations and field names - ... decimals as `INT32` and `INT64` whenever possible, and taking `FIXED_LEN_BYTE_ARRAY` as the final fallback - Supports legacy mode which is compatible with Spark 1.4 and prior versions The legacy mode is by default off, and can be turned on by flipping SQL option `spark.sql.parquet.writeLegacyFormat` to `true`. - Eliminates per value data type dispatching costs via prebuilt composed writer functions 1. Cleans up the last pieces of old Parquet support code As pointed out by rxin previously, we probably want to rename all those `Catalyst` Parquet classes to `Parquet` for clarity. But I'd like to do this in a follow-up PR to minimize code review noises in this one. Author: Cheng Lian <lian@databricks.com> Closes #8988 from liancheng/spark-8848/standard-parquet-write-path.	2015-10-08 16:18:35 -07:00
Josh Rosen	2816c89b6a	[SPARK-10988] [SQL] Reduce duplication in Aggregate2's expression rewriting logic In `aggregate/utils.scala`, there is a substantial amount of duplication in the expression-rewriting logic. As a prerequisite to supporting imperative aggregate functions in `TungstenAggregate`, this patch refactors this file so that the same expression-rewriting logic is used for both `SortAggregate` and `TungstenAggregate`. In order to allow both operators to use the same rewriting logic, `TungstenAggregationIterator. generateResultProjection()` has been updated so that it first evaluates all declarative aggregate functions' `evaluateExpression`s and writes the results into a temporary buffer, and then uses this temporary buffer and the grouping expressions to evaluate the final resultExpressions. This matches the logic in SortAggregateIterator, where this two-pass approach is necessary in order to support imperative aggregates. If this change turns out to cause performance regressions, then we can look into re-implementing the single-pass evaluation in a cleaner way as part of a followup patch. Since the rewriting logic is now shared across both operators, this patch also extracts that logic and places it in `SparkStrategies`. This makes the rewriting logic a bit easier to follow, I think. Author: Josh Rosen <joshrosen@databricks.com> Closes #9015 from JoshRosen/SPARK-10988.	2015-10-08 14:56:27 -07:00
Yin Huai	82d275f27c	[SPARK-10887] [SQL] Build HashedRelation outside of HashJoinNode. This PR refactors `HashJoinNode` to take a existing `HashedRelation`. So, we can reuse this node for both `ShuffledHashJoin` and `BroadcastHashJoin`. https://issues.apache.org/jira/browse/SPARK-10887 Author: Yin Huai <yhuai@databricks.com> Closes #8953 from yhuai/SPARK-10887.	2015-10-08 11:56:44 -07:00
0x0FFF	b8f849b546	[SPARK-7869][SQL] Adding Postgres JSON and JSONb data types support This PR addresses [SPARK-7869](https://issues.apache.org/jira/browse/SPARK-7869) Before the patch, attempt to load the table from Postgres with JSON/JSONb datatype caused error `java.sql.SQLException: Unsupported type 1111` Postgres data types JSON and JSONb are now mapped to String on Spark side thus they can be loaded into DF and processed on Spark side Example Postgres: ``` create table test_json (id int, value json); create table test_jsonb (id int, value jsonb); insert into test_json (id, value) values (1, '{"field1":"value1","field2":"value2","field3":[1,2,3]}'::json), (2, '{"field1":"value3","field2":"value4","field3":[4,5,6]}'::json), (3, '{"field3":"value5","field4":"value6","field3":[7,8,9]}'::json); insert into test_jsonb (id, value) values (4, '{"field1":"value1","field2":"value2","field3":[1,2,3]}'::jsonb), (5, '{"field1":"value3","field2":"value4","field3":[4,5,6]}'::jsonb), (6, '{"field3":"value5","field4":"value6","field3":[7,8,9]}'::jsonb); ``` PySpark: ``` >>> import json >>> df1 = sqlContext.read.jdbc("jdbc:postgresql://127.0.0.1:5432/test?user=testuser", "test_json") >>> df1.map(lambda x: (x.id, json.loads(x.value))).map(lambda (id, value): (id, value.get('field3'))).collect() [(1, [1, 2, 3]), (2, [4, 5, 6]), (3, [7, 8, 9])] >>> df2 = sqlContext.read.jdbc("jdbc:postgresql://127.0.0.1:5432/test?user=testuser", "test_jsonb") >>> df2.map(lambda x: (x.id, json.loads(x.value))).map(lambda (id, value): (id, value.get('field1'))).collect() [(4, u'value1'), (5, u'value3'), (6, None)] ``` Author: 0x0FFF <programmerag@gmail.com> Closes #8948 from 0x0FFF/SPARK-7869.	2015-10-07 23:12:35 -07:00
Davies Liu	075a0b6582	[SPARK-10917] [SQL] improve performance of complex type in columnar cache This PR improve the performance of complex types in columnar cache by using UnsafeProjection instead of KryoSerializer. A simple benchmark show that this PR could improve the performance of scanning a cached table with complex columns by 15x (comparing to Spark 1.5). Here is the code used to benchmark: ``` df = sc.range(1<<23).map(lambda i: Row(a=Row(b=i, c=str(i)), d=range(10), e=dict(zip(range(10), [str(i) for i in range(10)])))).toDF() df.write.parquet("table") ``` ``` df = sqlContext.read.parquet("table") df.cache() df.count() t = time.time() print df.select("*")._jdf.queryExecution().toRdd().count() print time.time() - t ``` Author: Davies Liu <davies@databricks.com> Closes #8971 from davies/complex.	2015-10-07 15:58:07 -07:00
Josh Rosen	a9ecd06149	[SPARK-10941] [SQL] Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity This patch refactors several of the Aggregate2 interfaces in order to improve code clarity. The biggest change is a refactoring of the `AggregateFunction2` class hierarchy. In the old code, we had a class named `AlgebraicAggregate` that inherited from `AggregateFunction2`, added a new set of methods, then banned the use of the inherited methods. I found this to be fairly confusing because. If you look carefully at the existing code, you'll see that subclasses of `AggregateFunction2` fall into two disjoint categories: imperative aggregation functions which directly extended `AggregateFunction2` and declarative, expression-based aggregate functions which extended `AlgebraicAggregate`. In order to make this more explicit, this patch refactors things so that `AggregateFunction2` is a sealed abstract class with two subclasses, `ImperativeAggregateFunction` and `ExpressionAggregateFunction`. The superclass, `AggregateFunction2`, now only contains methods and fields that are common to both subclasses. After making this change, I updated the various AggregationIterator classes to comply with this new naming scheme. I also performed several small renamings in the aggregate interfaces themselves in order to improve clarity and rewrote or expanded a number of comments. Author: Josh Rosen <joshrosen@databricks.com> Closes #8973 from JoshRosen/tungsten-agg-comments.	2015-10-07 13:19:49 -07:00
Davies Liu	27ecfe61f0	[SPARK-10938] [SQL] remove typeId in columnar cache This PR remove the typeId in columnar cache, it's not needed anymore, it also remove DATE and TIMESTAMP (use INT/LONG instead). Author: Davies Liu <davies@databricks.com> Closes #8989 from davies/refactor_cache.	2015-10-06 08:45:31 -07:00
Wenchen Fan	a609eb20d9	[SPARK-10934] [SQL] handle hashCode of unsafe array correctly `Murmur3_x86_32.hashUnsafeWords` only accepts word-aligned bytes, but unsafe array is not. Author: Wenchen Fan <cloud0fan@163.com> Closes #8987 from cloud-fan/hash.	2015-10-05 17:31:54 -07:00
Cheng Lian	01cd688f52	[SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC We introduced SQL option `spark.sql.parquet.followParquetFormatSpec` while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec to write Parquet files. This option defaults to `false` and is marked as a non-public option (`isPublic = false`) because we haven't finished refactored Parquet write path. The problem is, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to `spark.sql.parquet.writeLegacyFormat`, and invert its default value (the two option names have opposite meanings). Although this option is private in 1.5, we'll make it public in 1.6 after refactoring Parquet write path. So that users can decide whether to write Parquet files in standard format or legacy format. Author: Cheng Lian <lian@databricks.com> Closes #8566 from liancheng/spark-10400/deprecate-follow-parquet-format-spec.	2015-10-01 17:23:27 -07:00
Cheng Hao	4d8c7c6d1c	[SPARK-10865] [SPARK-10866] [SQL] Fix bug of ceil/floor, which should returns long instead of the Double type Floor & Ceiling function should returns Long type, rather than Double. Verified with MySQL & Hive. Author: Cheng Hao <hao.cheng@intel.com> Closes #8933 from chenghao-intel/ceiling.	2015-10-01 11:48:15 -07:00
Nathan Howell	89ea0041ae	[SPARK-9617] [SQL] Implement json_tuple This is an implementation of Hive's `json_tuple` function using Jackson Streaming. Author: Nathan Howell <nhowell@godaddy.com> Closes #7946 from NathanHowell/SPARK-9617.	2015-09-30 15:33:12 -07:00
Reynold Xin	03cca5dce2	[SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row. Author: Reynold Xin <rxin@databricks.com> Closes #8900 from rxin/SPARK-10770-1.	2015-09-30 14:36:54 -04:00
Davies Liu	ea02e5513a	[SPARK-10859] [SQL] fix stats of StringType in columnar cache The UTF8String may come from UnsafeRow, then underline buffer of it is not copied, so we should clone it in order to hold it in Stats. cc yhuai Author: Davies Liu <davies@databricks.com> Closes #8929 from davies/pushdown_string.	2015-09-28 14:40:40 -07:00
Holden Karau	8ecba3e86e	[SPARK-10720] [SQL] [JAVA] Add a java wrapper to create a dataframe from a local list of java beans Similar to SPARK-10630 it would be nice if Java users didn't have to parallelize there data explicitly (as Scala users already can skip). Issue came up in http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work Author: Holden Karau <holden@pigscanfly.ca> Closes #8879 from holdenk/SPARK-10720-add-a-java-wrapper-to-create-a-dataframe-from-a-local-list-of-java-beans.	2015-09-27 21:16:15 +01:00
Wenchen Fan	418e5e4cbd	[SPARK-10741] [SQL] Hive Query Having/OrderBy against Parquet table is not working https://issues.apache.org/jira/browse/SPARK-10741 I choose the second approach: do not change output exprIds when convert MetastoreRelation to LogicalRelation Author: Wenchen Fan <cloud0fan@163.com> Closes #8889 from cloud-fan/hot-bug.	2015-09-27 09:08:38 -07:00
Andrew Or	83f6f54d12	[SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array (round 2) This patch reverts most of the changes in a previous fix #8827. The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to reserve also the pointer array, but for now *we will simply not track the pointer array*. (Note that elsewhere we already don't track the pointer array, e.g. [here](`a18208047f/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java (L88)`)) Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff. Author: Andrew Or <andrew@databricks.com> Closes #8888 from andrewor14/dont-track-pointer-array.	2015-09-23 19:34:31 -07:00
Josh Rosen	a18208047f	[SPARK-10403] Allow UnsafeRowSerializer to work with tungsten-sort ShuffleManager This patch attempts to fix an issue where Spark SQL's UnsafeRowSerializer was incompatible with the `tungsten-sort` ShuffleManager. Author: Josh Rosen <joshrosen@databricks.com> Closes #8873 from JoshRosen/SPARK-10403.	2015-09-23 11:31:01 -07:00
Yin Huai	5aea987c90	[SPARK-10737] [SQL] When using UnsafeRows, SortMergeJoin may return wrong results https://issues.apache.org/jira/browse/SPARK-10737 Author: Yin Huai <yhuai@databricks.com> Closes #8854 from yhuai/SMJBug.	2015-09-22 13:31:35 -07:00
Wenchen Fan	5017c685f4	[SPARK-10740] [SQL] handle nondeterministic expressions correctly for set operations https://issues.apache.org/jira/browse/SPARK-10740 Author: Wenchen Fan <cloud0fan@163.com> Closes #8858 from cloud-fan/non-deter.	2015-09-22 12:14:59 -07:00
Liang-Chi Hsieh	1fcefef069	[SPARK-10446][SQL] Support to specify join type when calling join with usingColumns JIRA: https://issues.apache.org/jira/browse/SPARK-10446 Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8600 from viirya/usingcolumns_df.	2015-09-21 23:46:00 -07:00
Ewan Leith	781b21ba2a	[SPARK-10419] [SQL] Adding SQLServer support for datetimeoffset types to JdbcDialects Reading from Microsoft SQL Server over jdbc fails when the table contains datetimeoffset types. This patch registers a SQLServer JDBC Dialect that maps datetimeoffset to a String, as Microsoft suggest. Author: Ewan Leith <ewan.leith@realitymine.com> Closes #8575 from realitymine-coordinator/sqlserver.	2015-09-21 23:43:20 -07:00
Yin Huai	0494c80ef5	[SPARK-10495] [SQL] Read date values in JSON data stored by Spark 1.5.0. https://issues.apache.org/jira/browse/SPARK-10681 Author: Yin Huai <yhuai@databricks.com> Closes #8806 from yhuai/SPARK-10495.	2015-09-21 18:06:45 -07:00
Holden Karau	362539f8d9	[SPARK-10630] [SQL] Add a createDataFrame API that takes in a java list It would be nice to support creating a DataFrame directly from a Java List of Row. Author: Holden Karau <holden@pigscanfly.ca> Closes #8779 from holdenk/SPARK-10630-create-DataFrame-from-Java-List.	2015-09-21 13:33:10 -07:00
Josh Rosen	2117eea71e	[SPARK-10710] Remove ability to disable spilling in core and SQL It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`. This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling. Author: Josh Rosen <joshrosen@databricks.com> Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.	2015-09-19 21:40:21 -07:00
Andrew Or	7ff8d68cc1	[SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array When `TungstenAggregation` hits memory pressure, it switches from hash-based to sort-based aggregation in-place. However, in the process we try to allocate the pointer array for writing to the new `UnsafeExternalSorter` before actually freeing the memory from the hash map. This lead to the following exception: ``` java.io.IOException: Could not acquire 65536 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126) at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) ``` Author: Andrew Or <andrew@databricks.com> Closes #8827 from andrewor14/allocate-pointer-array.	2015-09-18 23:58:25 -07:00
Yijie Shen	c6f8135ee5	[SPARK-10539] [SQL] Project should not be pushed down through Intersect or Except #8742 Intersect and Except are both set operators and they use the all the columns to compare equality between rows. When pushing their Project parent down, the relations they based on would change, therefore not an equivalent transformation. JIRA: https://issues.apache.org/jira/browse/SPARK-10539 I added some comments based on the fix of https://github.com/apache/spark/pull/8742. Author: Yijie Shen <henry.yijieshen@gmail.com> Author: Yin Huai <yhuai@databricks.com> Closes #8823 from yhuai/fix_set_optimization.	2015-09-18 13:20:13 -07:00
Yin Huai	aad644fbe2	[SPARK-10639] [SQL] Need to convert UDAF's result from scala to sql type https://issues.apache.org/jira/browse/SPARK-10639 Author: Yin Huai <yhuai@databricks.com> Closes #8788 from yhuai/udafConversion.	2015-09-17 11:14:52 -07:00
sureshthalamati	64c29afcb7	[SPARK-9078] [SQL] Allow jdbc dialects to override the query used to check the table. Current implementation uses query with a LIMIT clause to find if table already exists. This syntax works only in some database systems. This patch changes the default query to the one that is likely to work on most databases, and adds a new method to the JdbcDialect abstract class to allow dialects to override the default query. I looked at using the JDBC meta data calls, it turns out there is no common way to find the current schema, catalog..etc. There is a new method Connection.getSchema() , but that is available only starting jdk1.7 , and existing jdbc drivers may not have implemented it. Other option was to use jdbc escape syntax clause for LIMIT, not sure on how well this supported in all the databases also. After looking at all the jdbc metadata options my conclusion was most common way is to use the simple select query with 'where 1 =0' , and allow dialects to customize as needed Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #8676 from sureshthalamati/table_exists_spark-9078.	2015-09-15 19:41:38 -07:00
Andrew Or	35a19f3357	[SPARK-10613] [SPARK-10624] [SQL] Reduce LocalNode tests dependency on SQLContext Instead of relying on `DataFrames` to verify our answers, we can just use simple arrays. This significantly simplifies the test logic for `LocalNode`s and reduces a lot of code duplicated from `SparkPlanTest`. This also fixes an additional issue [SPARK-10624](https://issues.apache.org/jira/browse/SPARK-10624) where the output of `TakeOrderedAndProjectNode` is not actually ordered. Author: Andrew Or <andrew@databricks.com> Closes #8764 from andrewor14/sql-local-tests-cleanup.	2015-09-15 17:24:32 -07:00
Josh Rosen	38700ea40c	[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop. This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Other factors contributing to this bug are the fact that we have many similarly-named identifiers that have different semantics but the same data types (e.g. attemptNumber and taskAttemptId, with inconsistent variable naming which makes them difficult to distinguish). This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code. Author: Josh Rosen <joshrosen@databricks.com> Closes #8544 from JoshRosen/SPARK-10381.	2015-09-15 17:11:21 -07:00
Andrew Or	b6e998634e	[SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions Note: this is for master branch only. The fix for branch-1.5 is at #8721. The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.: ``` (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() } ``` The cause is `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path. Author: Andrew Or <andrew@databricks.com> Closes #8710 from andrewor14/concurrent-sql-executions.	2015-09-15 16:45:47 -07:00
Liang-Chi Hsieh	841972e22c	[SPARK-10437] [SQL] Support aggregation expressions in Order By JIRA: https://issues.apache.org/jira/browse/SPARK-10437 If an expression in `SortOrder` is a resolved one, such as `count(1)`, the corresponding rule in `Analyzer` to make it work in order by will not be applied. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8599 from viirya/orderby-agg.	2015-09-15 13:33:32 -07:00
zsxwing	217e496444	[SPARK-9996] [SPARK-9997] [SQL] Add local expand and NestedLoopJoin operators This PR is in conflict with #8535 and #8573. Will update this one when they are merged. Author: zsxwing <zsxwing@gmail.com> Closes #8642 from zsxwing/expand-nest-join.	2015-09-14 15:00:27 -07:00
JihongMa	f4a22808e0	[SPARK-6548] Adding stddev to DataFrame functions Adding STDDEV support for DataFrame using 1-pass online /parallel algorithm to compute variance. Please review the code change. Author: JihongMa <linlin200605@gmail.com> Author: Jihong MA <linlin200605@gmail.com> Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com> Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local> Closes #6297 from JihongMA/SPARK-SQL.	2015-09-12 10:17:15 -07:00
Sean Owen	22730ad54d	[SPARK-10547] [TEST] Streamline / improve style of Java API tests Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order Author: Sean Owen <sowen@cloudera.com> Closes #8706 from srowen/SPARK-10547.	2015-09-12 10:40:10 +01:00
Andrew Or	c2af42b5f3	[SPARK-9990] [SQL] Local hash join follow-ups 1. Hide `LocalNodeIterator` behind the `LocalNode#asIterator` method 2. Add tests for this Author: Andrew Or <andrew@databricks.com> Closes #8708 from andrewor14/local-hash-join-follow-up.	2015-09-11 15:01:37 -07:00
zsxwing	e626ac5f5c	[SPARK-9992] [SPARK-9994] [SPARK-9998] [SQL] Implement the local TopK, sample and intersect operators This PR is in conflict with #8535. I will update this one when #8535 gets merged. Author: zsxwing <zsxwing@gmail.com> Closes #8573 from zsxwing/more-local-operators.	2015-09-11 15:00:13 -07:00
Cheng Lian	e1d7f64296	[SPARK-10472] [SQL] Fixes DataType.typeName for UDT Before this fix, `MyDenseVectorUDT.typeName` gives `mydensevecto`, which is not desirable. Author: Cheng Lian <lian@databricks.com> Closes #8640 from liancheng/spark-10472/udt-type-name.	2015-09-11 18:26:56 +08:00
zsxwing	d88abb7e21	[SPARK-9990] [SQL] Create local hash join operator This PR includes the following changes: - Add SQLConf to LocalNode - Add HashJoinNode - Add ConvertToUnsafeNode and ConvertToSafeNode.scala to test unsafe hash join. Author: zsxwing <zsxwing@gmail.com> Closes #8535 from zsxwing/SPARK-9990.	2015-09-10 12:06:49 -07:00
Cheng Hao	e048111376	[SPARK-10466] [SQL] UnsafeRow SerDe exception with data spill Data Spill with UnsafeRow causes assert failure. ``` java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ``` To reproduce that with code (thanks andrewor14): ```scala bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0 sc.parallelize(1 to 2 * 1000 * 1000, 10) .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count() ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #8635 from chenghao-intel/unsafe_spill.	2015-09-10 11:48:43 -07:00
Cheng Lian	49da38e5f7	[SPARK-10301] [SPARK-10428] [SQL] Addresses comments of PR #8583 and #8509 for master Author: Cheng Lian <lian@databricks.com> Closes #8670 from liancheng/spark-10301/address-pr-comments.	2015-09-10 11:01:08 -07:00
Liang-Chi Hsieh	45de518742	[SPARK-9730] [SQL] Add Full Outer Join support for SortMergeJoin This PR is based on #8383 , thanks to viirya JIRA: https://issues.apache.org/jira/browse/SPARK-9730 This patch adds the Full Outer Join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner. Closes #8383 Author: Liang-Chi Hsieh <viirya@appier.com> Author: Davies Liu <davies@databricks.com> Closes #8579 from davies/smj_fullouter.	2015-09-09 16:02:27 -07:00
Michael Armbrust	2143d592c8	[HOTFIX] Fix build break caused by #8494 Author: Michael Armbrust <michael@databricks.com> Closes #8659 from marmbrus/testBuildBreak.	2015-09-08 16:51:45 -07:00
Cheng Hao	d637a666d5	[SPARK-10327] [SQL] Cache Table is not working while subquery has alias in its project list ```scala import org.apache.spark.sql.hive.execution.HiveTableScan sql("select key, value, key + 1 from src").registerTempTable("abc") cacheTable("abc") val sparkPlan = sql( """select a.key, b.key, c.key from \|abc a join abc b on a.key=b.key \|join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3) // failed assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // failed ``` The actual plan is: ``` == Parsed Logical Plan == 'Project [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)] 'Join Inner, Some(('a.key = 'c.key)) 'Join Inner, Some(('a.key = 'b.key)) 'UnresolvedRelation [abc], Some(a) 'UnresolvedRelation [abc], Some(b) 'UnresolvedRelation [abc], Some(c) == Analyzed Logical Plan == key: int, key: int, key: int Project [key#14,key#61,key#66] Join Inner, Some((key#14 = key#66)) Join Inner, Some((key#14 = key#61)) Subquery a Subquery abc Project [key#14,value#15,(key#14 + 1) AS _c2#16] MetastoreRelation default, src, None Subquery b Subquery abc Project [key#61,value#62,(key#61 + 1) AS _c2#58] MetastoreRelation default, src, None Subquery c Subquery abc Project [key#66,value#67,(key#66 + 1) AS _c2#63] MetastoreRelation default, src, None == Optimized Logical Plan == Project [key#14,key#61,key#66] Join Inner, Some((key#14 = key#66)) Project [key#14,key#61] Join Inner, Some((key#14 = key#61)) Project [key#14] InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc) Project [key#61] MetastoreRelation default, src, None Project [key#66] MetastoreRelation default, src, None == Physical Plan == TungstenProject [key#14,key#61,key#66] BroadcastHashJoin [key#14], [key#66], BuildRight TungstenProject [key#14,key#61] BroadcastHashJoin [key#14], [key#61], BuildRight ConvertToUnsafe InMemoryColumnarTableScan [key#14], (InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc)) ConvertToUnsafe HiveTableScan [key#61], (MetastoreRelation default, src, None) ConvertToUnsafe HiveTableScan [key#66], (MetastoreRelation default, src, None) ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #8494 from chenghao-intel/weird_cache.	2015-09-08 16:16:50 -07:00
Wenchen Fan	5fd57955ef	[SPARK-10316] [SQL] respect nondeterministic expressions in PhysicalOperation We did a lot of special handling for non-deterministic expressions in `Optimizer`. However, `PhysicalOperation` just collects all Projects and Filters and mess it up. We should respect the operators order caused by non-deterministic expressions in `PhysicalOperation`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8486 from cloud-fan/fix.	2015-09-08 12:05:41 -07:00
Cheng Lian	bca8c072bd	[SPARK-10434] [SQL] Fixes Parquet schema of arrays that may contain null To keep full compatibility of Parquet write path with Spark 1.4, we should rename the innermost field name of arrays that may contain null from "array_element" to "array". Please refer to [SPARK-10434] [1] for more details. [1]: https://issues.apache.org/jira/browse/SPARK-10434 Author: Cheng Lian <lian@databricks.com> Closes #8586 from liancheng/spark-10434/fix-parquet-array-type.	2015-09-05 17:50:12 +08:00
Cheng Lian	6c751940ea	[HOTFIX] [SQL] Fixes compilation error Jenkins master builders are currently broken by a merge conflict between PR #8584 and PR #8155. Author: Cheng Lian <lian@databricks.com> Closes #8614 from liancheng/hotfix/fix-pr-8155-8584-conflict.	2015-09-04 22:57:52 -10:00
Yin Huai	47058ca5db	[SPARK-9925] [SQL] [TESTS] Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests This PR fix the failed test and conflict for #8155 https://issues.apache.org/jira/browse/SPARK-9925 Closes #8155 Author: Yin Huai <yhuai@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #8602 from davies/shuffle_partitions.	2015-09-04 18:58:25 -07:00
Wenchen Fan	c3c0e431a6	[SPARK-10176] [SQL] Show partially analyzed plans when checkAnswer fails to analyze This PR takes over https://github.com/apache/spark/pull/8389. This PR improves `checkAnswer` to print the partially analyzed plan in addition to the user friendly error message, in order to aid debugging failing tests. In doing so, I ran into a conflict with the various ways that we bring a SQLContext into the tests. Depending on the trait we refer to the current context as `sqlContext`, `_sqlContext`, `ctx` or `hiveContext` with access modifiers `public`, `protected` and `private` depending on the defining class. I propose we refactor as follows: 1. All tests should only refer to a `protected sqlContext` when testing general features, and `protected hiveContext` when it is a method that only exists on a `HiveContext`. 2. All tests should only import `testImplicits._` (i.e., don't import `TestHive.implicits._`) Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8584 from cloud-fan/cleanupTests.	2015-09-04 15:17:37 -07:00
Yin Huai	03f3e91ff2	[SPARK-10422] [SQL] String column in InMemoryColumnarCache needs to override clone method https://issues.apache.org/jira/browse/SPARK-10422 Author: Yin Huai <yhuai@databricks.com> Closes #8578 from yhuai/SPARK-10422.	2015-09-02 21:00:13 -07:00
Wenchen Fan	fc48307797	[SPARK-10389] [SQL] support order by non-attribute grouping expression on Aggregate For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8548 from cloud-fan/support-order-by-non-attribute.	2015-09-02 11:32:27 -07:00
Wenchen Fan	56c4c172e9	[SPARK-10034] [SQL] add regression test for Sort on Aggregate Before #8371, there was a bug for `Sort` on `Aggregate` that we can't use aggregate expressions named `_aggOrdering` and can't use more than one ordering expressions which contains aggregate functions. The reason of this bug is that: The aggregate expression in `SortOrder` never get resolved, we alias it with `_aggOrdering` and call `toAttribute` which gives us an `UnresolvedAttribute`. So actually we are referencing aggregate expression by name, not by exprId like we thought. And if there is already an aggregate expression named `_aggOrdering` or there are more than one ordering expressions having aggregate functions, we will have conflict names and can't search by name. However, after #8371 got merged, the `SortOrder`s are guaranteed to be resolved and we are always referencing aggregate expression by exprId. The Bug doesn't exist anymore and this PR add regression tests for it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8231 from cloud-fan/sort-agg.	2015-09-02 11:13:17 -07:00
Cheng Lian	391e6be0ae	[SPARK-10301] [SQL] Fixes schema merging for nested structs This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here. When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated because of several reasons: 1. Requested schema must conform to the real schema of the physical file to be read. This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231. 1. Support for schema merging. A single Parquet dataset may consist of multiple physical Parquet files come with different but compatible schemas. This means we may request for a column path that doesn't exist in a physical Parquet file. All requested column paths can be nested. For example, for a Parquet file schema ``` message root { required group f0 { required group f00 { required int32 f000; required binary f001 (UTF8); } } } ``` we may request for column paths defined in the following schema: ``` message root { required group f0 { required group f00 { required binary f001 (UTF8); required float f002; } } optional double f1; } ``` Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`. The good news is that Parquet handles non-existing column paths properly and always returns null for them. 1. The map from `StructType` to `MessageType` is a one-to-many map. This is the most unfortunate part. Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema: ``` message m0 { repeated int32 f; } ``` while parquet-avro generates another version: ``` message m1 { required group f (LIST) { repeated int32 array; } } ``` and parquet-thrift spills this: ``` message m1 { required group f (LIST) { repeated int32 f_tuple; } } ``` All of them can be mapped to the following _unique_ Catalyst schema: ``` StructType( StructField( "f", ArrayType(IntegerType, containsNull = false), nullable = false)) ``` This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases. To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`. In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, but is broken in many other cases. Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones. This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005]. In PR #8228, I tried to avoid the hard part of the problem and made a minimum change in `CatalystRowConverter` to fix SPARK-10005. However, when taking SPARK-10301 into consideration, keeping hacking `CatalystRowConverter` doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way. For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`: For a leaf column path `c` in `cs`: - if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`; - otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`; - no other column paths should exist in `ps'`. Then comes the most tedious part: > Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`? Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in parquet-format spec. They are: 1. the standard structure of nested types, and 1. cases defined in all backwards-compatibility rules for `LIST` and `MAP`. The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively. The column path selection algorithm is implemented in `clipParquetGroupFields()`. With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`. Another benefit is that, now we can also read Parquet datasets consist of files with different physical Parquet schema but share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by [this test case] [test-case]. [spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301 [spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005 [test-case]: `38644d8a45 (diff-a9b98e28ce3ae30641829dffd1173be2R26)` Author: Cheng Lian <lian@databricks.com> Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.	2015-09-01 16:52:59 +08:00
sureshthalamati	a2d5c72091	[SPARK-10170] [SQL] Add DB2 JDBC dialect support. Data frame write to DB2 database is failing because by default JDBC data source implementation is generating a table schema with DB2 unsupported data types TEXT for String, and BIT1(1) for Boolean. This patch registers DB2 JDBC Dialect that maps String, Boolean to valid DB2 data types. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #8393 from sureshthalamati/db2_dialect_spark-10170.	2015-08-31 12:39:58 -07:00
Feynman Liang	8694c3ad7d	[SPARK-10351] [SQL] Fixes UTF8String.fromAddress to handle off-heap memory CC rxin marmbrus Author: Feynman Liang <fliang@databricks.com> Closes #8523 from feynmanliang/SPARK-10351.	2015-08-30 23:12:56 -07:00
zsxwing	13f5f8ec97	[SPARK-9986] [SPARK-9991] [SPARK-9993] [SQL] Create a simple test framework for local operators This PR includes the following changes: - Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode. - Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993) Author: zsxwing <zsxwing@gmail.com> Closes #8464 from zsxwing/local-execution.	2015-08-29 18:10:44 -07:00
Michael Armbrust	5c3d16a9b9	[SPARK-10344] [SQL] Add tests for extraStrategies Actually using this API requires access to a lot of classes that we might make private by accident. I've added some tests to prevent this. Author: Michael Armbrust <michael@databricks.com> Closes #8516 from marmbrus/extraStrategiesTests.	2015-08-29 13:26:01 -07:00
Cheng Lian	24ffa85c00	[SPARK-10289] [SQL] A direct write API for testing Parquet This PR introduces a direct write API for testing Parquet. It's a DSL flavored version of the [`writeDirect` method] [1] comes with parquet-avro testing code. With this API, it's much easier to construct arbitrary Parquet structures. It's especially useful when adding regression tests for various compatibility corner cases. Sample usage of this API can be found in the new test case added in `ParquetThriftCompatibilitySuite`. [1]: https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972 Author: Cheng Lian <lian@databricks.com> Closes #8454 from liancheng/spark-10289/parquet-testing-direct-write-api.	2015-08-29 13:24:32 -07:00
Davies Liu	bb7f352393	[SPARK-10323] [SQL] fix nullability of In/InSet/ArrayContain After this PR, In/InSet/ArrayContain will return null if value is null, instead of false. They also will return null even if there is a null in the set/array. Author: Davies Liu <davies@databricks.com> Closes #8492 from davies/fix_in.	2015-08-28 14:38:20 -07:00
Josh Rosen	d3f87dc394	[SPARK-10325] Override hashCode() for public Row This commit fixes an issue where the public SQL `Row` class did not override `hashCode`, causing it to violate the hashCode() + equals() contract. To fix this, I simply ported the `hashCode` implementation from the 1.4.x version of `Row`. Author: Josh Rosen <joshrosen@databricks.com> Closes #8500 from JoshRosen/SPARK-10325 and squashes the following commits: 51ffea1 [Josh Rosen] Override hashCode() for public Row.	2015-08-28 11:51:42 -07:00
Yin Huai	b3dd569ad4	[SPARK-10287] [SQL] Fixes JSONRelation refreshing on read path https://issues.apache.org/jira/browse/SPARK-10287 After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet). Author: Yin Huai <yhuai@databricks.com> Closes #8469 from yhuai/jsonRefresh.	2015-08-27 16:11:25 -07:00
Davies Liu	7467b52ed0	[SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive) Follow the rule in Hive for decimal division. see `ac755ebe26/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java (L113)` cc chenghao-intel Author: Davies Liu <davies@databricks.com> Closes #8415 from davies/decimal_div2.	2015-08-25 15:20:24 -07:00
Davies Liu	ec89bd840a	[SPARK-10245] [SQL] Fix decimal literals with precision < scale In BigDecimal or java.math.BigDecimal, the precision could be smaller than scale, for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType require that the precision should be larger than scale, so we should use the maximum of precision and scale when inferring the schema from decimal literal. Author: Davies Liu <davies@databricks.com> Closes #8428 from davies/smaller_decimal.	2015-08-25 14:55:34 -07:00
Sean Owen	69c9c17716	[SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters Replace `JavaConversions` implicits with `JavaConverters` Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.	2015-08-25 12:33:13 +01:00
Josh Rosen	7bc9a8c624	[SPARK-10195] [SQL] Data sources Filter should not expose internal types Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties. This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0. To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions. Author: Josh Rosen <joshrosen@databricks.com> Closes #8403 from JoshRosen/datasources-internal-vs-external-types.	2015-08-25 01:06:36 -07:00
Daoyuan Wang	3c462f5d87	[SPARK-10130] [SQL] type coercion for IF should have children resolved first Type coercion for IF should have children resolved first, or we could meet unresolved exception. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #8331 from adrian-wang/spark10130.	2015-08-21 12:21:51 -07:00
Cheng Lian	85f9a61357	[SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. Actual bug fix code lies in `CatalystRowConverter.scala`. Author: Cheng Lian <lian@databricks.com> Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array.	2015-08-20 11:00:29 -07:00
Yin Huai	43e0135421	[SPARK-10092] [SQL] Multi-DB support follow up. https://issues.apache.org/jira/browse/SPARK-10092 This pr is a follow-up one for Multi-DB support. It has the following changes: * `HiveContext.refreshTable` now accepts `dbName.tableName`. * `HiveContext.analyze` now accepts `dbName.tableName`. * `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateTempTableUsing`, `CreateTempTableUsingAsSelect`, `CreateMetastoreDataSource`, and `CreateMetastoreDataSourceAsSelect` all take `TableIdentifier` instead of the string representation of table name. * When you call `saveAsTable` with a specified database, the data will be saved to the correct location. * Explicitly do not allow users to create a temporary with a specified database name (users cannot do it before). * When we save table to metastore, we also check if db name and table name can be accepted by hive (using `MetaStoreUtils.validateName`). Author: Yin Huai <yhuai@databricks.com> Closes #8324 from yhuai/saveAsTableDB.	2015-08-20 15:30:31 +08:00
hyukjinkwon	ba5f7e1842	[SPARK-10035] [SQL] Parquet filters does not process EqualNullSafe filter. As I talked with Lian, 1. I added EquelNullSafe to ParquetFilters - It uses the same equality comparison filter with EqualTo since the Parquet filter performs actually null-safe equality comparison. 2. Updated the test code (ParquetFilterSuite) - Convert catalyst.Expression to sources.Filter - Removed Cast since only Literal is picked up as a proper Filter in DataSourceStrategy - Added EquelNullSafe comparison 3. Removed deprecated createFilter for catalyst.Expression Author: hyukjinkwon <gurwls223@gmail.com> Author: 권혁진 <gurwls223@gmail.com> Closes #8275 from HyukjinKwon/master.	2015-08-20 08:13:25 +08:00
Davies Liu	1f4c4fe6df	[SPARK-10090] [SQL] fix decimal scale of division We should rounding the result of multiply/division of decimal to expected precision/scale, also check overflow. Author: Davies Liu <davies@databricks.com> Closes #8287 from davies/decimal_division.	2015-08-19 14:03:47 -07:00
Davies Liu	e05da5cb5e	[SPARK-10107] [SQL] fix NPE in format_number Author: Davies Liu <davies@databricks.com> Closes #8305 from davies/format_number.	2015-08-19 13:43:04 -07:00
Reynold Xin	1ff0580eda	[SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors & fix UDFs on complex types This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors due to the time when we were calling transformAllExpressions In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoid executor side transformations include: * (this case) Some operator constructors require state such as access to the Spark/SQL conf so doing a makeCopy on the executor can fail. * (unrelated reason for avoid executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver. This subsumes #8285. Author: Reynold Xin <rxin@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #8295 from rxin/SPARK-10096.	2015-08-18 22:08:15 -07:00
Cheng Lian	a5b5b93659	[SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky. This PR replaces Scala process API with Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fix these flaky tests. [1]: https://issues.scala-lang.org/browse/SI-8768 Author: Cheng Lian <lian@databricks.com> Closes #8168 from liancheng/spark-9939/use-java-process-api.	2015-08-19 11:21:46 +08:00
Michael Armbrust	80cb25b228	[SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation Turns out that inner classes of inner objects are referenced directly, and thus moving it will break binary compatibility. Author: Michael Armbrust <michael@databricks.com> Closes #8281 from marmbrus/binaryCompat.	2015-08-18 13:50:51 -07:00
Wenchen Fan	a4acdabb10	[SPARK-9950] [SQL] Wrong Analysis Error for grouping/aggregating on struct fields This issue has been fixed by https://github.com/apache/spark/pull/8215, this PR added regression test for it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8222 from cloud-fan/minor and squashes the following commits: 0bbfb1c [Wenchen Fan] fix style... 7e2d8d9 [Wenchen Fan] add test	2015-08-17 11:36:18 -07:00
Cheng Lian	76c155dd44	[SPARK-7837] [SQL] Avoids double closing output writers when commitTask() fails When inserting data into a `HadoopFsRelation`, if `commitTask()` of the writer container fails, `abortTask()` will be invoked. However, both `commitTask()` and `abortTask()` try to close the output writer(s). The problem is that, closing underlying writers may not be an idempotent operation. E.g., `ParquetRecordWriter.close()` throws NPE when called twice. Author: Cheng Lian <lian@databricks.com> Closes #8236 from liancheng/spark-7837/double-closing.	2015-08-18 00:59:05 +08:00
Cheng Lian	ae2370e72f	[SPARK-10005] [SQL] Fixes schema merging for nested structs In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled. For example, the schema of a Parquet file to be read can be: ``` message individual { required group f1 { optional binary f11 (utf8); } } ``` while the global schema is: ``` message global { required group f1 { optional binary f11 (utf8); optional int32 f12; } } ``` This PR fixes this issue by padding missing fields when creating actual converters. Author: Cheng Lian <lian@databricks.com> Closes #8228 from liancheng/spark-10005/nested-schema-merging.	2015-08-16 10:17:58 -07:00
Wenchen Fan	570567258b	[SPARK-9955] [SQL] correct error message for aggregate We should skip unresolved `LogicalPlan`s for `PullOutNondeterministic`, as calling `output` on unresolved `LogicalPlan` will produce confusing error message. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8203 from cloud-fan/error-msg and squashes the following commits: 1c67ca7 [Wenchen Fan] move test 7593080 [Wenchen Fan] correct error message for aggregate	2015-08-15 14:13:12 -07:00
Wenchen Fan	ec29f2034a	[SPARK-9634] [SPARK-9323] [SQL] cleanup unnecessary Aliases in LogicalPlan at the end of analysis Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolve attribute in LogicalPlan, as this alias will be trimmed if it's unnecessary. Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #8215 from marmbrus/pr/7957.	2015-08-14 20:59:54 -07:00
Yin Huai	932b24fd14	[SPARK-9949] [SQL] Fix TakeOrderedAndProject's output. https://issues.apache.org/jira/browse/SPARK-9949 Author: Yin Huai <yhuai@databricks.com> Closes #8179 from yhuai/SPARK-9949.	2015-08-14 17:35:17 -07:00
Andrew Or	ece00566e4	[SPARK-9561] Re-enable BroadcastJoinSuite We can do this now that SPARK-9580 is resolved. Author: Andrew Or <andrew@databricks.com> Closes #8208 from andrewor14/reenable-sql-tests.	2015-08-14 12:37:21 -07:00
Andrew Or	8187b3ae47	[SPARK-9580] [SQL] Replace singletons in SQL tests A fundamental limitation of the existing SQL tests is that there is simply no way to create your own `SparkContext`. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure. This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch all the SQL test files. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/8111) <!-- Reviewable:end --> Author: Andrew Or <andrew@databricks.com> Closes #8111 from andrewor14/sql-tests-refactor.	2015-08-13 17:42:01 -07:00
Davies Liu	c50f97dafd	[SPARK-9943] [SQL] deserialized UnsafeHashedRelation should be serializable When the free memory in executor goes low, the cached broadcast objects need to serialized into disk, but currently the deserialized UnsafeHashedRelation can't be serialized , fail with NPE. This PR fixes that. cc rxin Author: Davies Liu <davies@databricks.com> Closes #8174 from davies/serialize_hashed.	2015-08-13 17:35:11 -07:00
Yijie Shen	d0b18919d1	[SPARK-9927] [SQL] Revert 8049 since it's pushing wrong filter down I made a mistake in #8049 by casting literal value to attribute's data type, which would cause simply truncate the literal value and push a wrong filter down. JIRA: https://issues.apache.org/jira/browse/SPARK-9927 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8157 from yjshen/rever8049.	2015-08-13 13:33:39 +08:00
Davies Liu	7c35746c91	[SPARK-9827] [SQL] fix fd leak in UnsafeRowSerializer Currently, UnsafeRowSerializer does not close the InputStream, will cause fd leak if the InputStream has an open fd in it. TODO: the fd could still be leaked, if any items in the stream is not consumed. Currently it replies on GC to close the fd in this case. cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #8116 from davies/fd_leak.	2015-08-12 20:02:55 -07:00
Michael Armbrust	660e6dcff8	[SPARK-9449] [SQL] Include MetastoreRelation's inputFiles Author: Michael Armbrust <michael@databricks.com> Closes #8119 from marmbrus/metastoreInputFiles.	2015-08-12 17:07:29 -07:00
Andrew Or	e0110792ef	[SPARK-9747] [SQL] Avoid starving an unsafe operator in aggregation This is the sister patch to #8011, but for aggregation. In a nutshell: create the `TungstenAggregationIterator` before computing the parent partition. Internally this creates a `BytesToBytesMap` which acquires a page in the constructor as of this patch. This ensures that the aggregation operator is not starved since we reserve at least 1 page in advance. rxin yhuai Author: Andrew Or <andrew@databricks.com> Closes #8038 from andrewor14/unsafe-starve-memory-agg.	2015-08-12 10:08:35 -07:00
Cheng Lian	3ecb379430	[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions. In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`. This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` is recognized as one involving a string field instead and can be pushed down by the query optimizer. Such predicates are actually perfectly legal except that it fails the `ValidTypeMap` check. The workaround added here is relaxing `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit when adding regression test. Author: Cheng Lian <lian@databricks.com> Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.	2015-08-12 20:01:34 +08:00
Yijie Shen	9d0822455d	[SPARK-9182] [SQL] Filters are not passed through to jdbc source This PR fixes unable to push filter down to JDBC source caused by `Cast` during pattern matching. While we are comparing columns of different type, there's a big chance we need a cast on the column, therefore not match the pattern directly on Attribute and would fail to push down. Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8049 from yjshen/jdbc_pushdown.	2015-08-12 19:54:00 +08:00
Davies Liu	c3e9a120e3	[SPARK-9831] [SQL] fix serialization with empty broadcast Author: Davies Liu <davies@databricks.com> Closes #8117 from davies/fix_serialization and squashes the following commits: d21ac71 [Davies Liu] fix serialization with empty broadcast	2015-08-11 22:45:18 -07:00
Reynold Xin	afa757c98c	[SPARK-9849] [SQL] DirectParquetOutputCommitter qualified name should be backward compatible DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer qualified name. Author: Reynold Xin <rxin@databricks.com> Closes #8114 from rxin/SPARK-9849.	2015-08-11 18:08:49 -07:00
hyukjinkwon	00c02728a6	[SPARK-9814] [SQL] EqualNotNull not passing to data sources Author: hyukjinkwon <gurwls223@gmail.com> Author: 권혁진 <gurwls223@gmail.com> Closes #8096 from HyukjinKwon/master.	2015-08-11 14:04:09 -07:00
zsxwing	5831294a7a	[SPARK-9646] [SQL] Add metrics for all join and aggregate operators This PR added metrics for all join and aggregate operators. However, I found the metrics may be confusing in the following two case: 1. The iterator is not totally consumed and the metric values will be less. 2. Recreating the iterators will make metric values look bigger than the size of the input source, such as `CartesianProduct`. Author: zsxwing <zsxwing@gmail.com> Closes #8060 from zsxwing/sql-metrics and squashes the following commits: 40f3fc1 [zsxwing] Mark LongSQLMetric private[metric] to avoid using incorrectly and leak memory b1b9071 [zsxwing] Merge branch 'master' into sql-metrics 4bef25a [zsxwing] Add metrics for SortMergeOuterJoin 95ccfc6 [zsxwing] Merge branch 'master' into sql-metrics 67cb4dd [zsxwing] Add metrics for Project and TungstenProject; remove metrics from PhysicalRDD and LocalTableScan 0eb47d4 [zsxwing] Merge branch 'master' into sql-metrics dd9d932 [zsxwing] Avoid creating new Iterators 589ea26 [zsxwing] Add metrics for all join and aggregate operators	2015-08-11 12:39:13 -07:00
Reynold Xin	d378396f86	[SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform. PlatformDependent.UNSAFE is way too verbose. Author: Reynold Xin <rxin@databricks.com> Closes #8094 from rxin/SPARK-9815 and squashes the following commits: 229b603 [Reynold Xin] [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform.	2015-08-11 08:41:06 -07:00
Josh Rosen	91e9389f39	[SPARK-9729] [SPARK-9363] [SQL] Use sort merge join for left and right outer join This patch adds a new `SortMergeOuterJoin` operator that performs left and right outer joins using sort merge join. It also refactors `SortMergeJoin` in order to improve performance and code clarity. Along the way, I also performed a couple pieces of minor cleanup and optimization: - Rename the `HashJoin` physical planner rule to `EquiJoinSelection`, since it's also used for non-hash joins. - Rewrite the comment at the top of `HashJoin` to better explain the precedence for choosing join operators. - Update `JoinSuite` to use `SqlTestUtils.withConf` for changing SQLConf settings. This patch incorporates several ideas from adrian-wang's patch, #5717. Closes #5717. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7904) <!-- Reviewable:end --> Author: Josh Rosen <joshrosen@databricks.com> Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7904 from JoshRosen/outer-join-smj and squashes 1 commits.	2015-08-10 22:04:41 -07:00
Damian Guy	071bbad5db	[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists This PR is inspired by #8063 authored by dguy. Especially, testing Parquet files added here are all taken from that PR. Committer who merges this PR should attribute it to "Damian Guy <damian.guygmail.com>". ---- SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement backwards-compatibility rules defined in `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`: > This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field. One of the consequences is that, Parquet files generated by `parquet-protobuf` containing unannotated repeated fields are not correctly converted to Catalyst arrays. This PR fixes this issue by 1. Handling unannotated repeated fields in `CatalystSchemaConverter`. 2. Converting this kind of special repeated fields to Catalyst arrays in `CatalystRowConverter`. Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate actual conversion work to a child `elementConverter` and accumulates elements in an `ArrayBuffer`. Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater`. So that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields, and propagate converted array values to upstream. Author: Cheng Lian <lian@databricks.com> Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits: ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite f1c7bfd [Cheng Lian] Updates .rat-excludes 420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists	2015-08-11 12:46:33 +08:00
Reynold Xin	40ed2af587	[SPARK-9763][SQL] Minimize exposure of internal SQL classes. There are a few changes in this pull request: 1. Moved all data sources to execution.datasources, except the public JDBC APIs. 2. In order to maintain backward compatibility from 1, added a backward compatibility translation map in data source resolution. 3. Moved ui and metric package into execution. 4. Added more documentation on some internal classes. 5. Renamed DataSourceRegister.format -> shortName. 6. Added "override" modifier on shortName. 7. Removed IntSQLMetric. Author: Reynold Xin <rxin@databricks.com> Closes #8056 from rxin/SPARK-9763 and squashes the following commits: 9df4801 [Reynold Xin] Removed hardcoded name in test cases. d9babc6 [Reynold Xin] Shorten. e484419 [Reynold Xin] Removed VisibleForTesting. 171b812 [Reynold Xin] MimaExcludes. 2041389 [Reynold Xin] Compile ... 79dda42 [Reynold Xin] Compile. 0818ba3 [Reynold Xin] Removed IntSQLMetric. c46884f [Reynold Xin] Two more fixes. f9aa88d [Reynold Xin] [SPARK-9763][SQL] Minimize exposure of internal SQL classes.	2015-08-10 13:49:23 -07:00
Cheng Lian	e3fef0f9e1	[SPARK-9743] [SQL] Fixes JSONRelation refreshing PR #7696 added two `HadoopFsRelation.refresh()` calls ([this] [1], and [this] [2]) in `DataSourceStrategy` to make test case `InsertSuite.save directly to the path of a JSON table` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with large number of partitions. The reason why the original test case fails without the `refresh()` calls is that, the old JSON relation builds the base RDD with the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, writing data to that, and then read newly written data without refreshing the table. This is no long true for `HadoopFsRelation`. This PR removes those two expensive refresh calls, and moves the refresh into `JSONRelation` to fix this issue. We might want to update `HadoopFsRelation` interface to provide better support for this use case. [1]: `ebfd91c542/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (L63)` [2]: `ebfd91c542/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (L91)` Author: Cheng Lian <lian@databricks.com> Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits: ec1957d [Cheng Lian] Fixes JSONRelation refreshing	2015-08-10 09:07:08 -07:00
Josh Rosen	23cf5af08d	[SPARK-9703] [SQL] Refactor EnsureRequirements to avoid certain unnecessary shuffles This pull request refactors the `EnsureRequirements` planning rule in order to avoid the addition of certain unnecessary shuffles. As an example of how unnecessary shuffles can occur, consider SortMergeJoin, which requires clustered distribution and sorted ordering of its children's input rows. Say that both of SMJ's children produce unsorted output but are both SinglePartition. In this case, we will need to inject sort operators but should not need to inject Exchanges. Unfortunately, it looks like the EnsureRequirements unnecessarily repartitions using a hash partitioning. This patch solves this problem by refactoring `EnsureRequirements` to properly implement the `compatibleWith` checks that were broken in earlier implementations. See the significant inline comments for a better description of how this works. The majority of this PR is new comments and test cases, with few actual changes to the code. Author: Josh Rosen <joshrosen@databricks.com> Closes #7988 from JoshRosen/exchange-fixes and squashes the following commits: 38006e7 [Josh Rosen] Rewrite EnsureRequirements _yet again_ to make things even simpler 0983f75 [Josh Rosen] More guarantees vs. compatibleWith cleanup; delete BroadcastPartitioning. 8784bd9 [Josh Rosen] Giant comment explaining compatibleWith vs. guarantees 1307c50 [Josh Rosen] Update conditions for requiring child compatibility. 18cddeb [Josh Rosen] Rename DummyPlan to DummySparkPlan. 2c7e126 [Josh Rosen] Merge remote-tracking branch 'origin/master' into exchange-fixes fee65c4 [Josh Rosen] Further refinement to comments / reasoning 642b0bb [Josh Rosen] Further expand comment / reasoning 06aba0c [Josh Rosen] Add more comments 8dbc845 [Josh Rosen] Add even more tests. 4f08278 [Josh Rosen] Fix the test by adding the compatibility check to EnsureRequirements a1c12b9 [Josh Rosen] Add failing test to demonstrate allCompatible bug 0725a34 [Josh Rosen] Small assertion cleanup. 5172ac5 [Josh Rosen] Add test for requiresChildrenToProduceSameNumberOfPartitions. 2e0f33a [Josh Rosen] Write a more generic test for EnsureRequirements. 752b8de [Josh Rosen] style fix c628daf [Josh Rosen] Revert accidental ExchangeSuite change. c9fb231 [Josh Rosen] Rewrite exchange to fix better handle this case. adcc742 [Josh Rosen] Move test to PlannerSuite. 0675956 [Josh Rosen] Preserving ordering and partitioning in row format converters also does not help. cc5669c [Josh Rosen] Adding outputPartitioning to Repartition does not fix the test. 2dfc648 [Josh Rosen] Add failing test illustrating bad exchange planning.	2015-08-09 14:26:01 -07:00
Yijie Shen	68ccc6e184	[SPARK-8930] [SQL] Throw a AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8057 from yjshen/explode_star and squashes the following commits: eae181d [Yijie Shen] change explaination message 54c9d11 [Yijie Shen] meaning message for * in explode	2015-08-09 11:44:51 -07:00
Reynold Xin	e9c36938ba	[SPARK-9752][SQL] Support UnsafeRow in Sample operator. In order for this to work, I had to disable gap sampling. Author: Reynold Xin <rxin@databricks.com> Closes #8040 from rxin/SPARK-9752 and squashes the following commits: f9e248c [Reynold Xin] Fix the test case for real this time. adbccb3 [Reynold Xin] Fixed test case. 589fb23 [Reynold Xin] Merge branch 'SPARK-9752' of github.com:rxin/spark into SPARK-9752 55ccddc [Reynold Xin] Fixed core test. 78fa895 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator. c9e7112 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator.	2015-08-09 10:58:36 -07:00
Joseph Batchik	a3aec918be	[SPARK-9486][SQL] Add data source aliasing for external packages Users currently have to provide the full class name for external data sources, like: `sqlContext.read.format("com.databricks.spark.avro").load(path)` This allows external data source packages to register themselves using a Service Loader so that they can add custom alias like: `sqlContext.read.format("avro").load(path)` This makes it so that using external data source packages uses the same format as the internal data sources like parquet, json, etc. Author: Joseph Batchik <joseph.batchik@cloudera.com> Author: Joseph Batchik <josephbatchik@gmail.com> Closes #7802 from JDrit/service_loader and squashes the following commits: 49a01ec [Joseph Batchik] fixed a couple of format / error bugs e5e93b2 [Joseph Batchik] modified rat file to only excluded added services 72b349a [Joseph Batchik] fixed error with orc data source actually 9f93ea7 [Joseph Batchik] fixed error with orc data source 87b7f1c [Joseph Batchik] fixed typo 101cd22 [Joseph Batchik] removing unneeded changes 8f3cf43 [Joseph Batchik] merged in changes b63d337 [Joseph Batchik] merged in master 95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves 74db85e [Joseph Batchik] reformatted class loader ac2270d [Joseph Batchik] removing some added test a6926db [Joseph Batchik] added test cases for data source loader 208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources 946186e [Joseph Batchik] started working on service loader	2015-08-08 11:03:01 -07:00
Wenchen Fan	106c0789d8	[SPARK-9738] [SQL] remove FromUnsafe and add its codegen version to GenerateSafe In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert nexted unsafe data like array/map/struct to safe versions. It's a quick solution and we already have `GenerateSafe` to do the conversion which is codegened. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8029 from cloud-fan/from-unsafe and squashes the following commits: ed40d8f [Wenchen Fan] add the copy back a93fd4b [Wenchen Fan] cogengen FromUnsafe	2015-08-08 08:33:14 -07:00
Reynold Xin	998f4ff94d	[SPARK-9754][SQL] Remove TypeCheck in debug package. TypeCheck no longer applies in the new "Tungsten" world. Author: Reynold Xin <rxin@databricks.com> Closes #8043 from rxin/SPARK-9754 and squashes the following commits: 4ec471e [Reynold Xin] [SPARK-9754][SQL] Remove TypeCheck in debug package.	2015-08-07 19:09:28 -07:00
Michael Armbrust	49702bd738	[SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions Previously, we would open a new file for each new dynamic written out using `HadoopFsRelation`. For formats like parquet this is very costly due to the buffers required to get good compression. In this PR I refactor the code allowing us to fall back on an external sort when many partitions are seen. As such each task will open no more than `spark.sql.sources.maxFiles` files. I also did the following cleanup: - Instead of keying the file HashMap on an expensive to compute string representation of the partition, we now use a fairly cheap UnsafeProjection that avoids heap allocations. - The control flow for instantiating and invoking a writer container has been simplified. Now instead of switching in two places based on the use of partitioning, the specific writer container must implement a single method `writeRows` that is invoked using `runJob`. - `InternalOutputWriter` has been removed. Instead we have a `private[sql]` method `writeInternal` that converts and calls the public method. This method can be overridden by internal datasources to avoid the conversion. This change remove a lot of code duplication and per-row `asInstanceOf` checks. - `commands.scala` has been split up. Author: Michael Armbrust <michael@databricks.com> Closes #8010 from marmbrus/fsWriting and squashes the following commits: 00804fe [Michael Armbrust] use shuffleMemoryManager.pageSizeBytes 775cc49 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into fsWriting 17b690e [Michael Armbrust] remove comment 40f0372 [Michael Armbrust] address comments f5675bd [Michael Armbrust] char -> string 7e2d0a4 [Michael Armbrust] make sure we close current writer 8100100 [Michael Armbrust] delete empty commands.scala 71cc717 [Michael Armbrust] update comment 8ec75ac [Michael Armbrust] [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions	2015-08-07 16:24:50 -07:00
Andrew Or	881548ab20	[SPARK-9674] Re-enable ignored test in SQLQuerySuite The original code that this test tests is removed in `9270bd06fd`. It was ignored shortly before that so we never caught it. This patch re-enables the test and adds the code necessary to make it pass. JoshRosen yhuai Author: Andrew Or <andrew@databricks.com> Closes #8015 from andrewor14/SPARK-9674 and squashes the following commits: 225eac2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into SPARK-9674 8c24209 [Andrew Or] Fix NPE e541d64 [Andrew Or] Track aggregation memory for both sort and hash 0be3a42 [Andrew Or] Fix test	2015-08-07 14:20:13 -07:00
Reynold Xin	05d04e10a8	[SPARK-9733][SQL] Improve physical plan explain for data sources All data sources show up as "PhysicalRDD" in physical plan explain. It'd be better if we can show the name of the data source. Without this patch: ``` == Physical Plan == NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false)) Exchange hashpartitioning(date#0,cat#1) NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false)) PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at ``` With this patch: ``` == Physical Plan == TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)] Exchange hashpartitioning(date#0,cat#1) TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)] ConvertToUnsafe Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2] ``` Author: Reynold Xin <rxin@databricks.com> Closes #8024 from rxin/SPARK-9733 and squashes the following commits: 811b90e [Reynold Xin] Fixed Python test case. 52cab77 [Reynold Xin] Cast. eea9ccc [Reynold Xin] Fix test case. fcecb22 [Reynold Xin] [SPARK-9733][SQL] Improve explain message for data source scan node.	2015-08-07 13:41:45 -07:00
Reynold Xin	76eaa70183	[SPARK-9674][SPARK-9667] Remove SparkSqlSerializer2 It is now subsumed by various Tungsten operators. Author: Reynold Xin <rxin@databricks.com> Closes #7981 from rxin/SPARK-9674 and squashes the following commits: 144f96e [Reynold Xin] Re-enable test 58b7332 [Reynold Xin] Disable failing list. fb797e3 [Reynold Xin] Match all UDTs. be9f243 [Reynold Xin] Updated if. 71fc99c [Reynold Xin] [SPARK-9674][SPARK-9667] Remove GeneratedAggregate & SparkSqlSerializer2.	2015-08-07 11:02:53 -07:00
zsxwing	ebfd91c542	[SPARK-9467][SQL]Add SQLMetric to specialize accumulators to avoid boxing This PR adds SQLMetric/SQLMetricParam/SQLMetricValue to specialize accumulators to avoid boxing. All SQL metrics should use these classes rather than `Accumulator`. Author: zsxwing <zsxwing@gmail.com> Closes #7996 from zsxwing/sql-accu and squashes the following commits: 14a5f0a [zsxwing] Address comments 367ca23 [zsxwing] Use localValue directly to avoid changing Accumulable 42f50c3 [zsxwing] Add SQLMetric to specialize accumulators to avoid boxing	2015-08-07 00:09:58 -07:00
Wenchen Fan	e57d6b5613	[SPARK-9683] [SQL] copy UTF8String when convert unsafe array/map to safe When we convert unsafe row to safe row, we will do copy if the column is struct or string type. However, the string inside unsafe array/map are not copied, which may cause problems. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7990 from cloud-fan/copy and squashes the following commits: c13d1e3 [Wenchen Fan] change test name fe36294 [Wenchen Fan] we should deep copy UTF8String when convert unsafe row to safe row	2015-08-07 00:00:43 -07:00
Reynold Xin	4309262ec9	[SPARK-9700] Pick default page size more intelligently. Previously, we use 64MB as the default page size, which was way too big for a lot of Spark applications (especially for single node). This patch changes it so that the default page size, if unset by the user, is determined by the number of cores available and the total execution memory available. Author: Reynold Xin <rxin@databricks.com> Closes #8012 from rxin/pagesize and squashes the following commits: 16f4756 [Reynold Xin] Fixed failing test. 5afd570 [Reynold Xin] private... 0d5fb98 [Reynold Xin] Update default value. 674a6cd [Reynold Xin] Address review feedback. dc00e05 [Reynold Xin] Merge with master. 73ebdb6 [Reynold Xin] [SPARK-9700] Pick default page size more intelligently.	2015-08-06 23:18:29 -07:00
Michael Armbrust	0867b23c74	[SPARK-9650][SQL] Fix quoting behavior on interpolated column names Make sure that `$"column"` is consistent with other methods with respect to backticks. Adds a bunch of tests for various ways of constructing columns. Author: Michael Armbrust <michael@databricks.com> Closes #7969 from marmbrus/namesWithDots and squashes the following commits: 53ef3d7 [Michael Armbrust] [SPARK-9650][SQL] Fix quoting behavior on interpolated column names 2bf7a92 [Michael Armbrust] WIP	2015-08-06 17:31:16 -07:00
Yin Huai	3504bf3aa9	[SPARK-9630] [SQL] Clean up new aggregate operators (SPARK-9240 follow up) This is the followup of https://github.com/apache/spark/pull/7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it only work with `UnsafeRow`. Also, I add a `TungstenAggregate` that uses `TungstenAggregationIterator` and make `SortBasedAggregate` (renamed from `SortBasedAggregate`) only works with `SafeRow`. Author: Yin Huai <yhuai@databricks.com> Closes #7954 from yhuai/agg-followUp and squashes the following commits: 4d2f4fc [Yin Huai] Add comments and free map. 0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process. 91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to TungstenAggregateIteraotr and make it only work with UnsafeRow.	2015-08-06 15:04:44 -07:00
Wenchen Fan	1f62f104c7	[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info This re-applies #7955, which was reverted due to a race condition to fix build breaking. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Reynold Xin <rxin@databricks.com> Closes #8002 from rxin/InternalRow-toSeq and squashes the following commits: 332416a [Reynold Xin] Merge pull request #7955 from cloud-fan/toSeq 21665e2 [Wenchen Fan] fix hive again... 4addf29 [Wenchen Fan] fix hive bc16c59 [Wenchen Fan] minor fix 33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq 3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow	2015-08-06 13:11:59 -07:00
Davies Liu	2eca46a17a	Revert "[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info" This reverts commit `6e009cb9c4`.	2015-08-06 11:15:37 -07:00
Wenchen Fan	6e009cb9c4	[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7955 from cloud-fan/toSeq and squashes the following commits: 21665e2 [Wenchen Fan] fix hive again... 4addf29 [Wenchen Fan] fix hive bc16c59 [Wenchen Fan] minor fix 33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq 3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow	2015-08-06 10:40:54 -07:00
Reynold Xin	5e1b0ef079	[SPARK-9659][SQL] Rename inSet to isin to match Pandas function. Inspiration drawn from this blog post: https://lab.getbase.com/pandarize-spark-dataframes/ Author: Reynold Xin <rxin@databricks.com> Closes #7977 from rxin/isin and squashes the following commits: 9b1d3d6 [Reynold Xin] Added return. 2197d37 [Reynold Xin] Fixed test case. 7c1b6cf [Reynold Xin] Import warnings. 4f4a35d [Reynold Xin] [SPARK-9659][SQL] Rename inSet to isin to match Pandas function.	2015-08-06 10:39:16 -07:00
Burak Yavuz	98e69467d4	[SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten In short: 1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map correspond to the very last element observed in the partition, when the elements are strings. 2- Merging two partitions had a bug: Existing behavior with size 3 Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4) Partition B -> Map(4 -> 25) Result -> Map() Correct Behavior: Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4) Partition B -> Map(4 -> 25) Result -> Map(3 -> 1, 4 -> 22) cc mengxr rxin JoshRosen Author: Burak Yavuz <brkyvz@gmail.com> Closes #7945 from brkyvz/freq-fix and squashes the following commits: 07fa001 [Burak Yavuz] address 2 1dc61a8 [Burak Yavuz] address 1 506753e [Burak Yavuz] fixed and added reg test 47bfd50 [Burak Yavuz] pushing	2015-08-06 10:29:40 -07:00
Davies Liu	5b965d64ee	[SPARK-9644] [SQL] Support update DecimalType with precision > 18 in UnsafeRow In order to support update a varlength (actually fixed length) object, the space should be preserved even it's null. And, we can't call setNullAt(i) for it anymore, we because setNullAt(i) will remove the offset of the preserved space, should call setDecimal(i, null, precision) instead. After this, we can do hash based aggregation on DecimalType with precision > 18. In a tests, this could decrease the end-to-end run time of aggregation query from 37 seconds (sort based) to 24 seconds (hash based). cc rxin Author: Davies Liu <davies@databricks.com> Closes #7978 from davies/update_decimal and squashes the following commits: bed8100 [Davies Liu] isSettable -> isMutable 923c9eb [Davies Liu] address comments and fix bug 385891d [Davies Liu] Merge branch 'master' of github.com:apache/spark into update_decimal 36a1872 [Davies Liu] fix tests cd6c524 [Davies Liu] support set decimal with precision > 18	2015-08-06 09:10:57 -07:00
zhichao.li	aead18ffca	[SPARK-8266] [SQL] add function translate ![translate](http://www.w3resource.com/PostgreSQL/postgresql-translate-function.png) Author: zhichao.li <zhichao.li@intel.com> Closes #7709 from zhichao-li/translate and squashes the following commits: 9418088 [zhichao.li] refine checking condition f2ab77a [zhichao.li] clone string 9d88f2d [zhichao.li] fix indent 6aa2962 [zhichao.li] style e575ead [zhichao.li] add python api 9d4bab0 [zhichao.li] add special case for fodable and refactor unittest eda7ad6 [zhichao.li] update to use TernaryExpression cdfd4be [zhichao.li] add function translate	2015-08-06 09:02:30 -07:00
Reynold Xin	9270bd06fd	[SPARK-9674][SQL] Remove GeneratedAggregate. The new aggregate replaces the old GeneratedAggregate. Author: Reynold Xin <rxin@databricks.com> Closes #7983 from rxin/remove-generated-agg and squashes the following commits: 8334aae [Reynold Xin] [SPARK-9674][SQL] Remove GeneratedAggregate.	2015-08-05 21:50:14 -07:00
Yin Huai	4581badbc8	[SPARK-9611] [SQL] Fixes a few corner cases when we spill a UnsafeFixedWidthAggregationMap This PR has the following three small fixes. 1. UnsafeKVExternalSorter does not use 0 as the initialSize to create an UnsafeInMemorySorter if its BytesToBytesMap is empty. 2. We will not not spill a InMemorySorter if it is empty. 3. We will not add a SpillReader to a SpillMerger if this SpillReader is empty. JIRA: https://issues.apache.org/jira/browse/SPARK-9611 Author: Yin Huai <yhuai@databricks.com> Closes #7948 from yhuai/unsafeEmptyMap and squashes the following commits: 9727abe [Yin Huai] Address Josh's comments. 34b6f76 [Yin Huai] 1. UnsafeKVExternalSorter does not use 0 as the initialSize to create an UnsafeInMemorySorter if its BytesToBytesMap is empty. 2. Do not spill a InMemorySorter if it is empty. 3. Do not add spill to SpillMerger if this spill is empty.	2015-08-05 19:19:09 -07:00
Josh Rosen	9c878923db	[SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use newOrdering in SMJ This patches renames `RowOrdering` to `InterpretedOrdering` and updates SortMergeJoin to use the `SparkPlan` methods for constructing its ordering so that it may benefit from codegen. This is an updated version of #7408. Author: Josh Rosen <joshrosen@databricks.com> Closes #7973 from JoshRosen/SPARK-9054 and squashes the following commits: e610655 [Josh Rosen] Add comment RE: Ascending ordering 34b8e0c [Josh Rosen] Import ordering be19a0f [Josh Rosen] [SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use newOrdering in more places.	2015-08-05 16:33:42 -07:00
Liang-Chi Hsieh	e1e05873fc	[SPARK-9403] [SQL] Add codegen support in In and InSet This continues tarekauel's work in #7778. Author: Liang-Chi Hsieh <viirya@appier.com> Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7893 from viirya/codegen_in and squashes the following commits: 81ff97b [Liang-Chi Hsieh] For comments. 47761c6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in cf4bf41 [Liang-Chi Hsieh] For comments. f532b3c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in 446bbcd [Liang-Chi Hsieh] Fix bug. b3d0ab4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in 4610eff [Liang-Chi Hsieh] Relax the types of references and update optimizer test. 224f18e [Liang-Chi Hsieh] Beef up the test cases for In and InSet to include all primitive data types. 86dc8aa [Liang-Chi Hsieh] Only convert In to InSet when the number of items in set is more than the threshold. b7ded7e [Tarek Auel] [SPARK-9403][SQL] codeGen in / inSet	2015-08-05 11:38:56 -07:00
Michael Armbrust	23d982204b	[SPARK-9141] [SQL] Remove project collapsing from DataFrame API Currently we collapse successive projections that are added by `withColumn`. However, this optimization violates the constraint that adding nodes to a plan will never change its analyzed form and thus breaks caching. Instead of doing early optimization, in this PR I just fix some low-hanging slowness in the analyzer. In particular, I add a mechanism for skipping already analyzed subplans, `resolveOperators` and `resolveExpression`. Since trees are generally immutable after construction, it's safe to annotate a plan as already analyzed as any transformation will create a new tree with this bit no longer set. Together these result in a faster analyzer than before, even with added timing instrumentation. ``` Original Code [info] 3430ms [info] 2205ms [info] 1973ms [info] 1982ms [info] 1916ms Without Project Collapsing in DataFrame [info] 44610ms [info] 45977ms [info] 46423ms [info] 46306ms [info] 54723ms With analyzer optimizations [info] 6394ms [info] 4630ms [info] 4388ms [info] 4093ms [info] 4113ms With resolveOperators [info] 2495ms [info] 1380ms [info] 1685ms [info] 1414ms [info] 1240ms ``` Author: Michael Armbrust <michael@databricks.com> Closes #7920 from marmbrus/withColumnCache and squashes the following commits: 2145031 [Michael Armbrust] fix hive udfs tests 5a5a525 [Michael Armbrust] remove wrong comment 7a507d5 [Michael Armbrust] style b59d710 [Michael Armbrust] revert small change 1fa5949 [Michael Armbrust] move logic into LogicalPlan, add tests 0e2cb43 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into withColumnCache c926e24 [Michael Armbrust] naming e593a2d [Michael Armbrust] style f5a929e [Michael Armbrust] [SPARK-9141][SQL] Remove project collapsing from DataFrame API 38b1c83 [Michael Armbrust] WIP	2015-08-05 09:01:45 -07:00
Cheng Hao	519cf6d3f7	[SPARK-9381] [SQL] Migrate JSON data source to the new partitioning data source Support partitioning for the JSON data source. Still 2 open issues for the `HadoopFsRelation` - `refresh()` will invoke the `discoveryPartition()`, which will auto infer the data type for the partition columns, and maybe conflict with the given partition columns. (TODO enable `HadoopFsRelationSuite.Partition column type casting" - When insert data into a cached HadoopFsRelation based table, we need to invalidate the cache after the insertion (TODO enable `InsertSuite.Caching`) Author: Cheng Hao <hao.cheng@intel.com> Closes #7696 from chenghao-intel/json and squashes the following commits: d90b104 [Cheng Hao] revert the change for JacksonGenerator.apply 307111d [Cheng Hao] fix bug in the unit test 8738c8a [Cheng Hao] fix bug in unit testing 35f2cde [Cheng Hao] support partition for json format	2015-08-05 22:35:55 +08:00
zsxwing	1b0317f64c	[SPARK-8861][SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab This PR includes the following changes: ### SPARK-8862: Add basic instrumentation to each SparkPlan operator A SparkPlan can override `def accumulators: Map[String, Accumulator[_]]` to expose its metrics that can be displayed in UI. The UI will use them to track the updates and show them in the web page in real-time. ### SparkSQLExecution and SQLSparkListener `SparkSQLExecution.withNewExecutionId` will set `spark.sql.execution.id` to the local properties so that we can use it to track all jobs that belong to the same query. SQLSparkListener is a listener to track all accumulator updates of all tasks for a query. It receives them from heartbeats can the UI can query them in real-time. When running a query, `SQLSparkListener.onExecutionStart` will be called. When a query is finished, `SQLSparkListener.onExecutionEnd` will be called. And the Spark jobs with the same execution id will be tracked and stored with this query. `SQLSparkListener` has to store all accumulator updates for tasks separately. When a task fails and starts to retry, we need to drop the old accumulator updates. Because we can not revert our changes to an accumulator, we have to maintain these accumulator updates by ourselves so as to drop accumulator updates for a failed task. ### SPARK-8862: A new SQL tab Includes two pages: #### A page for all DataFrame/SQL queries It will show the running, completed and failed queries in 3 tables. It also displays the jobs and their links for a query in each row. #### A detail page for a DataFrame/SQL query In this page, it also shows the SparkPlan metrics in real-time. Run a long-running query, such as ``` val testData = sc.parallelize((1 to 1000000).map(i => (i, i.toString))).toDF() testData.select($"_1").filter($"_1" < 1000).foreach(_ => Thread.sleep(60)) ``` and you will see the metrics keep updating in real-time. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7774) <!-- Reviewable:end --> Author: zsxwing <zsxwing@gmail.com> Closes #7774 from zsxwing/sql-ui and squashes the following commits: 5a2bc99 [zsxwing] Remove UISeleniumSuite and its dependency 57d4cd2 [zsxwing] Use VisibleForTesting annotation cc1c736 [zsxwing] Add SparkPlan.trackNumOfRowsEnabled to make subclasses easy to track the number of rows; fix the issue that the "save" action cannot collect metrics 3771ab0 [zsxwing] Register SQL metrics accmulators 3a101c0 [zsxwing] Change prepareCalled's type to AtomicBoolean for thread-safety b8d5605 [zsxwing] Make prepare idempotent; call children's prepare in SparkPlan.prepare; change doPrepare to def 4ed11a1 [zsxwing] var -> val 332639c [zsxwing] Ignore UISeleniumSuite and SQLListenerSuite."no memory leak" because of SPARK-9580 bb52359 [zsxwing] Address other commens in SQLListener c4d0f5d [zsxwing] Move newPredicate out of the iterator loop 957473c [zsxwing] Move STATIC_RESOURCE_DIR to object SQLTab 7ab4816 [zsxwing] Make SparkPlan accumulator API private[sql] dae195e [zsxwing] Fix the code style and comments 3a66207 [zsxwing] Ignore irrelevant accumulators b8484a1 [zsxwing] Merge branch 'master' into sql-ui 9406592 [zsxwing] Implement the SparkPlan viz 4ebce68 [zsxwing] Add SparkPlan.prepare to support BroadcastHashJoin to run background work in parallel ca1811f [zsxwing] Merge branch 'master' into sql-ui fef6fc6 [zsxwing] Fix a corner case 25f335c [zsxwing] Fix the code style 6eae828 [zsxwing] SQLSparkListener -> SQLListener; SparkSQLExecutionUIData -> SQLExecutionUIData; SparkSQLExecution -> SQLExecution 822af75 [zsxwing] Add SQLSparkListenerSuite and fix the issue about onExecutionEnd and onJobEnd 6be626f [zsxwing] Add UISeleniumSuite to test UI d02a24d [zsxwing] Make ExecutionPage private 23abf73 [zsxwing] [SPARK-8862][SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab	2015-08-05 01:51:22 -07:00
Emiliano Leporati	1d1a76c8c5	[SPARK-9581][SQL] Add unit test for JSON UDT This brings #7416 up-to-date by drubbo. Author: Emiliano Leporati <emiliano.leporati@gmail.com> Author: Reynold Xin <rxin@databricks.com> Closes #7917 from rxin/udt-json-test and squashes the following commits: 93e3954 [Reynold Xin] Fix test. 7035308 [Reynold Xin] Merge pull request #7416 from drubbo/master b5bcd94 [Emiliano Leporati] removed unneded case in MyDenseVector::equals 508a399 [Emiliano Leporati] Merge remote branch 'upstream/master' 7569e42 [Emiliano Leporati] using checkAnswer 62daccd [Emiliano Leporati] added coverage for UDTs in JSON RDDs	2015-08-05 00:42:08 -07:00
Davies Liu	781c8d71a0	[SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType Let Decimal carry the correct precision and scale with DecimalType. cc rxin yhuai Author: Davies Liu <davies@databricks.com> Closes #7925 from davies/decimal_scale and squashes the following commits: e19701a [Davies Liu] some tweaks 57d78d2 [Davies Liu] fix tests 5d5bc69 [Davies Liu] match precision and scale with DecimalType	2015-08-04 23:12:49 -07:00
Pedro Rodriguez	d34548587a	[SPARK-8231] [SQL] Add array_contains This PR is based on #7580 , thanks to EntilZha PR for work on https://issues.apache.org/jira/browse/SPARK-8231 Currently, I have an initial implementation for contains. Based on discussion on JIRA, it should behave same as Hive: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArrayContains.java#L102-L128 Main points are: 1. If the array is empty, null, or the value is null, return false 2. If there is a type mismatch, throw error 3. If comparison is not supported, throw error Closes #7580 Author: Pedro Rodriguez <prodriguez@trulia.com> Author: Pedro Rodriguez <ski.rodriguez@gmail.com> Author: Davies Liu <davies@databricks.com> Closes #7949 from davies/array_contains and squashes the following commits: d3c08bc [Davies Liu] use foreach() to avoid copy bc3d1fe [Davies Liu] fix array_contains 719e37d [Davies Liu] Merge branch 'master' of github.com:apache/spark into array_contains e352cf9 [Pedro Rodriguez] fixed diff from master 4d5b0ff [Pedro Rodriguez] added docs and another type check ffc0591 [Pedro Rodriguez] fixed unit test 7a22deb [Pedro Rodriguez] Changed test to use strings instead of long/ints which are different between python 2 an 3 b5ffae8 [Pedro Rodriguez] fixed pyspark test 4e7dce3 [Pedro Rodriguez] added more docs 3082399 [Pedro Rodriguez] fixed unit test 46f9789 [Pedro Rodriguez] reverted change d3ca013 [Pedro Rodriguez] Fixed type checking to match hive behavior, then added tests to insure this 8528027 [Pedro Rodriguez] added more tests 686e029 [Pedro Rodriguez] fix scala style d262e9d [Pedro Rodriguez] reworked type checking code and added more tests 2517a58 [Pedro Rodriguez] removed unused import 28b4f71 [Pedro Rodriguez] fixed bug with type conversions and re-added tests 12f8795 [Pedro Rodriguez] fix scala style checks e8a20a9 [Pedro Rodriguez] added python df (broken atm) 65b562c [Pedro Rodriguez] made array_contains nullable false 33b45aa [Pedro Rodriguez] reordered test 9623c64 [Pedro Rodriguez] fixed test 4b4425b [Pedro Rodriguez] changed Arrays in tests to Seqs 72cb4b1 [Pedro Rodriguez] added checkInputTypes and docs 69c46fb [Pedro Rodriguez] added tests and codegen 9e0bfc4 [Pedro Rodriguez] initial attempt at implementation	2015-08-04 22:34:02 -07:00
Wenchen Fan	7c8fc1f7cb	[SPARK-9598][SQL] do not expose generic getter in internal row Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7932 from cloud-fan/generic-getter and squashes the following commits: c60de4c [Wenchen Fan] do not expose generic getter in internal row	2015-08-04 17:05:19 -07:00
Josh Rosen	ab8ee1a3b9	[SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages. Author: Josh Rosen <joshrosen@databricks.com> Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits: 967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 948c344 [Josh Rosen] Add large records tests for KV sorter. 3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method 380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite a49baef [Josh Rosen] Address initial round of review comments 3edb931 [Josh Rosen] Remove accidentally-committed debug statements. 2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.	2015-08-04 14:42:11 -07:00
Davies Liu	73dedb589d	[SPARK-8246] [SQL] Implement get_json_object This is based on #7485 , thanks to NathanHowell Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead. Author: Davies Liu <davies@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Nathan Howell <nhowell@godaddy.com> Closes #7901 from davies/get_json_object and squashes the following commits: 3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object 98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala c818519 [Yin Huai] new results. 18ce26b [Davies Liu] fix tests 6ac29fb [Yin Huai] Golden files. 25eebef [Davies Liu] use HiveQuerySuite e0ac6ec [Yin Huai] Golden answer files. 940c060 [Davies Liu] tweat code style 44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object 9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element 8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object	2015-08-04 09:07:09 -07:00
Tarek Auel	b1f88a38d5	[SPARK-8244] [SQL] string function: find in set This PR is based on #7186 (just fix the conflict), thanks to tarekauel . find_in_set(string str, string strList): int Returns the first occurance of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. Closes #7186 Author: Tarek Auel <tarek.auel@googlemail.com> Author: Davies Liu <davies@databricks.com> Closes #7900 from davies/find_in_set and squashes the following commits: 4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility 1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8 0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244 208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix 71b2e69 [Tarek Auel] [SPARK-8244] find_in_set 66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244 61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison 4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244 e3b20c8 [Tarek Auel] [SPARK-8244] added type check 1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet	2015-08-04 08:59:42 -07:00
Sean Owen	76d74090d6	[SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. I'll explain several of the changes inline in comments. Author: Sean Owen <sowen@cloudera.com> Closes #7862 from srowen/SPARK-9534 and squashes the following commits: ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.	2015-08-04 12:02:26 +01:00
Steve Loughran	a2409d1c8e	[SPARK-8064] [SQL] Build against Hive 1.2.1 Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork. Tests not run yet: that's what the machines are for Author: Steve Loughran <stevel@hortonworks.com> Author: Cheng Lian <lian@databricks.com> Author: Michael Armbrust <michael@databricks.com> Author: Patrick Wendell <patrick@databricks.com> Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits: 7556d85 [Cheng Lian] Updates .q files and corresponding golden files ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002 6a92bb0 [Cheng Lian] Overrides HiveConf time vars dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe 0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header... fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark 7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar 376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration 2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically. 6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import da310dc [Michael Armbrust] Fixes for Hive tests. a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete 7404f34 [Patrick Wendell] Add spark-hive staging repo 832c164 [Steve Loughran] SPARK-8064 try to supress compiler warnings on Complex.java pasted-thrift-code 312c0d4 [Steve Loughran] SPARK-8064 maven/ivy dependency purge; calcite declaration needed fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand" c188048 [Steve Loughran] SPARK-8064 manage the Hive depencencies to that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first 4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests 314eb3c [Steve Loughran] SPARK-8064 deprecation warning noise in one of the tests 17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly. d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options 23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens 54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase 0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1 dc73ece [Steve Loughran] SPARK-8064 revert to master's determinstic pushdown strategy d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType 051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark 6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call e6121e5 [Steve Loughran] SPARK-8064 address review comments aa43dc6 [Steve Loughran] SPARK-8064 more robust teardown on JavaMetastoreDatasourcesSuite f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text 8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output. 5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. does not address the issue 642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing 97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised. 335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log. 3ed872f [Steve Loughran] SPARK-8064 rename field double to dbl bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes 41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions 2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name 1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6 0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread 13abaf1 [Steve Loughran] SPARK-8064 Hive compatibilty tests sin sync with explain/show output from Hive 1.2.1 d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops 26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT 3d64523 [Steve Loughran] SPARK-8064 improve diagns on uknown token; fix scalastyle failure d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1 1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text 8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause. 463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output 2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec 1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec 75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port" 3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression? 27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings 00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now) cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package 6c310b4 [Steve Loughran] SPARK-8064 subclass Hive ServerOptionsProcessor to make it public again f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere 4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1	2015-08-03 15:24:42 -07:00
Reynold Xin	b2e4b85d2d	Revert "[SPARK-9372] [SQL] Filter nulls in join keys" This reverts commit `687c8c3715`.	2015-08-03 14:51:15 -07:00
Andrew Or	702aa9d7fb	[SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations This patch exposes the memory used by internal data structures on the SparkUI. This tracks memory used by all spilling operations and SQL operators backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`, `GeneratedAggregate` etc. The metric exposed is "peak execution memory", which broadly refers to the peak in-memory sizes of each of these data structure. A separate patch will extend this by linking the new information to the SQL operators themselves. <img width="950" alt="screen shot 2015-07-29 at 7 43 17 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974776/b90fc980-362a-11e5-9e2b-842da75b1641.png"> <img width="802" alt="screen shot 2015-07-29 at 7 43 05 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974777/baa76492-362a-11e5-9b77-e364a6a6b64e.png"> <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7770) <!-- Reviewable:end --> Author: Andrew Or <andrew@databricks.com> Closes #7770 from andrewor14/expose-memory-metrics and squashes the following commits: 9abecb9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics f5b0d68 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics d7df332 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 8eefbc5 [Andrew Or] Fix non-failing tests 9de2a12 [Andrew Or] Fix tests due to another logical merge conflict 876bfa4 [Andrew Or] Fix failing test after logical merge conflict 361a359 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 40b4802 [Andrew Or] Fix style? d0fef87 [Andrew Or] Fix tests? b3b92f6 [Andrew Or] Address comments 0625d73 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics c00a197 [Andrew Or] Fix potential NPEs 10da1cd [Andrew Or] Fix compile 17f4c2d [Andrew Or] Fix compile? a87b4d0 [Andrew Or] Fix compile? d70874d [Andrew Or] Fix test compile + address comments 2840b7d [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 6aa2f7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics b889a68 [Andrew Or] Minor changes: comments, spacing, style 663a303 [Andrew Or] UnsafeShuffleWriter: update peak memory before close d090a94 [Andrew Or] Fix style 2480d84 [Andrew Or] Expand test coverage 5f1235b [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 1ecf678 [Andrew Or] Minor changes: comments, style, unused imports 0b6926c [Andrew Or] Oops 111a05e [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics a7a39a5 [Andrew Or] Strengthen presence check for accumulator a919eb7 [Andrew Or] Add tests for unsafe shuffle writer 23c845d [Andrew Or] Add tests for SQL operators a757550 [Andrew Or] Address comments b5c51c1 [Andrew Or] Re-enable test in JavaAPISuite 5107691 [Andrew Or] Add tests for internal accumulators 59231e4 [Andrew Or] Fix tests 9528d09 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 5b5e6f3 [Andrew Or] Add peak execution memory to summary table + tooltip 92b4b6b [Andrew Or] Display peak execution memory on the UI eee5437 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics d9b9015 [Andrew Or] Track execution memory in unsafe shuffles 770ee54 [Andrew Or] Track execution memory in broadcast joins 9c605a4 [Andrew Or] Track execution memory in GeneratedAggregate 9e824f2 [Andrew Or] Add back execution memory tracking for *ExternalSort 4ef4cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics e6c3e2f [Andrew Or] Move internal accumulators creation to Stage a417592 [Andrew Or] Expose memory metrics in UnsafeExternalSorter 3c4f042 [Andrew Or] Track memory usage in ExternalAppendOnlyMap / ExternalSorter bd7ab3f [Andrew Or] Add internal accumulators to TaskContext	2015-08-03 14:22:07 -07:00
Reynold Xin	7a9d09f0bb	[SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes. Author: Reynold Xin <rxin@databricks.com> Closes #7897 from rxin/calculateBitSetWidthInBytes and squashes the following commits: 2e73b3a [Reynold Xin] [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes.	2015-08-03 11:22:02 -07:00
Joseph Batchik	dfe7bd168d	[SPARK-9511] [SQL] Fixed Table Name Parsing The issue was that the tokenizer was parsing "1one" into the numeric 1 using the code on line 110. I added another case to accept strings that start with a number and then have a letter somewhere else in it as well. Author: Joseph Batchik <joseph.batchik@cloudera.com> Closes #7844 from JDrit/parse_error and squashes the following commits: b8ca12f [Joseph Batchik] fixed parsing issue by adding another case	2015-08-03 11:17:38 -07:00
Wenchen Fan	137f47865d	[SPARK-9551][SQL] add a cheap version of copy for UnsafeRow to reuse a copy buffer Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7885 from cloud-fan/cheap-copy and squashes the following commits: 0900ca1 [Wenchen Fan] replace == with === 73f4ada [Wenchen Fan] add tests 07b865a [Wenchen Fan] add a cheap version of copy	2015-08-03 04:21:15 -07:00
Yin Huai	1ebd41b141	[SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row This PR adds a base aggregation iterator `AggregationIterator`, which is used to create `SortBasedAggregationIterator` (for sort-based aggregation) and `UnsafeHybridAggregationIterator` (first it tries hash-based aggregation and falls back to the sort-based aggregation (using external sorter) if we cannot allocate memory for the map). With these two iterators, we will not need existing iterators and I am removing those. Also, we can use a single physical `Aggregate` operator and it internally determines what iterators to used. https://issues.apache.org/jira/browse/SPARK-9240 Author: Yin Huai <yhuai@databricks.com> Closes #7813 from yhuai/AggregateOperator and squashes the following commits: e317e2b [Yin Huai] Remove unnecessary change. 74d93c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into AggregateOperator ba6afbc [Yin Huai] Add a little bit more comments. c9cf3b6 [Yin Huai] update 0f1b06f [Yin Huai] Remove unnecessary code. 21fd15f [Yin Huai] Remove unnecessary change. 964f88b [Yin Huai] Implement fallback strategy. b1ea5cf [Yin Huai] wip 7fcbd87 [Yin Huai] Add a flag to control what iterator to use. 533d5b2 [Yin Huai] Prepare for fallback! 33b7022 [Yin Huai] wip bd9282b [Yin Huai] UDAFs now supports UnsafeRow. f52ee53 [Yin Huai] wip 3171f44 [Yin Huai] wip d2c45a0 [Yin Huai] wip f60cc83 [Yin Huai] Also check input schema. af32210 [Yin Huai] Check iter.hasNext before we create an iterator because the constructor of the iterato will read at least one row from a non-empty input iter. 299008c [Yin Huai] First round cleanup. 3915bac [Yin Huai] Create a base iterator class for aggregation iterators and add the initial version of the hybrid iterator.	2015-08-03 00:23:08 -07:00
Yijie Shen	98d6d9c7a9	[SPARK-9549][SQL] fix bugs in expressions JIRA: https://issues.apache.org/jira/browse/SPARK-9549 This PR fix the following bugs: 1. `UnaryMinus`'s codegen version would fail to compile when the input is `Long.MinValue` 2. `BinaryComparison` would fail to compile in codegen mode when comparing Boolean types. 3. `AddMonth` would fail if passed a huge negative month, which would lead accessing negative index of `monthDays` array. 4. `Nanvl` with different type operands. Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7882 from yjshen/minor_bug_fix and squashes the following commits: 41bbd2c [Yijie Shen] fix bug in Nanvl type coercion 3dee204 [Yijie Shen] address comments 4fa5de0 [Yijie Shen] fix bugs in expressions	2015-08-03 00:15:24 -07:00
Yin Huai	687c8c3715	[SPARK-9372] [SQL] Filter nulls in join keys This PR adds an optimization rule, `FilterNullsInJoinKey`, to add `Filter` before join operators to filter out rows having null values for join keys. This optimization is guarded by a new SQL conf, `spark.sql.advancedOptimization`. The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations. Author: Yin Huai <yhuai@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #7768 from JoshRosen/filter-nulls-in-join-key and squashes the following commits: c02fc3f [Yin Huai] Address Josh's comments. 0a8e096 [Yin Huai] Update comments. ea7d5a6 [Yin Huai] Make sure we do not keep adding filters. be88760 [Yin Huai] Make it clear that FilterNullsInJoinKeySuite.scala is used to test FilterNullsInJoinKey. 8bb39ad [Yin Huai] Fix non-deterministic tests. 303236b [Josh Rosen] Revert changes that are unrelated to null join key filtering 40eeece [Josh Rosen] Merge remote-tracking branch 'origin/master' into filter-nulls-in-join-key c57a954 [Yin Huai] Bug fix. d3d2e64 [Yin Huai] First round of cleanup. f9516b0 [Yin Huai] Style c6667e7 [Yin Huai] Add PartitioningCollection. e616d3b [Yin Huai] wip 7c2d2d8 [Yin Huai] Bug fix and refactoring. 69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning. d5b84c3 [Yin Huai] Do not add unnessary filters. 2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.	2015-08-02 23:32:09 -07:00
Yin Huai	114ff926fc	[SPARK-2205] [SQL] Avoid unnecessary exchange operators in multi-way joins This PR adds `PartitioningCollection`, which is used to represent the `outputPartitioning` for SparkPlans with multiple children (e.g. `ShuffledHashJoin`). So, a `SparkPlan` can have multiple descriptions of its partitioning schemes. Taking `ShuffledHashJoin` as an example, it has two descriptions of its partitioning schemes, i.e. `left.outputPartitioning` and `right.outputPartitioning`. So when we have a query like `select * from t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)` will only have three Exchange operators (when shuffled joins are needed) instead of four. The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7773) <!-- Reviewable:end --> Author: Yin Huai <yhuai@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #7773 from JoshRosen/multi-way-join-planning-improvements and squashes the following commits: 5c45924 [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements cd8269b [Josh Rosen] Refactor test to use SQLTestUtils 2963857 [Yin Huai] Revert unnecessary SqlConf change. 73913f7 [Yin Huai] Add comments and test. Also, revert the change in ShuffledHashOuterJoin for now. 4a99204 [Josh Rosen] Delete unrelated expression change 884ab95 [Josh Rosen] Carve out only SPARK-2205 changes. 247e5fa [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements c57a954 [Yin Huai] Bug fix. d3d2e64 [Yin Huai] First round of cleanup. f9516b0 [Yin Huai] Style c6667e7 [Yin Huai] Add PartitioningCollection. e616d3b [Yin Huai] wip 7c2d2d8 [Yin Huai] Bug fix and refactoring. 69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning. d5b84c3 [Yin Huai] Do not add unnessary filters. 2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.	2015-08-02 20:44:23 -07:00
Reynold Xin	30e89111d6	[SPARK-9546][SQL] Centralize orderable data type checking. This pull request creates two isOrderable functions in RowOrdering that can be used to check whether a data type or a sequence of expressions can be used in sorting. Author: Reynold Xin <rxin@databricks.com> Closes #7880 from rxin/SPARK-9546 and squashes the following commits: f9e322d [Reynold Xin] Fixed tests. 0439b43 [Reynold Xin] [SPARK-9546][SQL] Centralize orderable data type checking.	2015-08-02 20:12:03 -07:00
Reynold Xin	9d03ad910b	[SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter. The detailed approach is documented in UnsafeKVExternalSorterSuite.testKVSorter(), working as follows: 1. Create input by generating data randomly based on the given key/value schema (which is also randomly drawn from a list of candidate types) 2. Run UnsafeKVExternalSorter on the generated data 3. Collect the output from the sorter, and make sure the keys are sorted in ascending order 4. Sort the input by both key and value, and sort the sorter output also by both key and value. Compare the sorted input and sorted output together to make sure all the key/values match. 5. Check memory allocation to make sure there is no memory leak. There is also a spill flag. When set to true, the sorter will spill probabilistically roughly every 100 records. Author: Reynold Xin <rxin@databricks.com> Closes #7873 from rxin/kvsorter-randomized-test and squashes the following commits: a08c251 [Reynold Xin] Resource cleanup. 0488b5c [Reynold Xin] [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.	2015-08-02 17:54:30 -07:00
Reynold Xin	2e981b7bfa	[SPARK-9531] [SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter This pull request adds a destructAndCreateExternalSorter method to UnsafeFixedWidthAggregationMap. The new method does the following: 1. Creates a new external sorter UnsafeKVExternalSorter 2. Adds all the data into an in-memory sorter, sorts them 3. Spills the sorted in-memory data to disk This method can be used to fallback to sort-based aggregation when under memory pressure. The pull request also includes accounting fixes from JoshRosen. TODOs (that can be done in follow-up PRs) - [x] Address Josh's feedbacks from #7849 - [x] More documentation and test cases - [x] Make sure we are doing memory accounting correctly with test cases (e.g. did we release the memory in BytesToBytesMap twice?) - [ ] Look harder at possible memory leaks and exception handling - [ ] Randomized tester for the KV sorter as well as the aggregation map Author: Reynold Xin <rxin@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #7860 from rxin/kvsorter and squashes the following commits: 986a58c [Reynold Xin] Bug fix. 599317c [Reynold Xin] Style fix and slightly more compact code. fe7bd4e [Reynold Xin] Bug fixes. fd71bef [Reynold Xin] Merge remote-tracking branch 'josh/large-records-in-sql-sorter' into kvsorter-with-josh-fix 3efae38 [Reynold Xin] More fixes and documentation. 45f1b09 [Josh Rosen] Ensure that spill files are cleaned up f6a9bd3 [Reynold Xin] Josh feedback. 9be8139 [Reynold Xin] Remove testSpillFrequency. 7cbe759 [Reynold Xin] [SPARK-9531][SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter. ae4a8af [Josh Rosen] Detect leaked unsafe memory in UnsafeExternalSorterSuite. 52f9b06 [Josh Rosen] Detect ShuffleMemoryManager leaks in UnsafeExternalSorter.	2015-08-02 12:32:14 -07:00
Reynold Xin	8eafa2aeb6	[SPARK-9208][SQL] Sort DataFrame functions alphabetically. Author: Reynold Xin <rxin@databricks.com> Closes #7861 from rxin/api-audit and squashes the following commits: 7200256 [Reynold Xin] [SPARK-9208][SQL] Sort DataFrame functions alphabetically.	2015-08-02 11:36:11 -07:00
Davies Liu	16b928c543	[SPARK-9529] [SQL] improve TungstenSort on DecimalType Generate prefix for DecimalType, fix the random generator of decimal cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #7857 from davies/sort_decimal and squashes the following commits: 2433959 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_decimal de24253 [Davies Liu] fix style 0a54c1a [Davies Liu] sort decimal	2015-08-01 23:36:06 -07:00
Davies Liu	c1b0cbd762	[SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff, to_utc_timestamp, from_utc_timestamp This PR is based on #7643 , thanks to adrian-wang Author: Davies Liu <davies@databricks.com> Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7847 from davies/datediff and squashes the following commits: 74333d7 [Davies Liu] fix bug 22d8a8c [Davies Liu] optimize 85cdd21 [Davies Liu] remove unnecessary tests 241d90c [Davies Liu] Merge branch 'master' of github.com:apache/spark into datediff e9dc0f5 [Davies Liu] fix datediff/to_utc_timestamp/from_utc_timestamp c360447 [Daoyuan Wang] function datediff, to_utc_timestamp, from_utc_timestamp (commits merged)	2015-08-01 21:46:46 -07:00
HuJiayin	00cd92f32f	[SPARK-8269] [SQL] string function: initcap This PR is based on #7208 , thanks to HuJiayin Closes #7208 Author: HuJiayin <jiayin.hu@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7850 from davies/initcap and squashes the following commits: 54472e9 [Davies Liu] fix python test 17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap 3a906e4 [Davies Liu] implement title case in UTF8String 8b2506a [HuJiayin] Update functions.py 2cd43e5 [HuJiayin] fix python style check b616c0e [HuJiayin] add python api 1f5a0ef [HuJiayin] add codegen 7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap 6a0b958 [HuJiayin] add column c79482d [HuJiayin] support soundex 7ce416b [HuJiayin] support initcap rebase code	2015-08-01 21:44:57 -07:00
Reynold Xin	3d1535d488	[SPARK-9520] [SQL] Support in-place sort in UnsafeFixedWidthAggregationMap This pull request adds a sortedIterator method to UnsafeFixedWidthAggregationMap that sorts its data in-place by the grouping key. This is needed so we can fallback to external sorting for aggregation. Author: Reynold Xin <rxin@databricks.com> Closes #7849 from rxin/bytes2bytes-sorting and squashes the following commits: 75018c6 [Reynold Xin] Updated documentation. 81a8694 [Reynold Xin] [SPARK-9520][SQL] Support in-place sort in UnsafeFixedWidthAggregationMap.	2015-08-01 13:20:26 -07:00
zhichao.li	c5166f7a69	[SPARK-8263] [SQL] substr/substring should also support binary type This is based on #7641, thanks to zhichao-li Closes #7641 Author: zhichao.li <zhichao.li@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7848 from davies/substr and squashes the following commits: 461b709 [Davies Liu] remove bytearry from tests b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr 01d795e [zhichao.li] scala style 99aa130 [zhichao.li] add substring to dataframe 4f68bfe [zhichao.li] add binary type support for substring	2015-08-01 08:48:46 -07:00
Cheng Hao	cf6c9ca32a	[SPARK-8232] [SQL] Add sort_array support This PR is based on #7581 , just fix the conflict. Author: Cheng Hao <hao.cheng@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7851 from davies/sort_array and squashes the following commits: a80ef66 [Davies Liu] fix conflict 7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array 664c960 [Cheng Hao] update the sort_array by using the ArrayData 276d2d5 [Cheng Hao] add empty line 0edab9c [Cheng Hao] Add asending/descending support for sort_array 80fc0f8 [Cheng Hao] Add type checking a42b678 [Cheng Hao] Add sort_array support	2015-08-01 08:32:29 -07:00
Davies Liu	60ea7ab4bb	Revert "[SPARK-8232] [SQL] Add sort_array support" This reverts commit `67ad4e21fc`.	2015-08-01 00:41:15 -07:00
Wenchen Fan	1d59a4162b	[SPARK-9480][SQL] add MapData and cleanup internal row stuff This PR adds a `MapData` as internal representation of map type in Spark SQL, and provides a default implementation with just 2 `ArrayData`. After that, we have specialized getters for all internal type, so I removed generic getter in `ArrayData` and added specialized `toArray` for it. Also did some refactor and cleanup for `InternalRow` and its subclasses. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7799 from cloud-fan/map-data and squashes the following commits: 77d482f [Wenchen Fan] fix python e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite 40cc9db [Wenchen Fan] add toString 6e06ec9 [Wenchen Fan] some more cleanup a90aca1 [Wenchen Fan] add MapData	2015-08-01 00:17:15 -07:00
Reynold Xin	d90f2cf7a2	[SPARK-9517][SQL] BytesToBytesMap should encode data the same way as UnsafeExternalSorter BytesToBytesMap current encodes key/value data in the following format: ``` 8B key length, key data, 8B value length, value data ``` UnsafeExternalSorter, on the other hand, encodes data this way: ``` 4B record length, data ``` As a result, we cannot pass records encoded by BytesToBytesMap directly into UnsafeExternalSorter for sorting. However, if we rearrange data slightly, we can then pass the key/value records directly into UnsafeExternalSorter: ``` 4B key+value length, 4B key length, key data, value data ``` Author: Reynold Xin <rxin@databricks.com> Closes #7845 from rxin/kvsort-rebase and squashes the following commits: 5716b59 [Reynold Xin] Fixed test. 2e62ccb [Reynold Xin] Updated BytesToBytesMap's data encoding to put the key first. a51b641 [Reynold Xin] Added a KV sorter interface.	2015-07-31 23:55:16 -07:00
Cheng Hao	67ad4e21fc	[SPARK-8232] [SQL] Add sort_array support Add expression `sort_array` support. Author: Cheng Hao <hao.cheng@intel.com> This patch had conflicts when merged, resolved by Committer: Davies Liu <davies.liu@gmail.com> Closes #7581 from chenghao-intel/sort_array and squashes the following commits: 664c960 [Cheng Hao] update the sort_array by using the ArrayData 276d2d5 [Cheng Hao] add empty line 0edab9c [Cheng Hao] Add asending/descending support for sort_array 80fc0f8 [Cheng Hao] Add type checking a42b678 [Cheng Hao] Add sort_array support	2015-07-31 23:11:22 -07:00
Liang-Chi Hsieh	3320b0ba26	[SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate JIRA: https://issues.apache.org/jira/browse/SPARK-9415 Following up #7787. We shouldn't use MapType as grouping keys and join keys too. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7819 from viirya/map_join_groupby and squashes the following commits: 005ee0c [Liang-Chi Hsieh] For comments. 7463398 [Liang-Chi Hsieh] MapType can't be used as join keys, grouping keys.	2015-07-31 22:26:30 -07:00
zhichao.li	6996bd2e81	[SPARK-8264][SQL]add substring_index function This PR is based on #7533 , thanks to zhichao-li Closes #7533 Author: zhichao.li <zhichao.li@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7843 from davies/str_index and squashes the following commits: 391347b [Davies Liu] add python api 3ce7802 [Davies Liu] fix substringIndex f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index 515519b [zhichao.li] add foldable and remove null checking 9546991 [zhichao.li] scala style 67c253a [zhichao.li] hide some apis and clean code b19b013 [zhichao.li] add codegen and clean code ac863e9 [zhichao.li] reduce the calling of numChars 12e108f [zhichao.li] refine unittest d92951b [zhichao.li] add lastIndexOf 52d7b03 [zhichao.li] add substring_index function	2015-07-31 21:18:01 -07:00
Reynold Xin	03377d2522	[SPARK-9358][SQL] Code generation for UnsafeRow joiner. This patch creates a code generated unsafe row concatenator that can be used to concatenate/join two UnsafeRows into a single UnsafeRow. Since it is inherently hard to test these low level stuff, the test suites employ randomized testing heavily in order to guarantee correctness. Author: Reynold Xin <rxin@databricks.com> Closes #7821 from rxin/rowconcat and squashes the following commits: 8717f35 [Reynold Xin] Rebase and code review. 72c5d8e [Reynold Xin] Fixed a bug. a84ed2e [Reynold Xin] Fixed offset. 40c3fb2 [Reynold Xin] Reset random data generator. f0913aa [Reynold Xin] Test fixes. 6687b6f [Reynold Xin] Updated documentation. 00354b9 [Reynold Xin] Support concat data as well. e9a4347 [Reynold Xin] Updated. 6269f96 [Reynold Xin] Fixed a bug . 0f89716 [Reynold Xin] [SPARK-9358][SQL][WIP] Code generation for UnsafeRow concat.	2015-07-31 21:09:00 -07:00
Josh Rosen	8cb415a4b9	[SPARK-9451] [SQL] Support entries larger than default page size in BytesToBytesMap & integrate with ShuffleMemoryManager This patch adds support for entries larger than the default page size in BytesToBytesMap. These large rows are handled by allocating special overflow pages to hold individual entries. In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager: - Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`. - Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory: - `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory. The caller can use this value to respond to the memory pressure (e.g. by spilling). - `UnsafeFixedWidthAggregationMap. getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory. - Updated all uses of these classes to handle these error conditions. - Added new tests for allocating large records and for allocations which fail due to memory pressure. - Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks. Author: Josh Rosen <joshrosen@databricks.com> Closes #7762 from JoshRosen/large-rows and squashes the following commits: ae7bc56 [Josh Rosen] Fix compilation 82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows 34ab943 [Josh Rosen] Remove semi 31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager. `626b33c` [Josh Rosen] Move code to sql/core and spark/core packages so that ShuffleMemoryManager can be integrated ec4484c [Josh Rosen] Move BytesToBytesMap from unsafe package to core. 642ed69 [Josh Rosen] Rename size to numElements bea1152 [Josh Rosen] Add basic test. 2cd3570 [Josh Rosen] Remove accidental duplicated code 07ff9ef [Josh Rosen] Basic support for large rows in BytesToBytesMap.	2015-07-31 19:19:27 -07:00
HuJiayin	4d5a6e7b60	[SPARK-8271][SQL]string function: soundex This PR brings SQL function soundex(), see https://issues.apache.org/jira/browse/HIVE-9738 It's based on #7115 , thanks to HuJiayin Author: HuJiayin <jiayin.hu@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7812 from davies/soundex and squashes the following commits: fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex a4bd6d8 [Davies Liu] fix soundex 2538908 [HuJiayin] add codegen soundex d15d329 [HuJiayin] add back ut ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark e2dec2c [HuJiayin] support soundex rebase code	2015-07-31 16:05:26 -07:00
Daoyuan Wang	83670fc9e6	[SPARK-8176] [SPARK-8197] [SQL] function to_date/ trunc This PR is based on #6988 , thanks to adrian-wang . This brings two SQL functions: to_date() and trunc(). Closes #6988 Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7805 from davies/to_date and squashes the following commits: 2c7beba [Davies Liu] Merge branch 'master' of github.com:apache/spark into to_date 310dd55 [Daoyuan Wang] remove dup test in rebase 980b092 [Daoyuan Wang] resolve rebase conflict a476c5a [Daoyuan Wang] address comments from davies d44ea5f [Daoyuan Wang] function to_date, trunc	2015-07-30 19:22:38 -07:00
Liang-Chi Hsieh	351eda0e2f	[SPARK-6319][SQL] Throw AnalysisException when using BinaryType on Join and Aggregate JIRA: https://issues.apache.org/jira/browse/SPARK-6319 Spark SQL uses plain byte arrays to represent binary values. However, the arrays are compared by reference rather than by values. Thus, we should not use BinaryType on Join and Aggregate in current implementation. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7787 from viirya/agg_no_binary_type and squashes the following commits: 4f76cac [Liang-Chi Hsieh] Throw AnalysisException when using BinaryType on Join and Aggregate.	2015-07-30 17:22:51 -07:00
Davies Liu	0b1a464b6e	[SPARK-9425] [SQL] support DecimalType in UnsafeRow This PR brings the support of DecimalType in UnsafeRow, for precision <= 18, it's settable, otherwise it's not settable. Author: Davies Liu <davies@databricks.com> Closes #7758 from davies/unsafe_decimal and squashes the following commits: 478b1ba [Davies Liu] address comments 536314c [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal 7c2e77a [Davies Liu] fix JoinedRow 76d6fa4 [Davies Liu] fix tests 99d3151 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal d49c6ae [Davies Liu] support DecimalType in UnsafeRow	2015-07-30 17:18:32 -07:00
Reynold Xin	e7a0976e99	[SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort. Author: Reynold Xin <rxin@databricks.com> Closes #7803 from rxin/SPARK-9458 and squashes the following commits: 5b032dc [Reynold Xin] Fix string. b670dbb [Reynold Xin] [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort.	2015-07-30 17:17:27 -07:00
Xiangrui Meng	df32669514	[SPARK-7157][SQL] add sampleBy to DataFrame This was previously committed but then reverted due to test failures (see #6769). Author: Xiangrui Meng <meng@databricks.com> Closes #7755 from rxin/SPARK-7157 and squashes the following commits: fbf9044 [Xiangrui Meng] fix python test 542bd37 [Xiangrui Meng] update test 604fe6d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157 f051afd [Xiangrui Meng] use udf instead of building expression f4e9425 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157 8fb990b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157 103beb3 [Xiangrui Meng] add Java-friendly sampleBy 991f26f [Xiangrui Meng] fix seed 4a14834 [Xiangrui Meng] move sampleBy to stat 832f7cc [Xiangrui Meng] add sampleBy to DataFrame	2015-07-30 17:16:03 -07:00
Daoyuan Wang	1abf7dc16c	[SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290] [SQL] functions: date_add, date_sub, add_months, months_between, time-interval calculation This PR is based on #7589 , thanks to adrian-wang Added SQL function date_add, date_sub, add_months, month_between, also add a rule for add/subtract of date/timestamp and interval. Closes #7589 cc rxin Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: Davies Liu <davies@databricks.com> Closes #7754 from davies/date_add and squashes the following commits: e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add 9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add 6224ce4 [Davies Liu] fix conclict bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add e47ff2c [Davies Liu] add python api, fix date functions 01943d0 [Davies Liu] Merge branch 'master' into date_add 522e91a [Daoyuan Wang] fix e8a639a [Daoyuan Wang] fix 42df486 [Daoyuan Wang] fix style 87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes 1a68e03 [Daoyuan Wang] poc of time interval calculation c506661 [Daoyuan Wang] function date_add , date_sub	2015-07-30 13:21:46 -07:00
Daoyuan Wang	6d94bf6ac1	[SPARK-8174] [SPARK-8175] [SQL] function unix_timestamp, from_unixtime unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string\|date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return null if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00". Jira: https://issues.apache.org/jira/browse/SPARK-8174 https://issues.apache.org/jira/browse/SPARK-8175 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7644 from adrian-wang/udfunixtime and squashes the following commits: 2fe20c4 [Daoyuan Wang] util.Date ea2ec16 [Daoyuan Wang] use util.Date for better performance a2cf929 [Daoyuan Wang] doc return null instead of 0 f6f070a [Daoyuan Wang] address comments from davies 6a4cbb3 [Daoyuan Wang] temp 56ded53 [Daoyuan Wang] rebase and address comments 14a8b37 [Daoyuan Wang] function unix_timestamp, from_unixtime	2015-07-30 11:14:09 -07:00
Josh Rosen	520ec0ff9d	[SPARK-8850] [SQL] Enable Unsafe mode by default This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues: List of fixed blockers: - [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741). - [x] Memory leak on grouped aggregation of empty input (fixed by #7560 to fix this) - [x] Update planner to also check whether codegen is enabled before planning unsafe operators. - [x] Investigate failing HiveThriftBinaryServerSuite test. This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD. This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used. - [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603). - [x] ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF. This is necessary for `UDFSuite` to pass. For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682. - [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6). - [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736. - [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`. - [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs: - [x] Wrong answer in `join_1to1` (fixed by #7680) - [x] Wrong answer in `join_nulls` (fixed by #7680) - [x] Managed memory OOM / leak in `lateral_view` - [x] Seems to hang indefinitely in `partcols1`. This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710. - [x] Error while freeing memory in `partcols1`: will be fixed by #7734. - [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well. - [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759). Author: Josh Rosen <joshrosen@databricks.com> Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits: 83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default 963f567 [Josh Rosen] Reduce buffer size for R tests d6986de [Josh Rosen] Lower page size in PySpark tests 013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects 5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used 715517b [Josh Rosen] Enable Unsafe by default	2015-07-30 10:45:32 -07:00
Wenchen Fan	c0cc0eaec6	[SPARK-9390][SQL] create a wrapper for array type Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7724 from cloud-fan/array-data and squashes the following commits: d0408a1 [Wenchen Fan] fix python 661e608 [Wenchen Fan] rebase f39256c [Wenchen Fan] fix hive... 6dbfa6f [Wenchen Fan] fix hive again... 8cb8842 [Wenchen Fan] remove element type parameter from getArray 43e9816 [Wenchen Fan] fix mllib e719afc [Wenchen Fan] fix hive 4346290 [Wenchen Fan] address comment d4a38da [Wenchen Fan] remove sizeInBytes and add license 7e283e2 [Wenchen Fan] create a wrapper for array type	2015-07-30 10:04:30 -07:00
Liang-Chi Hsieh	6175d6cfe7	[SPARK-8838] [SQL] Add config to enable/disable merging part-files when merging parquet schema JIRA: https://issues.apache.org/jira/browse/SPARK-8838 Currently all part-files are merged when merging parquet schema. However, in case there are many part-files and we can make sure that all the part-files have the same schema as their summary file. If so, we provide a configuration to disable merging part-files when merging parquet schema. In short, we need to merge parquet schema because different summary files may contain different schema. But the part-files are confirmed to have the same schema with summary files. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7238 from viirya/option_partfile_merge and squashes the following commits: 71d5b5f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge 8816f44 [Liang-Chi Hsieh] For comments. dbc8e6b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge afc2fa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge d4ed7e6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge df43027 [Liang-Chi Hsieh] Get dataStatuses' partitions based on all paths. 4eb2f00 [Liang-Chi Hsieh] Use given parameter. ea8f6e5 [Liang-Chi Hsieh] Correct the code comments. a57be0e [Liang-Chi Hsieh] Merge part-files if there are no summary files. 47df981 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge 4caf293 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge 0e734e0 [Liang-Chi Hsieh] Use correct API. 3b6be5b [Liang-Chi Hsieh] Fix key not found. 4bdd7e0 [Liang-Chi Hsieh] Don't read footer files if we can skip them. 8bbebcb [Liang-Chi Hsieh] Figure out how to test the config. bbd4ce7 [Liang-Chi Hsieh] Add config to enable/disable merging part-files when merging parquet schema.	2015-07-30 17:45:30 +08:00
Reynold Xin	5ba2d44068	Fix flaky HashedRelationSuite SparkEnv might not have been set in local unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #7784 from rxin/HashedRelationSuite and squashes the following commits: 435d64b [Reynold Xin] Fix flaky HashedRelationSuite	2015-07-30 01:21:39 -07:00
Reynold Xin	4a8bb9d00d	Revert "[SPARK-9458] Avoid object allocation in prefix generation." This reverts commit `9514d874f0`.	2015-07-30 01:04:24 -07:00
Joseph Batchik	1221849f91	[SPARK-8005][SQL] Input file name Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it. This supports: `df.select(inputFileName())` and `sqlContext.sql("select input_file_name() from table")` Author: Joseph Batchik <josephbatchik@gmail.com> Closes #7743 from JDrit/input_file_name and squashes the following commits: abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string d2f323d [Joseph Batchik] updates per review 102061f [Joseph Batchik] updates per review 75313f5 [Joseph Batchik] small fixes c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL	2015-07-29 23:35:55 -07:00
Reynold Xin	712465b68e	HOTFIX: disable HashedRelationSuite.	2015-07-29 22:51:06 -07:00
Reynold Xin	9514d874f0	[SPARK-9458] Avoid object allocation in prefix generation. In our existing sort prefix generation code, we use expression's eval method to generate the prefix, which results in object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation. I also removed the FLOAT prefix, opting for converting float directly to double. Author: Reynold Xin <rxin@databricks.com> Closes #7763 from rxin/sort-prefix and squashes the following commits: 5dc2f06 [Reynold Xin] [SPARK-9458] Avoid object allocation in prefix generation.	2015-07-29 20:46:03 -07:00
Reynold Xin	5340dfaf94	[SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType. We want to introduce a new IntervalType in 1.6 that is based on only the number of microseoncds, so interval can be compared. Renaming the existing IntervalType to CalendarIntervalType so we can do that in the future. Author: Reynold Xin <rxin@databricks.com> Closes #7745 from rxin/calendarintervaltype and squashes the following commits: 99f64e8 [Reynold Xin] One more line ... 13466c8 [Reynold Xin] Fixed tests. e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.	2015-07-29 13:49:22 -07:00
Wenchen Fan	708794e8aa	[SPARK-9251][SQL] do not order by expressions which still need evaluation as an offline discussion with rxin , it's weird to be computing stuff while doing sorting, we should only order by bound reference during execution. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7593 from cloud-fan/sort and squashes the following commits: 7b1bef7 [Wenchen Fan] add test daf206d [Wenchen Fan] add more comments 289bee0 [Wenchen Fan] do not order by expressions which still need evaluation	2015-07-29 00:08:45 -07:00
Davies Liu	15667a0afa	[SPARK-9281] [SQL] use decimal or double when parsing SQL Right now, we use double to parse all the float number in SQL. When it's used in expression together with DecimalType, it will turn the decimal into double as well. Also it will loss some precision when using double. This PR change to parse float number to decimal or double, based on it's using scientific notation or not, see https://msdn.microsoft.com/en-us/library/ms179899.aspx This is a break change, should we doc it somewhere? Author: Davies Liu <davies@databricks.com> Closes #7642 from davies/parse_decimal and squashes the following commits: 1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal 5e142b6 [Davies Liu] fix scala style eca99de [Davies Liu] fix tests 2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal f4a320b [Davies Liu] Update SqlParser.scala 1c48e34 [Davies Liu] use decimal or double when parsing SQL	2015-07-28 22:51:08 -07:00
Wenchen Fan	429b2f0df4	[SPARK-8608][SPARK-8609][SPARK-9083][SQL] reset mutable states of nondeterministic expression before evaluation and fix PullOutNondeterministic We will do local projection for LocalRelation, and thus reuse the same Expression object among multiply evaluations. We should reset the mutable states of Expression before evaluate it. Fix `PullOutNondeterministic` rule to make it work for `Sort`. Also got a chance to cleanup the dataframe test suite. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7674 from cloud-fan/show and squashes the following commits: 888934f [Wenchen Fan] fix sort c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`	2015-07-28 21:37:50 -07:00
Josh Rosen	e78ec1a8fa	[SPARK-9421] Fix null-handling bugs in UnsafeRow.getDouble, getFloat(), and get(ordinal, dataType) UnsafeRow.getDouble and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which is to return 0.0). In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors. This patch addresses both of these issues and adds a regression test. Author: Josh Rosen <joshrosen@databricks.com> Closes #7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits: c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite 6214682 [Josh Rosen] Fixes to null handling in UnsafeRow	2015-07-28 17:51:58 -07:00
Reynold Xin	6662ee2124	[SPARK-9418][SQL] Use sort-merge join as the default shuffle join. Sort-merge join is more robust in Spark since sorting can be made using the Tungsten sort operator. Author: Reynold Xin <rxin@databricks.com> Closes #7733 from rxin/smj and squashes the following commits: 61e4d34 [Reynold Xin] Fixed test case. 5ffd731 [Reynold Xin] Fixed JoinSuite. a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.	2015-07-28 17:42:35 -07:00
Reynold Xin	b7f54119f8	[SPARK-9420][SQL] Move expressions in sql/core package to catalyst. Since catalyst package already depends on Spark core, we can move those expressions into catalyst, and simplify function registry. This is a followup of #7478. Author: Reynold Xin <rxin@databricks.com> Closes #7735 from rxin/SPARK-8003 and squashes the following commits: 2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.	2015-07-28 17:03:59 -07:00
Josh Rosen	59b92add7c	[SPARK-9393] [SQL] Fix several error-handling bugs in ScriptTransform operator SparkSQL's ScriptTransform operator has several serious bugs which make debugging fairly difficult: - If exceptions are thrown in the writing thread then the child process will not be killed, leading to a deadlock because the reader thread will block while waiting for input that will never arrive. - TaskContext is not propagated to the writer thread, which may cause errors in upstream pipelined operators. - Exceptions which occur in the writer thread are not propagated to the main reader thread, which may cause upstream errors to be silently ignored instead of killing the job. This can lead to silently incorrect query results. - The writer thread is not a daemon thread, but it should be. In addition, the code in this file is extremely messy: - Lots of fields are nullable but the nullability isn't clearly explained. - Many confusing variable names: for instance, there are variables named `ite` and `iterator` that are defined in the same scope. - Some code was misindented. - The `*serdeClass` variables are actually expected to be single-quoted strings, which is really confusing: I feel that this parsing / extraction should be performed in the analyzer, not in the operator itself. - There were no unit tests for the operator itself, only end-to-end tests. This pull request addresses these issues, borrowing some error-handling techniques from PySpark's PythonRDD. Author: Josh Rosen <joshrosen@databricks.com> Closes #7710 from JoshRosen/script-transform and squashes the following commits: 16c44e2 [Josh Rosen] Update some comments 983f200 [Josh Rosen] Use unescapeSQLString instead of stripQuotes 6a06a8c [Josh Rosen] Clean up handling of quotes in serde class name 494cde0 [Josh Rosen] Propagate TaskContext to writer thread 323bb2b [Josh Rosen] Fix error-swallowing bug b31258d [Josh Rosen] Rename iterator variables to disambiguate. 88278de [Josh Rosen] Split ScriptTransformation writer thread into own class. 8b162b6 [Josh Rosen] Add failing test which demonstrates exception masking issue 4ee36a2 [Josh Rosen] Kill script transform subprocess when error occurs in input writer. bd4c948 [Josh Rosen] Skip launching of external command for empty partitions. b43e4ec [Josh Rosen] Clean up nullability in ScriptTransformation fa18d26 [Josh Rosen] Add basic unit test for script transform with 'cat' command.	2015-07-28 16:04:48 -07:00
Davies Liu	21825529ea	[SPARK-9247] [SQL] Use BytesToBytesMap for broadcast join This PR introduce BytesToBytesMap to UnsafeHashedRelation, use it in executor for better performance. It serialize all the key and values from java HashMap, put them into a BytesToBytesMap while deserializing. All the values for a same key are stored continuous to have better memory locality. This PR also address the comments for #7480 , do some clean up. Author: Davies Liu <davies@databricks.com> Closes #7592 from davies/unsafe_map2 and squashes the following commits: 42c578a [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_map2 fd09528 [Davies Liu] remove thread local cache and update docs 1c5ad8d [Davies Liu] fix test 5eb1b5a [Davies Liu] address comments in #7480 46f1f22 [Davies Liu] fix style fc221e0 [Davies Liu] use BytesToBytesMap for broadcast join	2015-07-28 15:56:19 -07:00
Joseph Batchik	b88b868eb3	[SPARK-8003][SQL] Added virtual column support to Spark Added virtual column support by adding a new resolution role to the query analyzer. Additional virtual columns can be added by adding case expressions to [the new rule](https://github.com/JDrit/spark/blob/virt_columns/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L1026) and my modifying the [logical plan](https://github.com/JDrit/spark/blob/virt_columns/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L216) to resolve them. This also solves [SPARK-8003](https://issues.apache.org/jira/browse/SPARK-8003) This allows you to perform queries such as: ```sql select spark__partition__id, count(*) as c from table group by spark__partition__id; ``` Author: Joseph Batchik <josephbatchik@gmail.com> Author: JD <jd@csh.rit.edu> Closes #7478 from JDrit/virt_columns and squashes the following commits: 7932bf0 [Joseph Batchik] adding spark__partition__id to hive as well f8a9c6c [Joseph Batchik] merging in master e49da48 [JD] fixes for @rxin's suggestions 60e120b [JD] fixing test in merge 4bf8554 [JD] merging in master c68bc0f [Joseph Batchik] Adding function register ability to SQLContext and adding a function for spark__partition__id()	2015-07-28 14:39:25 -07:00
Yin Huai	6cdcc21fe6	[SPARK-9196] [SQL] Ignore test DatetimeExpressionsSuite: function current_timestamp. This test is flaky. https://issues.apache.org/jira/browse/SPARK-9196 will track the fix of it. For now, let's disable this test. Author: Yin Huai <yhuai@databricks.com> Closes #7727 from yhuai/SPARK-9196-ignore and squashes the following commits: f92bded [Yin Huai] Ignore current_timestamp.	2015-07-28 13:16:48 -07:00
Aaron Davidson	35ef853b3f	[SPARK-9397] DataFrame should provide an API to find source data files if applicable Certain applications would benefit from being able to inspect DataFrames that are straightforwardly produced by data sources that stem from files, and find out their source data. For example, one might want to display to a user the size of the data underlying a table, or to copy or mutate it. This PR exposes an `inputFiles` method on DataFrame which attempts to discover the source data in a best-effort manner, by inspecting HadoopFsRelations and JSONRelations. Author: Aaron Davidson <aaron@databricks.com> Closes #7717 from aarondav/paths and squashes the following commits: ff67430 [Aaron Davidson] inputFiles 0acd3ad [Aaron Davidson] [SPARK-9397] DataFrame should provide an API to find source data files if applicable	2015-07-28 10:12:09 -07:00
Reynold Xin	9bbe0171cb	[SPARK-8196][SQL] Fix null handling & documentation for next_day. The original patch didn't handle nulls correctly for next_day. Author: Reynold Xin <rxin@databricks.com> Closes #7718 from rxin/next_day and squashes the following commits: 616a425 [Reynold Xin] Merged DatetimeExpressionsSuite into DateFunctionsSuite. faa78cf [Reynold Xin] Merged DatetimeFunctionsSuite into DateExpressionsSuite. 6c4fb6a [Reynold Xin] [SPARK-8196][SQL] Fix null handling & documentation for next_day.	2015-07-28 09:43:39 -07:00
Cheng Hao	9c5612f4e1	[MINOR] [SQL] Support mutable expression unit test with codegen projection This is actually contains 3 minor issues: 1) Enable the unit test(codegen) for mutable expressions (FormatNumber, Regexp_Replace/Regexp_Extract) 2) Use the `PlatformDependent.copyMemory` instead of the `System.arrayCopy` Author: Cheng Hao <hao.cheng@intel.com> Closes #7566 from chenghao-intel/codegen_ut and squashes the following commits: 24f43ea [Cheng Hao] enable codegen for mutable expression & UTF8String performance	2015-07-27 23:02:23 -07:00
Reynold Xin	60f08c7c87	[SPARK-9373][SQL] Support StructType in Tungsten projection This pull request updates GenerateUnsafeProjection to support StructType. If an input struct type is backed already by an UnsafeRow, GenerateUnsafeProjection copies the bytes directly into its buffer space without any conversion. However, if the input is not an UnsafeRow, GenerateUnsafeProjection runs the code generated recursively to convert the input into an UnsafeRow and then copies it into the buffer space. Also create a TungstenProject operator that projects data directly into UnsafeRow. Note that I'm not sure if this is the way we want to structure Unsafe+codegen operators, but we can defer that decision to follow-up pull requests. Author: Reynold Xin <rxin@databricks.com> Closes #7689 from rxin/tungsten-struct-type and squashes the following commits: 9162f42 [Reynold Xin] Support IntervalType in UnsafeRow's getter. be9f377 [Reynold Xin] Fixed tests. 10c4b7c [Reynold Xin] Format generated code. 77e8d0e [Reynold Xin] Fixed NondeterministicSuite. ac4951d [Reynold Xin] Yay. ac203bf [Reynold Xin] More comments. 9f36216 [Reynold Xin] Updated comment. 6b781fe [Reynold Xin] Reset the change in DataFrameSuite. 525b95b [Reynold Xin] Merged with master, more documentation & test cases. 321859a [Reynold Xin] [SPARK-9373][SQL] Support StructType in Tungsten projection [WIP]	2015-07-27 22:51:15 -07:00
Yijie Shen	63a492b931	[SPARK-8828] [SQL] Revert SPARK-5680 JIRA: https://issues.apache.org/jira/browse/SPARK-8828 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7667 from yjshen/revert_combinesum_2 and squashes the following commits: c37ccb1 [Yijie Shen] add test case 8377214 [Yijie Shen] revert spark.sql.useAggregate2 to its default value e2305ac [Yijie Shen] fix bug - avg on decimal column 7cb0e95 [Yijie Shen] [wip] resolving bugs 1fadb5a [Yijie Shen] remove occurance 17c6248 [Yijie Shen] revert SPARK-5680	2015-07-27 22:47:33 -07:00
Reynold Xin	3bc7055e26	Fixed a test failure.	2015-07-27 22:04:54 -07:00
Daoyuan Wang	2e7f99a004	[SPARK-8195] [SPARK-8196] [SQL] udf next_day last_day next_day, returns next certain dayofweek. last_day, returns the last day of the month which given date belongs to. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #6986 from adrian-wang/udfnlday and squashes the following commits: ef7e3da [Daoyuan Wang] fix 02b3426 [Daoyuan Wang] address 2 comments dc69630 [Daoyuan Wang] address comments from rxin 8846086 [Daoyuan Wang] address comments from rxin d09bcce [Daoyuan Wang] multi fix 1a9de3d [Daoyuan Wang] function next_day and last_day	2015-07-27 21:08:56 -07:00
Wenchen Fan	3ab7525dce	[SPARK-9355][SQL] Remove InternalRow.get generic getter call in columnar cache code Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7673 from cloud-fan/row-generic-getter-columnar and squashes the following commits: 88b1170 [Wenchen Fan] fix style eeae712 [Wenchen Fan] Remove Internal.get generic getter call in columnar cache code	2015-07-27 13:40:50 -07:00
Wenchen Fan	dd9ae7945a	[SPARK-9351] [SQL] remove literals from grouping expressions in Aggregate literals in grouping expressions have no effect at all, only make our grouping key bigger, so we should remove them in Optimizer. I also make old and new aggregation code consistent about literals in grouping here. In old aggregation, actually literals in grouping are already removed but new aggregation is not. So I explicitly make it a rule in Optimizer. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7583 from cloud-fan/minor and squashes the following commits: 471adff [Wenchen Fan] add test 0839925 [Wenchen Fan] use transformDown when rewrite final result expressions	2015-07-27 11:23:29 -07:00
Rene Treffer	aa19c696e2	[SPARK-4176] [SQL] Supports decimal types with precision > 18 in Parquet This PR is based on #6796 authored by rtreffer. To support large decimal precisions (> 18), we do the following things in this PR: 1. Making `CatalystSchemaConverter` support large decimal precision Decimal types with large precision are always converted to fixed-length byte array. 2. Making `CatalystRowConverter` support reading decimal values with large precision When the precision is > 18, constructs `Decimal` values with an unscaled `BigInteger` rather than an unscaled `Long`. 3. Making `RowWriteSupport` support writing decimal values with large precision In this PR we always write decimals as fixed-length byte array, because Parquet write path hasn't been refactored to conform Parquet format spec (see SPARK-6774 & SPARK-8848). Two follow-up tasks should be done in future PRs: - [ ] Writing decimals as `INT32`, `INT64` when possible while fixing SPARK-8848 - [ ] Adding compatibility tests as part of SPARK-5463 Author: Cheng Lian <lian@databricks.com> Closes #7455 from liancheng/spark-4176 and squashes the following commits: a543d10 [Cheng Lian] Fixes errors introduced while rebasing 9e31cdf [Cheng Lian] Supports decimals with precision > 18 for Parquet	2015-07-27 23:29:40 +08:00
Cheng Lian	72981bc8f0	[SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally. Author: Cheng Lian <lian@databricks.com> Closes #7623 from liancheng/spark-8131-multi-db and squashes the following commits: f3bcd4b [Cheng Lian] Addresses PR comments e0eb76a [Cheng Lian] Fixes styling issues 41e2207 [Cheng Lian] Fixes multi-database support d4d1ec2 [Cheng Lian] Adds multi-database test cases	2015-07-27 17:15:35 +08:00
Liang-Chi Hsieh	945d8bcbf6	[SPARK-9306] [SQL] Don't use SortMergeJoin when joining on unsortable columns JIRA: https://issues.apache.org/jira/browse/SPARK-9306 Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7645 from viirya/smj_unsortable and squashes the following commits: a240707 [Liang-Chi Hsieh] Use forall instead of exists for readability. 55221fa [Liang-Chi Hsieh] Shouldn't use SortMergeJoin when joining on unsortable columns.	2015-07-26 22:13:37 -07:00
Cheng Hao	1efe97dc9e	[SPARK-8867][SQL] Support list / describe function usage As Hive does, we need to list all of the registered UDF and its usage for user. We add the annotation to describe a UDF, so we can get the literal description info while registering the UDF. e.g. ```scala ExpressionDescription( usage = "_FUNC_(expr) - Returns the absolute value of the numeric value", extended = """> SELECT _FUNC_('-1') 1""") case class Abs(child: Expression) extends UnaryArithmetic { ... ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #7259 from chenghao-intel/desc_function and squashes the following commits: cf29bba [Cheng Hao] fixing the code style issue 5193855 [Cheng Hao] Add more powerful parser for show functions c645a6b [Cheng Hao] fix bug in unit test 78d40f1 [Cheng Hao] update the padding issue for usage 48ee4b3 [Cheng Hao] update as feedback 70eb4e9 [Cheng Hao] add show/describe function support	2015-07-26 18:34:19 -07:00
Cheng Lian	c025c3d0a1	[SPARK-9095] [SQL] Removes the old Parquet support This PR removes the old Parquet support: - Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites. - Renames `ParquetRelation2` to `ParquetRelation` - Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moved them to separate files. This follows naming convention used in other Parquet data models implemented in parquet-mr. It should be easier for developers who are familiar with Parquet to follow. There's still some other code that can be cleaned up. Especially `RowWriteSupport`. But I'd like to leave this part to SPARK-8848. Author: Cheng Lian <lian@databricks.com> Closes #7441 from liancheng/spark-9095 and squashes the following commits: c7b6e38 [Cheng Lian] Removes WriteToFile 2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation ca9e1b7 [Cheng Lian] Removes old Parquet support	2015-07-26 16:49:19 -07:00
Reynold Xin	4a01bfc2a2	[SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter. Author: Reynold Xin <rxin@databricks.com> Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits: ee2874c [Reynold Xin] Add a default implementation for getStruct. 1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType. 033ee88 [Reynold Xin] Removed getAs in non test code.	2015-07-25 23:52:37 -07:00
Reynold Xin	b1f4b4abfd	[SPARK-9348][SQL] Remove apply method on InternalRow. Author: Reynold Xin <rxin@databricks.com> Closes #7665 from rxin/remove-row-apply and squashes the following commits: 0b43001 [Reynold Xin] support getString in UnsafeRow. 176d633 [Reynold Xin] apply -> get. 2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.	2015-07-25 18:41:51 -07:00
Reynold Xin	215713e199	[SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection. The two are redundant. Once this patch is merged, I plan to remove the inbound conversions from unsafe aggregates. Author: Reynold Xin <rxin@databricks.com> Closes #7658 from rxin/unsafeconverters and squashes the following commits: ed19e6c [Reynold Xin] Updated support types. 2a56d7e [Reynold Xin] [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.	2015-07-25 01:37:41 -07:00
JD	723db13e06	[Spark-8668][SQL] Adding expr to functions Author: JD <jd@csh.rit.edu> Author: Joseph Batchik <josephbatchik@gmail.com> Closes #7606 from JDrit/expr and squashes the following commits: ad7f607 [Joseph Batchik] fixing python linter error 9d6daea [Joseph Batchik] removed order by per @rxin's comment 707d5c6 [Joseph Batchik] Added expr to fuctions.py 79df83c [JD] added example to the docs b89eec8 [JD] moved function up as per @rxin's comment 4960909 [JD] updated per @JoshRosen's comment 2cb329c [JD] updated per @rxin's comment 9a9ad0c [JD] removing unused import 6dc26d0 [JD] removed split 7f2222c [JD] Adding expr function as per SPARK-8668	2015-07-25 00:34:59 -07:00
Reynold Xin	431ca39be5	[SPARK-9285][SQL] Remove InternalRow's inheritance from Row. I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields. Author: Reynold Xin <rxin@databricks.com> Closes #7626 from rxin/internalRow and squashes the following commits: e124daf [Reynold Xin] Fixed test case. 805ceb7 [Reynold Xin] Commented out the failed test suite. f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining. 76d9081 [Reynold Xin] Fixed data sources. 7807f70 [Reynold Xin] Fixed DataFrameSuite. cb60cd2 [Reynold Xin] Code review & small bug fixes. 0a2948b [Reynold Xin] Fixed style. 3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.	2015-07-24 09:37:36 -07:00
Wenchen Fan	408e64b284	[SPARK-9294][SQL] cleanup comments, code style, naming typo for the new aggregation fix some comments and code style for https://github.com/apache/spark/pull/7458 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7619 from cloud-fan/agg-clean and squashes the following commits: 3925457 [Wenchen Fan] one more... cc78357 [Wenchen Fan] one more cleanup 26f6a93 [Wenchen Fan] some minor cleanup for the new aggregation	2015-07-23 23:40:01 -07:00
Davies Liu	8a94eb23d5	[SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType Romove Decimal.Unlimited (change to support precision up to 38, to match with Hive and other databases). In order to keep backward source compatibility, Decimal.Unlimited is still there, but change to Decimal(38, 18). If no precision and scale is provide, it's Decimal(10, 0) as before. Author: Davies Liu <davies@databricks.com> Closes #7605 from davies/decimal_unlimited and squashes the following commits: aa3f115 [Davies Liu] fix tests and style fb0d20d [Davies Liu] address comments bfaae35 [Davies Liu] fix style df93657 [Davies Liu] address comments and clean up 06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited 4c28969 [Davies Liu] fix tests 8d783cc [Davies Liu] fix tests 788631c [Davies Liu] fix double with decimal in Union/except 1779bde [Davies Liu] fix scala style c9c7c78 [Davies Liu] remove Decimal.Unlimited	2015-07-23 18:31:13 -07:00
Reynold Xin	fb36397b3c	Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow" Reverts ObjectPool. As it stands, it has a few problems: 1. ObjectPool doesn't work with spilling and memory accounting. 2. I don't think in the long run the idea of an object pool is what we want to support, since it essentially goes back to unmanaged memory, and creates pressure on GC, and is hard to account for the total in memory size. 3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result, actually introduced branches when reading non primitive data types. If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation. This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look. Author: Reynold Xin <rxin@databricks.com> Closes #7591 from rxin/revert-object-pool and squashes the following commits: 01db0bc [Reynold Xin] Scala style. eda89fc [Reynold Xin] Fixed describe. 2967118 [Reynold Xin] Fixed accessor for JoinedRow. e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool 657855f [Reynold Xin] Temp commit. c20f2c8 [Reynold Xin] Style fix. fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"	2015-07-23 01:51:34 -07:00
Davies Liu	e0b7ba59a1	[SPARK-9024] Unsafe HashJoin/HashOuterJoin/HashSemiJoin This PR introduce unsafe version (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, including the broadcast one and shuffle one (except FullOuterJoin, which is better to be implemented using SortMergeJoin). It use HashMap to store UnsafeRow right now, will change to use BytesToBytesMap for better performance (in another PR). Author: Davies Liu <davies@databricks.com> Closes #7480 from davies/unsafe_join and squashes the following commits: 6294b1e [Davies Liu] fix projection 10583f1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join dede020 [Davies Liu] fix test 84c9807 [Davies Liu] address comments a05b4f6 [Davies Liu] support UnsafeRow in LeftSemiJoinBNL and BroadcastNestedLoopJoin 611d2ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 9481ae8 [Davies Liu] return UnsafeRow after join() ca2b40f [Davies Liu] revert unrelated change 68f5cd9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 0f4380d [Davies Liu] ada a comment 69e38f5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 1a40f02 [Davies Liu] refactor ab1690f [Davies Liu] address comments 60371f2 [Davies Liu] use UnsafeRow in SemiJoin a6c0b7d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 184b852 [Davies Liu] fix style 6acbb11 [Davies Liu] fix tests 95d0762 [Davies Liu] remove println bea4a50 [Davies Liu] Unsafe HashJoin	2015-07-22 13:02:43 -07:00
Yin Huai	c03299a18b	[SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation Improvement This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367. This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true. This new code path contains: * A new aggregate function interface (`AggregateFunction2`) and 7 built-int aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`) * A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`). * A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface . * A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together). With this change, `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is: 1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface. 2. When our planner is compiling the physical plan, it tries try to convert all aggregate functions to the ones built based on the new interface. The planner will fallback to the old code path if any of the following two conditions is true: * code-gen is disabled. * there is any function that cannot be converted (right now, Hive UDAFs). * the schema of grouping expressions contain any complex data type. * There are multiple distinct columns. Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having a aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate function. Thanks chenghao-intel for his initial work on it. Author: Yin Huai <yhuai@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #7458 from yhuai/UDAF and squashes the following commits: 7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it. b04d6c8 [Yin Huai] Remove unnecessary change. f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator. 3b43b24 [Yin Huai] bug fix. 00eb298 [Yin Huai] Make it compile. a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path. 8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose. dc96fd1 [Yin Huai] Many updates: 85c9c4b [Yin Huai] newline. 43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF c3614d7 [Yin Huai] Handle single distinct column. 68b8ee9 [Yin Huai] Support single distinct column set. WIP 3013579 [Yin Huai] Format. d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it. e243ca6 [Yin Huai] Add aggregation iterators. a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum. 594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2. 380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places. a19fea6 [Yin Huai] Add UDAF interface. 262d4c4 [Yin Huai] Make it compile. b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 6edb5ac [Yin Huai] Format update. 70b169c [Yin Huai] Remove groupOrdering. 4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules. d821a34 [Yin Huai] Cleanup. 32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 5b46d41 [Yin Huai] Bug fix. aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions. 2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer. 1b490ed [Michael Armbrust] make hive test 8cfa6a9 [Michael Armbrust] add test 1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places. 072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references. f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF 39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences. b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version. 5c00f3f [Michael Armbrust] First draft of codegen 6bbc6ba [Michael Armbrust] now with correct answers\! f7996d0 [Michael Armbrust] Add AlgebraicAggregate dded1c5 [Yin Huai] wip	2015-07-21 23:26:11 -07:00

... 5 6 7 8 9 ...

1230 commits