ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Herman van Hovell	02a029df43	[SPARK-16220][SQL] Add scope to show functions ## What changes were proposed in this pull request? Spark currently shows all functions when issuing a `SHOW FUNCTIONS` command. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user-defined functions, or system functions. The following syntax can be used: ALL (default) ```SHOW FUNCTIONS``` ```SHOW ALL FUNCTIONS``` SYSTEM ```SHOW SYSTEM FUNCTIONS``` USER ```SHOW USER FUNCTIONS``` ## How was this patch tested? Updated tests and added tests to the DDLSuite Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13929 from hvanhovell/SPARK-16220.	2016-06-27 16:57:34 -07:00
Bill Chambers	c48c8ebc0a	[SPARK-16220][SQL] Revert Change to Bring Back SHOW FUNCTIONS Functionality ## What changes were proposed in this pull request? - Fix tests regarding show functions functionality - Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality. Cherry picked changes from this PR: https://github.com/apache/spark/pull/13413/files ## How was this patch tested? Unit tests. Author: Bill Chambers <bill@databricks.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #13916 from anabranch/master.	2016-06-27 11:50:34 -07:00
Takeshi YAMAMURO	3e4e868c85	[SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData ## What changes were proposed in this pull request? This PR removes `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests. ## How was this patch tested? Add a new test suite `MapDataSuite` for comparison tests. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13847 from maropu/UnsafeMapTest.	2016-06-27 21:45:22 +08:00
Sital Kedia	bf665a9586	[SPARK-15958] Make initial buffer size for the Sorter configurable ## What changes were proposed in this pull request? Currently the initial buffer size in the sorter is hard coded inside the code and is too small for large workload. As a result, the sorter spends significant time expanding the buffer size and copying the data. It would be useful to have it configurable. ## How was this patch tested? Tested by running a job on the cluster. Author: Sital Kedia <skedia@fb.com> Closes #13699 from sitalkedia/config_sort_buffer_upstream.	2016-06-25 09:13:39 +01:00
Takeshi YAMAMURO	d2e44d7db8	[SPARK-16192][SQL] Add type checks in CollectSet ## What changes were proposed in this pull request? `CollectSet` cannot have map-typed data because MapTypeData does not implement `equals`. So, this pr is to add type checks in `CheckAnalysis`. ## How was this patch tested? Added tests to check failures when we found map-typed data in `CollectSet`. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13892 from maropu/SPARK-16192.	2016-06-24 21:07:03 -07:00
Sean Owen	158af162ea	[SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3 ## What changes were proposed in this pull request? Replace use of `commons-lang` with `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of JDK `UnsupportedOperationException` ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #13843 from srowen/SPARK-16129.	2016-06-24 10:35:54 +01:00
Wenchen Fan	6a3c6276f5	[SQL][MINOR] ParserUtils.operationNotAllowed should throw exception directly ## What changes were proposed in this pull request? It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throws it. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #13874 from cloud-fan/style.	2016-06-23 20:20:55 -07:00
Dongjoon Hyun	91b1ef28d1	[SPARK-16164][SQL] Update `CombineFilters` to try to construct predicates with child predicate first ## What changes were proposed in this pull request? This PR changes `CombineFilters` to compose the final predicate condition by using (`child predicate` AND `parent predicate`) instead of (`parent predicate` AND `child predicate`). This is a best effort approach. Some other optimization rules may destroy this order by reorganizing conjunctive predicates. Reported Error Scenario Chris McCubbin reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan. ```scala import org.apache.spark.ml.feature._ val df1 = (0 until 3).map(_.toString).toDF val indexer = new StringIndexer() .setInputCol("value") .setOutputCol("idx") .setHandleInvalid("skip") .fit(df1) val df2 = (0 until 5).map(_.toString).toDF val predictions = indexer.transform(df2) predictions.show() // this is okay predictions.where('idx > 2).show() // this will throw an exception ``` Please see the notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html for error messages. ## How was this patch tested? Pass the Jenkins tests (including a new testcase). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13872 from dongjoon-hyun/SPARK-16164.	2016-06-23 15:27:43 -07:00
Davies Liu	10396d9505	[SPARK-16163] [SQL] Cache the statistics for logical plans ## What changes were proposed in this pull request? The calculation of statistics is no longer trivial; it can be very slow on a large query (for example, TPC-DS Q64 took several minutes to plan). During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics. For InMemoryRelation, the statistics could be updated after materialization, but that is only useful when the relation is used in another query (before planning), because once planning has finished, the statistics are not used anymore. ## How was this patch tested? Tested with TPC-DS Q64; it could be planned in a second after the patch. Author: Davies Liu <davies@databricks.com> Closes #13871 from davies/fix_statistics.	2016-06-23 11:48:48 -07:00
Davies Liu	20d411bc5d	[SPARK-16078][SQL] from_utc_timestamp/to_utc_timestamp should not depends on local timezone ## What changes were proposed in this pull request? Currently, we use the local timezone to parse or format a timestamp (TimestampType), then use a Long as the microseconds since epoch UTC. In from_utc_timestamp() and to_utc_timestamp(), we did not consider the local timezone, so they could return different results in different local timezones. This PR does the conversion based on human time (in the local timezone), so it should return the same result in any timezone. But because the mapping from absolute timestamp to human time is not exactly one-to-one, it will still return wrong results in some timezones (also at the beginning or end of DST). This PR is a best-effort fix. In the long term, we should make TimestampType timezone-aware to fix this completely. ## How was this patch tested? Tested these functions in all timezones. Author: Davies Liu <davies@databricks.com> Closes #13784 from davies/convert_tz.	2016-06-22 13:40:24 -07:00
Wenchen Fan	01277d4b25	[SPARK-16097][SQL] Encoders.tuple should handle null object correctly ## What changes were proposed in this pull request? Although the top-level input object cannot be null, when we use `Encoders.tuple` to combine 2 encoders, their input objects are no longer top level and can be null. We should handle this case. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #13807 from cloud-fan/bug.	2016-06-22 18:32:14 +08:00
Yin Huai	905f774b71	[SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables ## What changes were proposed in this pull request? This PR adds the static partition support to INSERT statement when the target table is a data source table. ## How was this patch tested? New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite. Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change. Author: Yin Huai <yhuai@databricks.com> Closes #13769 from yhuai/SPARK-16030-1.	2016-06-20 20:17:47 +08:00
Yin Huai	6d0f921aed	[SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement ## What changes were proposed in this pull request? This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #13766 from yhuai/caseSensitivity.	2016-06-19 21:45:53 -07:00
Davies Liu	001a589603	[SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time ## What changes were proposed in this pull request? Internally, we use Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expects a unix timestamp, so the result could be off by one hour (depending on whether Daylight Saving Time (DST) is in effect). This PR changes to use a best-effort approximation of the posix timestamp to look up the offset. Around a DST change, some times are not defined (for example, 2016-03-13 02:00:00 PST) or could map to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13652 from davies/fix_timezone.	2016-06-19 00:34:52 -07:00
Wenchen Fan	3d010c8375	[SPARK-16036][SPARK-16037][SQL] fix various table insertion problems ## What changes were proposed in this pull request? The current table insertion has some weird behaviours: 1. inserting into a partitioned table with mismatch columns has confusing error message for hive table, and wrong result for datasource table 2. inserting into a partitioned table without partition list has wrong result for hive table. This PR fixes these 2 problems. ## How was this patch tested? new test in hive `SQLQuerySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13754 from cloud-fan/insert2.	2016-06-18 10:32:27 -07:00
Reynold Xin	1a65e62a7f	[SPARK-16014][SQL] Rename optimizer rules to be more consistent ## What changes were proposed in this pull request? This small patch renames a few optimizer rules to make the naming more consistent, e.g. class name start with a verb. The main important "fix" is probably SamplePushDown -> PushProjectThroughSample. SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Closes #13732 from rxin/SPARK-16014.	2016-06-17 15:51:20 -07:00
gatorsmile	e5d703bca8	[SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC PARTITION #### What changes were proposed in this pull request? `IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the inserted statement is not shown in the table. This PR is to issue an exception in this case, just like what Hive does. Also issue an exception if users specify `IF NOT EXISTS` if users do not specify any `PARTITION` specification. #### How was this patch tested? Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13447 from gatorsmile/insertIfNotExist.	2016-06-16 22:54:02 -07:00
Pete Robbins	5ada606144	[SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory ## What changes were proposed in this pull request? `UTF8String` and all `Unsafe` classes are backed by either on-heap or off-heap byte arrays. The code generated version of `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other `Unsafe` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes seg-faults and all kinds of other undefined behavior when we use one of these buffered keys. This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem. This PR is largely based on the work of robbinspg and he should be credited for this. closes https://github.com/apache/spark/pull/13707 ## How was this patch tested? Manually tested on problematic workloads. Author: Pete Robbins <robbinspg@gmail.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13723 from hvanhovell/SPARK-15822-2.	2016-06-16 22:27:32 -07:00
Dongjoon Hyun	2d27eb1e75	[MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and exceptions. ## What changes were proposed in this pull request? This PR contains a few changes on code comments. - `HiveTypeCoercion` is renamed into `TypeCoercion`. - `NoSuchDatabaseException` is only used for the absence of database. - For partition type inference, only `DoubleType` is considered. ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13674 from dongjoon-hyun/minor_doc_types.	2016-06-16 14:27:09 -07:00
bomeng	bbad4cb48d	[SPARK-15978][SQL] improve 'show tables' command related codes ## What changes were proposed in this pull request? I've found some minor issues in "show tables" command: 1. In the `SessionCatalog.scala`, `listTables(db: String)` method will call `listTables(formatDatabaseName(db), "*")` to list all the tables for certain db, but in the method `listTables(db: String, pattern: String)`, this db name is formatted once more. So I think we should remove `formatDatabaseName()` in the caller. 2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, just like listDatabases(). ## How was this patch tested? The existing test cases should cover it. Author: bomeng <bmeng@us.ibm.com> Closes #13695 from bomeng/SPARK-15978.	2016-06-16 14:18:02 -07:00
gatorsmile	6451cf9270	[SPARK-15862][SQL] Better Error Message When Having Database Name in CACHE TABLE AS SELECT #### What changes were proposed in this pull request? ~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREATE VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~ ~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~ The table name in `CACHE TABLE AS SELECT` should NOT contain a database prefix like "database.table". Thus, this PR captures this in the Parser and outputs a better error message, instead of reporting that the view already exists. In addition, it refactors the `Parser` to generate table identifiers instead of returning the table name string. #### How was this patch tested? - Added a test case for caching and uncaching qualified table names - Fixed a few test cases that do not drop temp table at the end - Added the related test case for the issue resolved in this PR Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13572 from gatorsmile/cacheTableAsSelect.	2016-06-16 10:01:59 -07:00
Narine Kokhlikyan	7c6c692637	[SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR ## What changes were proposed in this pull request? gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API. Please, let me know what do you think and if you have any ideas to improve it. Thank you! ## How was this patch tested? Unit tests. 1. Primitive test with different column types 2. Add a boolean column 3. Compute average by a group Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Author: NarineK <narine.kokhlikyan@us.ibm.com> Closes #12836 from NarineK/gapply2.	2016-06-15 21:42:05 -07:00
Sean Zhong	9bd80ad6bd	[SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type ## What changes were proposed in this pull request? This PR fixes the problem that Divide Expression inside Aggregation function is casted to wrong type, which cause `select 1/2` and `select sum(1/2)`returning different result. Before the change: ``` scala> sql("select 1/2 as a").show() +---+ \| a\| +---+ \|0.5\| +---+ scala> sql("select sum(1/2) as a").show() +---+ \| a\| +---+ \|0 \| +---+ scala> sql("select sum(1 / 2) as a").schema res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true)) ``` After the change: ``` scala> sql("select 1/2 as a").show() +---+ \| a\| +---+ \|0.5\| +---+ scala> sql("select sum(1/2) as a").show() +---+ \| a\| +---+ \|0.5\| +---+ scala> sql("select sum(1/2) as a").schema res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true)) ``` ## How was this patch tested? Unit test. This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin Author: Sean Zhong <seanzhong@databricks.com> Closes #13651 from clockfly/SPARK-15776.	2016-06-15 14:34:15 -07:00
bomeng	42a28caf10	[SPARK-15952][SQL] fix "show databases" ordering issue ## What changes were proposed in this pull request? Two issues I've found for "show databases" command: 1. The returned database name list was not sorted, it only works when "like" was used together; (HIVE will always return a sorted list) 2. When it is used as sql("show databases").show, it will output a table with column named as "result", but for sql("show tables").show, it will output the column name as "tableName", so I think we should be consistent and use "databaseName" at least. ## How was this patch tested? Updated existing test case to test its ordering as well. Author: bomeng <bmeng@us.ibm.com> Closes #13671 from bomeng/SPARK-15952.	2016-06-14 18:35:29 -07:00
Takuya UESHIN	c5b7355819	[SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult. ## What changes were proposed in this pull request? `DataFrame` with plan overriding `sameResult` but not using canonicalized plan to compare can't cacheTable. The example is like: ``` val localRelation = Seq(1, 2, 3).toDF() localRelation.createOrReplaceTempView("localRelation") spark.catalog.cacheTable("localRelation") assert( localRelation.queryExecution.withCachedData.collect { case i: InMemoryRelation => i }.size == 1) ``` and this will fail as: ``` ArrayBuffer() had size 0 instead of expected size 1 ``` The reason is that when do `spark.catalog.cacheTable("localRelation")`, `CacheManager` tries to cache for the plan wrapped by `SubqueryAlias` but when planning for the DataFrame `localRelation`, `CacheManager` tries to find cached table for the not-wrapped plan because the plan for DataFrame `localRelation` is not wrapped. Some plans like `LocalRelation`, `LogicalRDD`, etc. override `sameResult` method, but not use canonicalized plan to compare so the `CacheManager` can't detect the plans are the same. This pr modifies them to use canonicalized plan when override `sameResult` method. ## How was this patch tested? Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13638 from ueshin/issues/SPARK-15915.	2016-06-14 10:52:13 -07:00
Wenchen Fan	688b6ef9dc	[SPARK-15932][SQL][DOC] document the contract of encoder serializer expressions ## What changes were proposed in this pull request? In our encoder framework, we imply that serializer expressions should use `BoundReference` to refer to the input object, and a lot of codes depend on this contract(e.g. ExpressionEncoder.tuple). This PR adds some document and assert in `ExpressionEncoder` to make it clearer. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #13648 from cloud-fan/comment.	2016-06-13 22:02:23 -07:00
Sandeep Singh	1842cdd4ee	[SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions ## What changes were proposed in this pull request? SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful because anytime it is run the result set contains over 100 built-in functions. ## How was this patch tested? CatalogSuite Author: Sandeep Singh <sandeep@techaddict.me> Closes #13413 from techaddict/SPARK-15663.	2016-06-13 21:58:52 -07:00
Sean Zhong	7b9071eeaa	[SPARK-15910][SQL] Check schema consistency when using Kryo encoder to convert DataFrame to Dataset ## What changes were proposed in this pull request? This PR enforces schema check when converting DataFrame to Dataset using Kryo encoder. For example. Before the change: Schema is NOT checked when converting DataFrame to Dataset using kryo encoder. ``` scala> case class B(b: Int) scala> implicit val encoder = Encoders.kryo[B] scala> val df = Seq((1)).toDF("b") scala> val ds = df.as[B] // Schema compatibility is NOT checked ``` After the change: Report AnalysisException since the schema is NOT compatible. ``` scala> val ds = Seq((1)).toDF("b").as[B] org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`b` AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType; ... ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13632 from clockfly/spark-15910.	2016-06-13 17:43:55 -07:00
Herman van Hövell tot Westerflier	1f8f2b5c2a	[SPARK-15370][SQL] Fix count bug # What changes were proposed in this pull request? This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule. After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns `NULL` on empty input. If the expression does not return `NULL`, the rule generates additional logic in the `Project` operator above the rewritten subquery. This additional logic intercepts `NULL` values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input. This PR takes over https://github.com/apache/spark/pull/13155. It only fixes an issue with `Literal` construction and style issues. All credit should go to frreiss. # How was this patch tested? Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite`). Ran all existing automated regression tests after merging with latest trunk. Author: frreiss <frreiss@us.ibm.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13629 from hvanhovell/SPARK-15370-cleanup.	2016-06-12 21:30:32 -07:00
Wenchen Fan	f5d38c3925	Revert "[SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter" This reverts commit `0ec279ffdf`.	2016-06-12 16:52:15 -07:00
Herman van Hovell	20b8f2c32a	[SPARK-15370][SQL] Revert PR "Update RewriteCorrelatedSuquery rule" This reverts commit `9770f6ee60`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13626 from hvanhovell/SPARK-15370-revert.	2016-06-12 15:06:37 -07:00
Ioana Delaney	0ff8a68b9f	[SPARK-15832][SQL] Embedded IN/EXISTS predicate subquery throws TreeNodeException ## What changes were proposed in this pull request? Queries with embedded existential sub-query predicates throws exception when building the physical plan. Example failing query: ```SQL scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1") scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2") scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 2 else 3 end) IN (select c2 from t1)").show() Binding attribute, tree: c2#239 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) ... at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66) at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38) at 
org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52) ``` Problem description: When the left hand side expression of an existential sub-query predicate contains another embedded sub-query predicate, the RewritePredicateSubquery optimizer rule does not resolve the embedded sub-query expressions into existential joins.For example, the above query has the following optimized plan, which fails during physical plan build. ```SQL == Optimized Logical Plan == Project [_1#224 AS c1#227] +- Join LeftSemi, (CASE WHEN predicate-subquery#255 [(_2#225 = c2#239)] THEN 2 ELSE 3 END = c2#228#262) : +- SubqueryAlias predicate-subquery#255 [(_2#225 = c2#239)] : +- LocalRelation [c2#239] :- LocalRelation [_1#224, _2#225] +- LocalRelation [c2#228#262] == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239 ``` Solution: In RewritePredicateSubquery, before rewriting the outermost predicate sub-query, resolve any embedded existential sub-queries. The Optimized plan for the above query after the changes looks like below. 
```SQL == Optimized Logical Plan == Project [_1#224 AS c1#227] +- Join LeftSemi, (CASE WHEN exists#285 THEN 2 ELSE 3 END = c2#228#284) :- Join ExistenceJoin(exists#285), (_2#225 = c2#239) : :- LocalRelation [_1#224, _2#225] : +- LocalRelation [c2#239] +- LocalRelation [c2#228#284] == Physical Plan == Project [_1#224 AS c1#227] +- BroadcastHashJoin [CASE WHEN exists#285 THEN 2 ELSE 3 END], [c2#228#284], LeftSemi, BuildRight :- *BroadcastHashJoin [_2#225], [c2#239], ExistenceJoin(exists#285), BuildRight : :- LocalTableScan [_1#224, _2#225] : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- LocalTableScan [c2#239] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [c2#228#284] +- LocalTableScan [c222#36], [[111],[222]] ``` ## How was this patch tested? Added new test cases in SubquerySuite.scala Author: Ioana Delaney <ioanamdelaney@gmail.com> Closes #13570 from ioana-delaney/fixEmbedSubPredV1.	2016-06-12 14:26:29 -07:00
frreiss	9770f6ee60	[SPARK-15370][SQL] Update RewriteCorrelatedScalarSubquery rule to fix COUNT bug ## What changes were proposed in this pull request? This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule. After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns NULL on empty input. If the expression does not return NULL, the rule generates additional logic in the Project operator above the rewritten subquery. This additional logic intercepts NULL values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input. ## How was this patch tested? Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite.scala`). Ran all existing automated regression tests after merging with latest trunk. Author: frreiss <frreiss@us.ibm.com> Closes #13155 from frreiss/master.	2016-06-12 14:21:10 -07:00
Eric Liang	e1f986c7a3	[SPARK-15860] Metrics for codegen size and perf ## What changes were proposed in this pull request? Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get. To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv. ## How was this patch tested? Unit tests Author: Eric Liang <ekl@databricks.com> Closes #13586 from ericl/spark-15860.	2016-06-11 23:16:21 -07:00
Eric Liang	c06c58bbbb	[SPARK-14851][CORE] Support radix sort with nullable longs ## What changes were proposed in this pull request? This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort. This strategy for nulls does mean the sort is no longer stable. cc davies ## How was this patch tested? Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts. Some test queries (best of 5 runs each). Before change: scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190437233227987 res3: Double = 4716.471091 After change: scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190367870952791 res4: Double = 2981.143045 Author: Eric Liang <ekl@databricks.com> Closes #13161 from ericl/sc-2998.	2016-06-11 15:42:58 -07:00
Sameer Agarwal	468da03e23	[SPARK-15678] Add support to REFRESH data source paths ## What changes were proposed in this pull request? Spark currently incorrectly continues to use cached data even if the underlying data is overwritten. Current behavior: ```scala val dir = "/tmp/test" sqlContext.range(1000).write.mode("overwrite").parquet(dir) val df = sqlContext.read.parquet(dir).cache() df.count() // outputs 1000 sqlContext.range(10).write.mode("overwrite").parquet(dir) sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset ``` This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path. Expected behavior: ```scala val dir = "/tmp/test" sqlContext.range(1000).write.mode("overwrite").parquet(dir) val df = sqlContext.read.parquet(dir).cache() df.count() // outputs 1000 sqlContext.range(10).write.mode("overwrite").parquet(dir) spark.catalog.refreshResource(dir) sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset ``` ## How was this patch tested? Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`. Author: Sameer Agarwal <sameer@databricks.com> Closes #13566 from sameeragarwal/refresh-path-2.	2016-06-10 20:43:18 -07:00
Cheng Lian	8e7b56f3d4	Revert "[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader" This reverts commit `bba5d7999f`.	2016-06-10 20:41:48 -07:00
Liang-Chi Hsieh	bba5d7999f	[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader ## What changes were proposed in this pull request? The base class `SpecificParquetRecordReaderBase` used by the vectorized parquet reader tries to get pushed-down filters from the given configuration. These pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch fixes this by setting up the filters to push down in the configuration for the reader. ## How was this patch tested? Existing tests should pass. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13371 from viirya/vectorized-reader-push-down-filter.	2016-06-10 18:23:59 -07:00
Narine Kokhlikyan	54f758b5fc	[SPARK-15884][SPARKR][SQL] Overriding stringArgs in MapPartitionsInR ## What changes were proposed in this pull request? As discussed in https://github.com/apache/spark/pull/12836 we need to override stringArgs method in MapPartitionsInR in order to avoid too large strings generated by "stringArgs" method based on the input arguments. In this case exclude some of the input arguments: serialized R objects. ## How was this patch tested? Existing test cases Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #13610 from NarineK/dapply_MapPartitionsInR_stringArgs.	2016-06-10 17:17:47 -07:00
Sela	127a6678d7	[SPARK-15489][SQL] Dataset kryo encoder won't load custom user settings ## What changes were proposed in this pull request? Serializer instantiation will consider existing SparkConf ## How was this patch tested? Manual test with `ImmutableList` (Guava) and `kryo-serializers`'s `Immutable*Serializer` implementations. Added Test Suite. Author: Sela <ansela@paypal.com> Closes #13424 from amitsela/SPARK-15489.	2016-06-10 14:36:51 -07:00
Herman van Hovell	e05a2feebe	[SPARK-15825] [SQL] Fix SMJ invalid results ## What changes were proposed in this pull request? Code generated `SortMergeJoin` failed with wrong results when using structs as keys. This could (eventually) be traced back to the use of a wrong row reference when comparing structs. ## How was this patch tested? TBD Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13589 from hvanhovell/SPARK-15822.	2016-06-10 14:29:05 -07:00
wangyang	026eb90644	[SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0 ## What changes were proposed in this pull request? In Scala, immutable.List.length is an expensive operation so we should avoid using Seq.length == 0 or Seq.length > 0, and use Seq.isEmpty and Seq.nonEmpty instead. ## How was this patch tested? Existing tests Author: wangyang <wangyang@haizhi.com> Closes #13601 from yangw1234/isEmpty.	2016-06-10 13:10:03 -07:00
Sandeep Singh	865ec32dd9	[MINOR][X][X] Replace all occurrences of None: Option with Option.empty ## What changes were proposed in this pull request? Replace all occurrences of `None: Option[X]` with `Option.empty[X]` ## How was this patch tested? Existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13591 from techaddict/minor-7.	2016-06-10 13:06:51 -07:00
Takuya UESHIN	667d4ea7b3	[SPARK-6320][SQL] Move planLater method into GenericStrategy. ## What changes were proposed in this pull request? This PR moves `QueryPlanner.planLater()` method into `GenericStrategy` for extra strategies to be able to use `planLater` in its strategy. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13147 from ueshin/issues/SPARK-6320.	2016-06-10 13:06:18 -07:00
Liang-Chi Hsieh	0ec279ffdf	[SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter ## What changes were proposed in this pull request? This patch moves some codes in `DataFrameWriter.insertInto` that belongs to `Analyzer`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13496 from viirya/move-analyzer-stuff.	2016-06-10 11:05:04 -07:00
Tathagata Das	abdb5d42c5	[SPARK-15812][SQL][STREAMING] Added support for sorting after streaming aggregation with complete mode ## What changes were proposed in this pull request? When the output mode is complete, the output of a streaming aggregation essentially contains the complete aggregates every time. So this is no different from a batch dataset within an incremental execution. Other non-streaming operations should be supported on this dataset. In this PR, I am just adding support for sorting, as it is a commonly useful functionality. Support for other operations will come later. ## How was this patch tested? Additional unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13549 from tdas/SPARK-15812.	2016-06-10 10:48:28 -07:00
Eric Liang	b914e1930f	[SPARK-15794] Should truncate toString() of very wide plans ## What changes were proposed in this pull request? With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact. It would also be nice to optimize string generation to avoid these sort of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability. ## How was this patch tested? Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold. ``` numFields = 5 wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 2000 wide x 50 rows (write in-mem) 2336 / 2558 0.0 23364.4 0.1X numFields = 25 wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 2000 wide x 50 rows (write in-mem) 4237 / 4465 0.0 42367.9 0.1X numFields = 100 wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 2000 wide x 50 rows (write in-mem) 10458 / 11223 0.0 104582.0 0.0X numFields = Infinity wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ [info] java.lang.OutOfMemoryError: Java heap space ``` Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #13537 from 
ericl/truncated-string.	2016-06-09 18:05:16 -07:00
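A minimal sketch of the truncation idea from the commit above: render at most a fixed number of fields and summarize the rest. The object name, signature, and the "more fields" wording are assumptions for illustration, not Spark's actual helper.

```scala
object TruncatedFields {
  // Hypothetical helper: renders at most `maxFields` entries, replacing the
  // tail with a single summary entry when the list is too long.
  def render(fields: Seq[String], maxFields: Int): String = {
    if (fields.length > maxFields) {
      // Reserve one slot for the summary entry.
      val shown = fields.take(math.max(maxFields - 1, 0))
      val dropped = fields.length - shown.length
      (shown :+ s"... $dropped more fields").mkString("[", ", ", "]")
    } else {
      fields.mkString("[", ", ", "]")
    }
  }
}
```

The output size is then bounded by the threshold rather than by the width of the table, which is what keeps string generation cheap for very wide plans.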
Herman van Hovell	b0768538e5	[SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions ## What changes were proposed in this pull request? The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolves this issue. This PR takes over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also improves exception handling (catching the actual exception instead of `Throwable`). All credits for this work should go to rajeshbalamohan. This PR closes https://github.com/apache/spark/pull/13522 ## How was this patch tested? Current tests. Author: Herman van Hovell <hvanhovell@databricks.com> Author: Rajesh Balamohan <rbalamohan@apache.org> Closes #13581 from hvanhovell/SPARK-14321.	2016-06-09 16:37:18 -07:00
Kevin Yu	99386fe398	[SPARK-15804][SQL] Include metadata in the toStructType ## What changes were proposed in this pull request? The helper function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, which causes the reported problem https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when Spark writes a dataframe with metadata to the parquet datasource. The code path is as follows: when Spark writes the dataframe to the parquet datasource through the InsertIntoHadoopFsRelationCommand, it builds the WriteRelation container and calls the helper function 'toStructType' to create a StructType containing StructFields. The metadata should be included there; otherwise, we lose the user-provided metadata. ## How was this patch tested? Added a test case in ParquetQuerySuite.scala Author: Kevin Yu <qyu@us.ibm.com> Closes #13555 from kevinyu98/spark-15804.	2016-06-09 09:50:09 -07:00
Herman van Hovell	91fbc880b6	[SPARK-15789][SQL] Allow reserved keywords in most places ## What changes were proposed in this pull request? The parser currently does not allow the use of some SQL keywords as table or field names. This PR adds support for all keywords as identifiers. The exception to this is table aliases; in this case most keywords are allowed except for join keywords (```anti, full, inner, left, semi, right, natural, on, join, cross```) and set-operator keywords (```union, intersect, except```). ## How was this patch tested? I have added/moved/renamed tests in the catalyst `*ParserSuite`s. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13534 from hvanhovell/SPARK-15789.	2016-06-07 17:01:11 -07:00
Sean Zhong	890baaca50	[SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREATE TEMPORARY VIEW USING..." instead ## What changes were proposed in this pull request? The current implementation of "CREATE TEMPORARY TABLE USING datasource..." does NOT create any intermediate temporary data directory such as a temporary HDFS folder; instead, it only stores a SQL string in memory. Probably we should use "TEMPORARY VIEW" instead. This PR assumes a temporary table has to link with some temporary intermediate data. It follows the definition of temporary table like this (from [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)): > A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query Example: ``` scala> spark.sql("CREATE temporary view my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')") scala> spark.sql("select c1, c2 from my_tab7").show() +----+-----+ | c1| c2| +----+-----+ |year| make| |2012|Tesla| ... ``` It NOW prints a deprecation warning if "CREATE TEMPORARY TABLE USING..." is used. ``` scala> spark.sql("CREATE temporary table my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')") 16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13414 from clockfly/create_temp_view_using.	2016-06-07 15:21:55 -07:00
Sean Zhong	5f731d6859	[SPARK-15792][SQL] Allows operator to change the verbosity in explain output ## What changes were proposed in this pull request? This PR allows customization of verbosity in explain output. After change, `dataframe.explain()` and `dataframe.explain(true)` has different verbosity output for physical plan. Currently, this PR only enables verbosity string for operator `HashAggregateExec` and `SortAggregateExec`. We will gradually enable verbosity string for more operators in future. Less verbose mode: dataframe.explain(extended = false) `output=[count(a)#85L]` is NOT displayed for HashAggregate. ``` scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2") scala> spark.sql("select count(a) from df2").explain() == Physical Plan == HashAggregate(key=[], functions=[count(1)]) +- Exchange SinglePartition +- HashAggregate(key=[], functions=[partial_count(1)]) +- LocalTableScan ``` Verbose mode: dataframe.explain(extended = true) `output=[count(a)#85L]` is displayed for HashAggregate. ``` scala> spark.sql("select count(a) from df2").explain(true) // "output=[count(a)#85L]" is added ... == Physical Plan == HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L]) +- Exchange SinglePartition +- HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L]) +- LocalTableScan ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13535 from clockfly/verbose_breakdown_2.	2016-06-06 22:59:25 -07:00
Sean Zhong	0e0904a2fc	[SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema ## What changes were proposed in this pull request? This PR makes sure the typed Filter doesn't change the Dataset schema. Before the change: ``` scala> val df = spark.range(0,9) scala> df.schema res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) scala> val afterFilter = df.filter(_=>true) scala> afterFilter.schema // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true. res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true)) ``` SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset. After the change: ``` scala> afterFilter.schema // schema is NOT changed. res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13529 from clockfly/spark-15632.	2016-06-06 22:40:21 -07:00
Josh Rosen	0b8d694999	[SPARK-15764][SQL] Replace N^2 loop in BindReferences BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n). Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow. /cc ericl Author: Josh Rosen <joshrosen@databricks.com> Closes #13505 from JoshRosen/bind-references-improvement.	2016-06-06 11:44:51 -07:00
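A minimal sketch of the lookup change described in the commit above, using a simplified stand-in for attributes rather than Catalyst's classes. `Attr`, `buildIndex`, and the field names are hypothetical; the sketch only shows replacing a per-reference linear scan with a one-time map from expression id to ordinal.

```scala
import scala.collection.mutable

object OrdinalLookup {
  // Simplified stand-in for an attribute; not Catalyst's AttributeReference.
  final case class Attr(exprId: Long, name: String, nullable: Boolean)

  // Builds the id -> ordinal map once. The first occurrence of an id wins,
  // matching what a left-to-right linear scan would have returned.
  def buildIndex(input: Seq[Attr]): Map[Long, Int] = {
    val arr = input.toArray // indexed access is O(1) even if `input` is a List
    val index = mutable.HashMap.empty[Long, Int]
    var i = 0
    while (i < arr.length) {
      if (!index.contains(arr(i).exprId)) index(arr(i).exprId) = i
      i += 1
    }
    index.toMap
  }
}
```

Each attribute reference is then resolved with a single hash lookup, so the up-front cost of building the map is shared across all references in the expression.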
Zheng RuiFeng	fd8af39713	[MINOR] Fix Typos 'an -> a' ## What changes were proposed in this pull request? `an -> a` Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one. ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13515 from zhengruifeng/an_a.	2016-06-06 09:35:47 +01:00
Wenchen Fan	30c4774f33	[SPARK-15657][SQL] RowEncoder should validate the data type of input object ## What changes were proposed in this pull request? This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column. This PR also removes the support to use `Product` as a valid external type of struct type. This support is added at https://github.com/apache/spark/pull/9712, but is incomplete, e.g. nested product, product in array are both not working. However, we never officially support this feature and I think it's ok to ban it. ## How was this patch tested? new tests in `RowEncoderSuite`. Author: Wenchen Fan <wenchen@databricks.com> Closes #13401 from cloud-fan/bug.	2016-06-05 15:59:52 -07:00
Weiqing Yang	0f307db5e1	[SPARK-15707][SQL] Make Code Neat - Use map instead of if check. ## What changes were proposed in this pull request? In the `forType` function of object `RandomDataGenerator`, the following code: `if (maybeSqlTypeGenerator.isDefined){ .... Some(generator) } else{ None }` is replaced with a call to `maybeSqlTypeGenerator.map`. ## How was this patch tested? All of the current unit tests passed. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #13448 from Sherry302/master.	2016-06-04 22:44:03 +01:00
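A minimal sketch of the refactoring pattern named in the commit above, on a made-up function rather than `RandomDataGenerator.forType`. Both method names are hypothetical; they are meant to show that `Option.map` expresses the same logic as the `isDefined` check.

```scala
object OptionMapExample {
  // Verbose form: explicit isDefined check, then re-wrapping in Some.
  def describeVerbose(maybeLength: Option[Int]): Option[String] =
    if (maybeLength.isDefined) Some(s"len=${maybeLength.get}") else None

  // Equivalent form: map applies the function only when a value is present.
  def describe(maybeLength: Option[Int]): Option[String] =
    maybeLength.map(n => s"len=$n")
}
```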
Josh Rosen	091f81e1f7	[SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads. We should also cache `StructType.hashCode`. In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch. Author: Josh Rosen <joshrosen@databricks.com> Closes #13504 from JoshRosen/metadata-fix.	2016-06-04 14:14:50 -07:00
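A minimal sketch of the caching approach described in the commit above, on a hypothetical immutable class rather than Spark's `Metadata` or `StructType`. The class name and fields are assumptions; the sketch shows a hash code computed at most once per instance plus a shared empty singleton.

```scala
// Hypothetical immutable container; caching the hash is only safe because
// `entries` can never change after construction.
final class FieldMeta(val entries: Map[String, String]) {
  // Computed on first use, then reused by every later hashCode call.
  private lazy val cachedHash: Int = entries.hashCode()

  override def hashCode(): Int = cachedHash

  override def equals(other: Any): Boolean = other match {
    case that: FieldMeta => entries == that.entries
    case _ => false
  }
}

object FieldMeta {
  // One shared instance instead of allocating a new empty value each time.
  val empty: FieldMeta = new FieldMeta(Map.empty)
}
```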
Wenchen Fan	11c83f83d5	[SPARK-15140][SQL] make the semantics of null input object for encoder clear ## What changes were proposed in this pull request? For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow a row to be null; only its columns can be null. This PR explicitly adds this constraint and throws an exception if users break it. ## How was this patch tested? several new tests Author: Wenchen Fan <wenchen@databricks.com> Closes #13469 from cloud-fan/null-object.	2016-06-03 14:28:19 -07:00
Wenchen Fan	61b80d552a	[SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema ## What changes were proposed in this pull request? There are 2 kinds of `GetStructField`: 1. resolved from `UnresolvedExtractValue`, and it will have a `name` property. 2. created when we build deserializer expression for nested tuple, no `name` property. When we want to validate the ordinals of nested tuple, we should only catch `GetStructField` without the name property. ## How was this patch tested? new test in `EncoderResolutionSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13474 from cloud-fan/ordinal-check.	2016-06-03 14:26:24 -07:00
gatorsmile	eb10b481ca	[SPARK-15286][SQL] Make the output readable for EXPLAIN CREATE TABLE and DESC EXTENDED #### What changes were proposed in this pull request? Before this PR, the output of EXPLAIN of following SQL is like ```SQL CREATE EXTERNAL TABLE extTable_with_partitions (key INT, value STRING) PARTITIONED BY (ds STRING, hr STRING) LOCATION '/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-b39a6185-8981-403b-a4aa-36fb2f4ca8a9' ``` ``ExecutedCommand CreateTableCommand CatalogTable(`extTable_with_partitions`,CatalogTableType(EXTERNAL),CatalogStorageFormat(Some(/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-dd234718-e85d-4c5a-8353-8f1834ac0323),Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(key,int,true,None), CatalogColumn(value,string,true,None), CatalogColumn(ds,string,true,None), CatalogColumn(hr,string,true,None)),List(ds, hr),List(),List(),-1,,1463026413544,-1,Map(),None,None,None), false`` After this PR, the output is like ``` ExecutedCommand : +- CreateTableCommand CatalogTable( Table:`extTable_with_partitions` Created:Thu Jun 02 21:30:54 PDT 2016 Last Access:Wed Dec 31 15:59:59 PST 1969 Type:EXTERNAL Schema:[`key` int, `value` string, `ds` string, `hr` string] Partition Columns:[`ds`, `hr`] Storage(Location:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a06083b8-8e88-4d07-9ff0-d6bd8d943ad3, InputFormat:org.apache.hadoop.mapred.TextInputFormat, OutputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false ``` This is also applicable to `DESC EXTENDED`. However, this does not have special handling for Data Source Tables. If needed, we need to move the logics of `DDLUtil`. Let me know if we should do it in this PR. Thanks! rxin liancheng #### How was this patch tested? Manual testing Author: gatorsmile <gatorsmile@gmail.com> Closes #13070 from gatorsmile/betterExplainCatalogTable.	
2016-06-03 13:56:22 -07:00
Josh Rosen	e526913989	[SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array. For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456). ### Perf results (from #13456 benchmarks) Before ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5 Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz parsing large select: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 1 select expressions 19 / 22 0.0 19119858.0 1.0X 10 select expressions 23 / 25 0.0 23208774.0 0.8X 100 select expressions 55 / 73 0.0 54768402.0 0.3X 1000 select expressions 229 / 259 0.0 228606373.0 0.1X 2500 select expressions 530 / 554 0.0 529938178.0 0.0X ``` After ``` parsing large select: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 1 select expressions 15 / 21 0.0 14978203.0 1.0X 10 select expressions 22 / 27 0.0 22492262.0 0.7X 100 select expressions 48 / 64 0.0 48449834.0 0.3X 1000 select expressions 189 / 208 0.0 189346428.0 0.1X 2500 select expressions 429 / 449 0.0 428943897.0 0.0X ``` ### Author: Josh Rosen <joshrosen@databricks.com> Closes #13484 from JoshRosen/treenode-productiterator-map.	2016-06-03 13:53:02 -07:00
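A minimal sketch of the allocation pattern described in the commit above, in plain Scala. The object and method names are hypothetical; the sketch shows sizing the output array from `productArity` and filling it with a while loop instead of calling `productIterator.map(...).toArray`.

```scala
object ProductToArray {
  // Applies `f` to every field of `p`, writing results into an array that is
  // allocated once at its final size (no ArrayBuilder, no intermediate growth).
  def mapFields(p: Product)(f: Any => Any): Array[Any] = {
    val out = new Array[Any](p.productArity)
    val it = p.productIterator
    var i = 0
    while (it.hasNext) {
      out(i) = f(it.next())
      i += 1
    }
    out
  }
}
```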
Ioana Delaney	9e2eb13ca5	[SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws UnsupportedOperationException ## What changes were proposed in this pull request? Queries with scalar sub-query in the SELECT list run against a local, in-memory relation throw UnsupportedOperationException exception. Problem repro: ```SQL scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1") scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2") scala> sql("select (select min(c1) from t2) from t1").show() java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#62 [] at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:215) at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:62) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142) at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45) at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$37.applyOrElse(Optimizer.scala:1473) ``` The problem is specific to local, in memory relations. It is caused by rule ConvertToLocalRelation, which attempts to push down a scalar-subquery expression to the local tables. The solution prevents the rule to apply if Project references scalar subqueries. ## How was this patch tested? Added regression tests to SubquerySuite.scala Author: Ioana Delaney <ioanamdelaney@gmail.com> Closes #13418 from ioana-delaney/scalarSubV2.	
2016-06-03 12:04:27 -07:00
Wenchen Fan	190ff274fd	[SPARK-15494][SQL] encoder code cleanup ## What changes were proposed in this pull request? Our encoder framework has evolved a lot. This PR tries to clean up the code to make it more readable and to emphasise the concept that an encoder should be used as a container of serde expressions. 1. move validation logic to analyzer instead of encoder 2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore. 3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework. 4. Using `BoundReference` to represent an unresolved field in a deserializer expression is kind of weird, so this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expressions still use `BoundReference`; we can replace it with `GetColumnByOrdinal` in follow-ups) ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #13269 from cloud-fan/clean-encoder.	2016-06-03 00:43:02 -07:00
Sean Zhong	6dde27404c	[SPARK-15733][SQL] Makes the explain output less verbose by hiding some verbose output like None, null, empty List, and etc. ## What changes were proposed in this pull request? This PR makes the explain output less verbose by hiding some verbose output like `None`, `null`, empty List `[]`, empty set `{}`, and etc. Before change: ``` == Physical Plan == ExecutedCommand : +- ShowTablesCommand None, None ``` After change: ``` == Physical Plan == ExecutedCommand : +- ShowTablesCommand ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13470 from clockfly/verbose_breakdown_4.	2016-06-02 22:45:37 -07:00
Wenchen Fan	6323e4bd76	[SPARK-15732][SQL] better error message when use java reserved keyword as field name ## What changes were proposed in this pull request? When users create a case class and use java reserved keyword as field name, spark sql will generate illegal java code and throw exception at runtime. This PR checks the field names when building the encoder, and if illegal field names are used, throw exception immediately with a good error message. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #13485 from cloud-fan/java.	2016-06-02 18:13:04 -07:00
Andrew Or	d1c1fbc345	[SPARK-15715][SQL] Fix alter partition with storage information in Hive ## What changes were proposed in this pull request? This command didn't work for Hive tables. Now it does: ``` ALTER TABLE boxes PARTITION (width=3) SET SERDE 'com.sparkbricks.serde.ColumnarSerDe' WITH SERDEPROPERTIES ('compress'='true') ``` ## How was this patch tested? `HiveExternalCatalogSuite` Author: Andrew Or <andrew@databricks.com> Closes #13453 from andrewor14/alter-partition-storage.	2016-06-02 17:44:48 -07:00
Sean Zhong	985d532812	[SPARK-15734][SQL] Avoids printing internal row in explain output ## What changes were proposed in this pull request? This PR avoids printing internal rows in explain output for some operators. Before change: ``` scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3") scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#37,_2#38,_3#39] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]] ... == Physical Plan == LocalTableScan [_1#37,_2#38,_3#39] ``` After change: ``` scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#58,_2#59,_3#60] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#58,_2#59,_3#60] ... == Physical Plan == LocalTableScan <empty>, [_1#58,_2#59,_3#60] ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13471 from clockfly/verbose_breakdown_5.	2016-06-02 16:21:33 -07:00
Sameer Agarwal	09b3c56c91	[SPARK-14752][SQL] Explicitly implement KryoSerialization for LazilyGenerateOrdering ## What changes were proposed in this pull request? This patch fixes a number of `com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException` exceptions reported in [SPARK-15604], [SPARK-14752] etc. (while executing sparkSQL queries with the kryo serializer) by explicitly implementing `KryoSerialization` for `LazilyGenerateOrdering`. ## How was this patch tested? 1. Modified `OrderingSuite` so that all tests in the suite also test kryo serialization (for both interpreted and generated ordering). 2. Manually verified TPC-DS q1. Author: Sameer Agarwal <sameer@databricks.com> Closes #13466 from sameeragarwal/kryo.	2016-06-02 10:58:00 -07:00
Dongjoon Hyun	63b7f127ca	[SPARK-15076][SQL] Add ReorderAssociativeOperator optimizer ## What changes were proposed in this pull request? This PR adds a new optimizer `ReorderAssociativeOperator` by taking advantage of the integral associative property. Currently, Spark works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This PR can handle Case 2 for Add/Multiply expressions whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following compares the plans `before` and `after` this change. Before ```scala scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(((((((((a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain == Physical Plan == Project [(((((((((a#18 * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19] +- Generate explode([1]), false, false, [a#18] +- Scan OneRowRelation[] ``` After ```scala scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain == Physical Plan == Project [(a#18 * 362880) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19] +- Generate explode([1]), false, false, [a#18] +- Scan OneRowRelation[] ``` This PR is greatly generalized by cloud-fan 's key ideas; he should be credited for the work he did. ## How was this patch tested? 
Pass the Jenkins tests including new testsuite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12850 from dongjoon-hyun/SPARK-15076.	2016-06-02 09:48:58 -07:00
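A minimal sketch of the reordering idea from the commit above, on a toy expression tree rather than Catalyst expressions. `Expr`, `Lit`, `Ref`, `Add`, and `reorder` are hypothetical names, and only addition over `Long` is modelled; the sketch flattens an addition chain, folds the literals into one constant, and rebuilds the tree.

```scala
object ReorderAdd {
  sealed trait Expr
  final case class Lit(value: Long) extends Expr
  final case class Ref(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Collects the operands of a nested chain of Add nodes, left to right.
  private def flatten(e: Expr): Seq[Expr] = e match {
    case Add(l, r) => flatten(l) ++ flatten(r)
    case other => Seq(other)
  }

  // Folds all literal operands into one constant. Integral addition wraps on
  // overflow, so regrouping the operands does not change the result.
  def reorder(e: Expr): Expr = {
    val (lits, rest) = flatten(e).partition(_.isInstanceOf[Lit])
    if (lits.size < 2) e
    else {
      val folded = Lit(lits.collect { case Lit(v) => v }.sum)
      if (rest.isEmpty) folded
      else Add(rest.reduceLeft[Expr]((l, r) => Add(l, r)), folded)
    }
  }
}
```

The same shape would apply to multiplication; floating-point types are left out because their arithmetic is not associative.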
Takeshi YAMAMURO	5eea332307	[SPARK-13484][SQL] Prevent illegal NULL propagation when filtering outer-join results ## What changes were proposed in this pull request? This PR add a rule at the end of analyzer to correct nullable fields of attributes in a logical plan by using nullable fields of the corresponding attributes in its children logical plans (these plans generate the input rows). This is another approach for addressing SPARK-13484 (the first approach is https://github.com/apache/spark/pull/11371). Close #113711 Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Author: Yin Huai <yhuai@databricks.com> Closes #13290 from yhuai/SPARK-13484.	2016-06-01 22:23:00 -07:00
jerryshao	8288e16a5a	[SPARK-15620][SQL] Fix transformed dataset attributes resolve failure ## What changes were proposed in this pull request? A join on a transformed dataset has attribute conflicts, which cause query execution to fail, for example: ``` val dataset = Seq(1, 2, 3).toDS val mappedDs = dataset.map(_ + 1) mappedDs.as("t1").joinWith(mappedDs.as("t2"), $"t1.value" === $"t2.value").show() ``` will throw an exception: ``` org.apache.spark.sql.AnalysisException: cannot resolve '`t1.value`' given input columns: [value]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) ``` ## How was this patch tested? Unit test. Author: jerryshao <sshao@hortonworks.com> Closes #13399 from jerryshao/SPARK-15620.	2016-06-01 21:58:05 -07:00
Sean Zhong	c8fb776d4a	[SPARK-15692][SQL] Improves the explain output of several physical plans by displaying embedded logical plan in tree style ## What changes were proposed in this pull request? Improves the explain output of several physical plans by displaying the embedded logical plan in tree style. Some physical plans contain an embedded logical plan, for example, `cache tableName query` maps to: ``` case class CacheTableCommand( tableName: String, plan: Option[LogicalPlan], isLazy: Boolean) extends RunnableCommand ``` It is easier to read the explain output if we can display the `plan` in tree style. Before change: Everything is crammed into one line. ``` scala> Seq((1,2)).toDF().createOrReplaceTempView("testView") scala> spark.sql("cache table testView2 select * from testView").explain() == Physical Plan == ExecutedCommand CacheTableCommand testView2, Some('Project [*] +- 'UnresolvedRelation `testView`, None ), false ``` After change: ``` scala> spark.sql("cache table testView2 select * from testView").explain() == Physical Plan == ExecutedCommand : +- CacheTableCommand testView2, false : : +- 'Project [*] : : +- 'UnresolvedRelation `testView`, None ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13433 from clockfly/verbose_breakdown_3_2.	2016-06-01 17:03:39 -07:00
Wenchen Fan	8640cdb836	[SPARK-15441][SQL] support null object in Dataset outer-join ## What changes were proposed in this pull request? Currently we can't encode a top-level null object into an internal row, as Spark SQL doesn't allow a row to be null; only its columns can be null. This was not a problem before, as we assumed the input object is never null. However, for outer join, we do need the semantics of a null object. This PR fixes this problem by making both join sides produce a single column, i.e. nest the logical plan output (by `CreateStruct`), so that we have an extra level to represent a top-level null object. ## How was this patch tested? new test in `DatasetSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13425 from cloud-fan/outer-join2.	2016-06-01 16:16:54 -07:00
Cheng Lian	7bb64aae27	[SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables. This PR is an alternative to #13120 authored by xwu0226. ## What changes were proposed in this pull request? When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external). This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table. Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround. [1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408 ## How was this patch tested? 1. A new test case is added in `HiveQuerySuite` for this case 2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.) Author: Cheng Lian <lian@databricks.com> Closes #13270 from liancheng/spark-15269-unpleasant-fix.	2016-06-01 16:02:27 -07:00
Reynold Xin	a71d1364ae	[SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming ## What changes were proposed in this pull request? This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them. ## How was this patch tested? Updated tests to reflect the moves. Author: Reynold Xin <rxin@databricks.com> Closes #13429 from rxin/SPARK-15686.	2016-06-01 10:14:40 -07:00
Sean Zhong	d5012c2740	[SPARK-15495][SQL] Improve the explain output for Aggregation operator ## What changes were proposed in this pull request? This PR improves the explain output of Aggregator operator. SQL: ``` Seq((1,2,3)).toDF("a", "b", "c").createTempView("df1") spark.sql("cache table df1") spark.sql("select count(a), count(c), b from df1 group by b").explain() ``` Before change: ``` TungstenAggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8]) +- Exchange hashpartitioning(b#8, 200), None +- TungstenAggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L]) +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1) ``` After change: ``` Aggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8]) +- Exchange hashpartitioning(b#8, 200), None +- Aggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L]) +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1) ``` ## How was this patch tested? Manual test and existing UT. Author: Sean Zhong <seanzhong@databricks.com> Closes #13363 from clockfly/verbose3.	2016-06-01 09:58:01 -07:00
jerryshao	e4ce1bc4f3	[SPARK-15659][SQL] Ensure FileSystem is gotten from path ## What changes were proposed in this pull request? Currently `spark.sql.warehouse.dir` is pointed to local dir by default, which will throw exception when HADOOP_CONF_DIR is configured and default FS is hdfs. ``` java.lang.IllegalArgumentException: Wrong FS: file:/Users/sshao/projects/apache-spark/spark-warehouse, expected: hdfs://localhost:8020 ``` So we should always get the `FileSystem` from `Path` to avoid wrong FS problem. ## How was this patch tested? Local test. Author: jerryshao <sshao@hortonworks.com> Closes #13405 from jerryshao/SPARK-15659.	2016-06-01 08:28:19 -05:00
Eric Liang	93e97147eb	[MINOR] Slightly better error message when attempting to query hive tables w/in-mem catalog andrewor14 Author: Eric Liang <ekl@databricks.com> Closes #13427 from ericl/better-error-msg.	2016-05-31 17:39:03 -07:00
Josh Rosen	8ca01a6feb	[SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues ## What changes were proposed in this pull request? In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase. The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying. In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful). ## How was this patch tested? This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds. Author: Josh Rosen <joshrosen@databricks.com> Closes #13421 from JoshRosen/disable-line-comments-in-codegen.	2016-05-31 17:30:03 -07:00
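The long-term idea mentioned in SPARK-15680, passing a `StringBuilder` down the tree instead of concatenating strings at every level, can be sketched as follows. This is a minimal illustration, not Spark's code: `Node`, `Leaf`, `Plus`, `render` and `renderInto` are hypothetical names made up for the example.

```scala
// Hypothetical expression tree used only to contrast two rendering styles.
object TreeStringSketch {
  sealed trait Node
  case class Leaf(name: String) extends Node
  case class Plus(left: Node, right: Node) extends Node

  // Naive rendering: every level allocates a new String and copies
  // the already rendered children into it.
  def render(n: Node): String = n match {
    case Leaf(name) => name
    case Plus(l, r) => "(" + render(l) + " + " + render(r) + ")"
  }

  // Accumulating rendering: one shared StringBuilder is passed down the
  // tree, so each character is appended once.
  def renderInto(n: Node, sb: StringBuilder): Unit = n match {
    case Leaf(name) =>
      sb.append(name)
    case Plus(l, r) =>
      sb.append('(')
      renderInto(l, sb)
      sb.append(" + ")
      renderInto(r, sb)
      sb.append(')')
  }
}
```

Both functions produce the same text; the difference is only in how many intermediate strings are allocated, which is what matters for very deep or wide expressions.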
Tathagata Das	90b11439b3	[SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structured Streaming ## What changes were proposed in this pull request? Currently structured streaming only supports append output mode. This PR adds the following. - Added support for Complete output mode in the internal state store, analyzer and planner. - Added public API in Scala and Python for users to specify output mode - Added checks for unsupported combinations of output mode and DF operations - Plans with no aggregation should support only Append mode - Plans with aggregation should support only Update and Complete modes - Default output mode is Append mode (Question: should we change this to automatically set to Complete mode when there is aggregation?) - Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append, complete, and update. But from the public API only Complete and Append output modes are supported. ## How was this patch tested? Unit tests in various test suites - StreamingAggregationSuite: tests for complete mode - MemorySinkSuite: tests for checking behavior in Append and Complete modes. - UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes - DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs - Python doc test and existing unit tests modified to call write.outputMode. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13286 from tdas/complete-mode.	2016-05-31 15:57:01 -07:00
Dilip Biswal	dfe2cbeb43	[SPARK-15557] [SQL] cast the string into DoubleType when it's used together with decimal In this case, the result type of the expression becomes DECIMAL(38, 36) as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArthmaticExpression`. I think we need to cast the string literals to Double type instead. I looked at the history and found that this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double. To double check i ran the query against hive, mysql. This query returns non NULL result for both the databases and both promote the expression to use double. Here is the output. - Hive ```SQL hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1; OK hive> describe l2; OK _c0 double ``` - MySQL ```SQL mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test; Query OK, 1 row affected (0.01 sec) Records: 1 Duplicates: 0 Warnings: 0 mysql> describe foo2; +-----------------------------------+--------+------+-----+---------+-------+ \| Field \| Type \| Null \| Key \| Default \| Extra \| +-----------------------------------+--------+------+-----+---------+-------+ \| (cast(99 as decimal(19,6)) + '2') \| double \| NO \| \| 0 \| \| +-----------------------------------+--------+------+-----+---------+-------+ ``` ## How was this patch tested? Added a new test in SQLQuerySuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #13368 from dilipbiswal/spark-15557.	2016-05-31 15:49:45 -07:00
Davies Liu	2df6ca848e	[SPARK-15327] [SQL] fix split expression in whole stage codegen ## What changes were proposed in this pull request? Right now, we split the code for expressions into multiple functions when it exceeds 64k, which requires that the expressions use a Row object. This is not true for whole-stage codegen, where the code fails to compile after being split. This PR will not split the code in whole-stage codegen. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13235 from davies/fix_nested_codegen.	2016-05-31 15:36:02 -07:00
Yin Huai	c6de5832bf	[SPARK-15622][SQL] Wrap the parent classloader of Janino's classloader in the ParentClassLoader. ## What changes were proposed in this pull request? At https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, Janino's classloader throws the exception when its parent throws a ClassNotFoundException with a cause set. However, it does not throw the exception when there is no cause set. It seems we need to use a special ClassLoader to wrap the actual parent classloader set to Janino to handle this behavior. ## How was this patch tested? I have reverted the workaround made by https://issues.apache.org/jira/browse/SPARK-11636 ( https://github.com/apache/spark/compare/master...yhuai:SPARK-15622?expand=1#diff-bb538fda94224dd0af01d0fd7e1b4ea0R81) and `test-only *ReplSuite -- -z "SPARK-2576 importing implicits"` still passes the test (without the change in `CodeGenerator`, this test does not pass with the change in `ExecutorClassLoader`). Author: Yin Huai <yhuai@databricks.com> Closes #13366 from yhuai/SPARK-15622.	2016-05-31 12:30:34 -07:00
Wenchen Fan	2bfed1a0c5	[SPARK-15658][SQL] UDT serializer should declare its data type as udt instead of udt.sqlType ## What changes were proposed in this pull request? When we build a serializer for a UDT object, we should declare its data type as udt instead of udt.sqlType; otherwise, if we deserialize it again, we lose the information that it's a UDT object and throw an analysis exception. ## How was this patch tested? new test in `UserDefinedTypeSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13402 from cloud-fan/udt.	2016-05-31 11:00:38 -07:00
gatorsmile	d67c82e4b6	[SPARK-15647][SQL] Fix Boundary Cases in OptimizeCodegen Rule #### What changes were proposed in this pull request? The following condition in the Optimizer rule `OptimizeCodegen` is not right. ```Scala branches.size < conf.maxCaseBranchesForCodegen ``` - The number of branches in case when clause should be `branches.size + elseBranch.size`. - `maxCaseBranchesForCodegen` is the maximum boundary for enabling codegen. Thus, we should use `<=` instead of `<`. This PR is to fix this boundary case and also add missing test cases for verifying the conf `MAX_CASES_BRANCHES`. #### How was this patch tested? Added test cases in `SQLConfSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13392 from gatorsmile/maxCaseWhen.	2016-05-31 10:08:00 -07:00
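The corrected boundary check from SPARK-15647 amounts to counting the optional ELSE branch and treating the limit as inclusive. A minimal sketch, assuming a hypothetical helper `canCodegen` (the real rule works on Catalyst's `CaseWhen` expression and the SQL conf, which are not reproduced here):

```scala
object CaseWhenLimitSketch {
  // Counts every WHEN branch plus the optional ELSE branch and compares
  // the total against an inclusive maximum (<=, not <).
  def canCodegen(numWhenBranches: Int, hasElse: Boolean, maxBranches: Int): Boolean = {
    val total = numWhenBranches + (if (hasElse) 1 else 0)
    total <= maxBranches
  }
}
```

For a limit of 20, a CASE WHEN with 19 branches plus an ELSE is still eligible, while 20 branches plus an ELSE is not.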
Takeshi YAMAMURO	95db8a44f3	[SPARK-15528][SQL] Fix race condition in NumberConverter ## What changes were proposed in this pull request? A local variable in NumberConverter is wrongly shared between threads. This pr fixes the race condition. ## How was this patch tested? Manually checked. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13391 from maropu/SPARK-15528.	2016-05-31 07:25:16 -05:00
Reynold Xin	675921040e	[SPARK-15638][SQL] Audit Dataset, SparkSession, and SQLContext ## What changes were proposed in this pull request? This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13370 from rxin/SPARK-15638.	2016-05-30 22:47:58 -07:00
Cheng Lian	1360a6d636	[SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema ## What changes were proposed in this pull request? `EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema. However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter does change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data. This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs. ## How was this patch tested? New test case added in `DatasetSuite`. [1]: https://issues.apache.org/jira/browse/SPARK-15632 Author: Cheng Lian <lian@databricks.com> Closes #13362 from liancheng/spark-15112-corrupted-filter.	2016-05-29 23:19:12 -07:00
Sean Owen	ce1572d16f	[MINOR] Resolve a number of miscellaneous build warnings ## What changes were proposed in this pull request? This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately. ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #13377 from srowen/BuildWarnings.	2016-05-29 16:48:14 -05:00
Reynold Xin	472f16181d	[SPARK-15636][SQL] Make aggregate expressions more concise in explain ## What changes were proposed in this pull request? This patch reduces the verbosity of aggregate expressions in explain (but does not actually remove any information). As an example, for the following command: ``` spark.range(10).selectExpr("sum(id) + 1", "count(distinct id)").explain(true) ``` Output before this patch: ``` == Physical Plan == TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false),(count(id#0L),mode=Final,isDistinct=true)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L]) +- Exchange SinglePartition, None +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false),(count(id#0L),mode=Partial,isDistinct=true)], output=[sum#18L,count#21L]) +- TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false)], output=[id#0L,sum#18L]) +- Exchange hashpartitioning(id#0L, 5), None +- TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[id#0L,sum#18L]) +- Range (0, 10, splits=2) ``` Output after this patch: ``` == Physical Plan == TungstenAggregate(key=[], functions=[sum(id#0L),count(distinct id#0L)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L]) +- Exchange SinglePartition, None +- TungstenAggregate(key=[], functions=[merge_sum(id#0L),partial_count(distinct id#0L)], output=[sum#18L,count#21L]) +- TungstenAggregate(key=[id#0L], functions=[merge_sum(id#0L)], output=[id#0L,sum#18L]) +- Exchange hashpartitioning(id#0L, 5), None +- TungstenAggregate(key=[id#0L], functions=[partial_sum(id#0L)], output=[id#0L,sum#18L]) +- Range (0, 10, splits=2) ``` Note the change from `(sum(id#0L),mode=PartialMerge,isDistinct=false)` to `merge_sum(id#0L)`. In general aggregate explain is still very verbose, but further work will be done as follow-up pull requests. ## How was this patch tested? Tested manually. 
Author: Reynold Xin <rxin@databricks.com> Closes #13367 from rxin/SPARK-15636.	2016-05-28 14:14:36 -07:00
Liang-Chi Hsieh	f1b220eeee	[SPARK-15553][SQL] Dataset.createTempView should use CreateViewCommand ## What changes were proposed in this pull request? Let `Dataset.createTempView` and `Dataset.createOrReplaceTempView` use `CreateViewCommand`, rather than calling `SparkSession.createTempView`. Besides, this patch also removes `SparkSession.createTempView`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13327 from viirya/dataset-createtempview.	2016-05-27 21:24:08 -07:00
Zheng RuiFeng	6b1a6180e7	[MINOR] Fix Typos 'a -> an' ## What changes were proposed in this pull request? `a` -> `an` I use regex to generate potential error lines: `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala` and review them line by line. ## How was this patch tested? local build `lint-java` checking Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13317 from zhengruifeng/a_an.	2016-05-26 22:39:14 -07:00
Sean Zhong	b5859e0bb8	[SPARK-13445][SQL] Improves error message and add test coverage for Window function ## What changes were proposed in this pull request? Add more verbose error message when order by clause is missed when using Window function. ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13333 from clockfly/spark-13445.	2016-05-26 14:50:00 -07:00
Andrew Or	ee682fe293	[SPARK-15534][SPARK-15535][SQL] Truncate table fixes ## What changes were proposed in this pull request? Two changes: - When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions. - Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #13302 from andrewor14/truncate-table.	2016-05-25 15:08:39 -07:00
lfzCarlosC	02c8072eea	[MINOR][MLLIB][STREAMING][SQL] Fix typos fixed typos for source code for components [mllib] [streaming] and [SQL] None and obvious. Author: lfzCarlosC <lfz.carlos@gmail.com> Closes #13298 from lfzCarlosC/master.	2016-05-25 10:53:57 -07:00
Reynold Xin	4f27b8dd58	[SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions ## What changes were proposed in this pull request? This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly. ## How was this patch tested? Created a new SparkSqlParserSuite. Author: Reynold Xin <rxin@databricks.com> Closes #13292 from rxin/SPARK-15436.	2016-05-25 19:17:53 +02:00
Wenchen Fan	50b660d725	[SPARK-15498][TESTS] fix slow tests ## What changes were proposed in this pull request? This PR fixes 3 slow tests: 1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more "unit". 2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use smaller data size. 3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improving `CodeFormatter.format` (introduced at https://github.com/apache/spark/pull/12979) can dramatically speed it up. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #13273 from cloud-fan/test.	2016-05-24 21:23:39 -07:00
Dongjoon Hyun	f08bf587b1	[SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException ## What changes were proposed in this pull request? Previously, SPARK-8893 added the constraints on positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part for that and adds explicit two testcases. Before ```scala scala> sc.parallelize(1 to 5).coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> sc.parallelize(1 to 5).repartition(0).collect() res1: Array[Int] = Array() // empty scala> spark.sql("select 1").coalesce(0) res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int] scala> spark.sql("select 1").coalesce(0).collect() java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. scala> spark.sql("select 1").repartition(0) res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int] scala> spark.sql("select 1").repartition(0).collect() res4: Array[org.apache.spark.sql.Row] = Array() // empty ``` After ```scala scala> sc.parallelize(1 to 5).coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> sc.parallelize(1 to 5).repartition(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> spark.sql("select 1").coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> spark.sql("select 1").repartition(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... ``` ## How was this patch tested? Pass the Jenkins tests with new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13282 from dongjoon-hyun/SPARK-15512.	2016-05-24 18:55:23 -07:00
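The error message shown in SPARK-15512 has the shape produced by Scala's `require`, which prefixes its message with `requirement failed: `. A minimal sketch of such a guard, where `checkedPartitions` is a hypothetical helper and not a Spark method:

```scala
object PartitionCheckSketch {
  // Rejects non-positive partition counts eagerly, before any work is planned.
  def checkedPartitions(numPartitions: Int): Int = {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    numPartitions
  }
}
```

Failing at call time, rather than returning an empty result later, is the behavior change the "After" transcript illustrates for `repartition(0)`.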
Dongjoon Hyun	f8763b80ec	[SPARK-13135] [SQL] Don't print expressions recursively in generated code ## What changes were proposed in this pull request? This PR is an up-to-date and a little bit improved version of #11019 of rxin for - (1) preventing recursive printing of expressions in generated code. Since the major function of this PR is indeed the above, he should be credited for the work he did. In addition to #11019, this PR improves the following in code generation. - (2) Improve multiline comment indentation. - (3) Reduce the number of empty lines (mainly consecutive empty lines). - (4) Remove all space characters on empty lines. Example ```scala spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6) ``` Before ``` Generated code: /* 001 */ public Object generate(Object[] references) { ... /* 005 */ /** /* 006 */ * Codegend pipeline for /* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 008 */ * +- Range 1, 1, 8, 999, [id#0L] /* 009 */ */ ... /* 075 */ // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 076 */ /* 077 */ // PRODUCE: Range 1, 1, 8, 999, [id#0L] /* 078 */ /* 079 */ // initialize Range ... /* 092 */ // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 093 */ /* 094 */ // CONSUME: WholeStageCodegen /* 095 */ /* 096 */ // (((input[0, bigint, false] + 1) + 2) + 3) /* 097 */ // ((input[0, bigint, false] + 1) + 2) /* 098 */ // (input[0, bigint, false] + 1) ... /* 107 */ // (((input[0, bigint, false] + 4) + 5) + 6) /* 108 */ // ((input[0, bigint, false] + 4) + 5) /* 109 */ // (input[0, bigint, false] + 4) ... /* 126 */ } ``` After ``` Generated code: /* 001 */ public Object generate(Object[] references) { ... 
/* 005 */ /** /* 006 */ * Codegend pipeline for /* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 008 */ * +- Range 1, 1, 8, 999, [id#0L] /* 009 */ */ ... /* 075 */ // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 076 */ // PRODUCE: Range 1, 1, 8, 999, [id#0L] /* 077 */ // initialize Range ... /* 090 */ // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 091 */ // CONSUME: WholeStageCodegen /* 092 */ // (((input[0, bigint, false] + 1) + 2) + 3) ... /* 101 */ // (((input[0, bigint, false] + 4) + 5) + 6) ... /* 118 */ } ``` ## How was this patch tested? Pass the Jenkins tests and see the result of the following command manually. ```scala scala> spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6).queryExecution.debug.codegen() ``` Author: Dongjoon Hyun <dongjoon@apache.org> Author: Reynold Xin <rxin@databricks.com> Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13192 from dongjoon-hyun/SPARK-13135.	2016-05-24 10:08:14 -07:00
Daoyuan Wang	d642b27354	[SPARK-15397][SQL] fix string udf locate as hive ## What changes were proposed in this pull request? In Hive, `locate("aa", "aaa", 0)` would yield 0, `locate("aa", "aaa", 1)` would yield 1 and `locate("aa", "aaa", 2)` would yield 2, while in Spark, `locate("aa", "aaa", 0)` would yield 1, `locate("aa", "aaa", 1)` would yield 2 and `locate("aa", "aaa", 2)` would yield 0. This results from a different understanding of the third parameter of the udf `locate`. It is the starting index, counted from 1, so when we pass 0, the result should always be 0. ## How was this patch tested? tested with modified `StringExpressionsSuite` and `StringFunctionsSuite` Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #13186 from adrian-wang/locate.	2016-05-23 23:29:15 -07:00
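The Hive-style semantics described in SPARK-15397 can be sketched in plain Scala. This is an illustration only: `LocateSketch.locate` is a hypothetical stand-in built on `String.indexOf`, it ignores NULL handling, and it is not Spark's implementation.

```scala
object LocateSketch {
  // Returns the 1-based position of the first occurrence of `substr` in
  // `str` at or after the 1-based `start`, or 0 when there is no match.
  // A start position below 1 is treated as "never matches".
  def locate(substr: String, str: String, start: Int = 1): Int = {
    if (start < 1) 0
    else str.indexOf(substr, start - 1) + 1
  }
}
```

With these rules, `locate("aa", "aaa", 0)`, `locate("aa", "aaa", 1)` and `locate("aa", "aaa", 2)` evaluate to 0, 1 and 2, the values the commit message attributes to Hive.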
Andrew Or	de726b0d53	Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB" This reverts commit `fa244e5a90`.	2016-05-23 21:43:11 -07:00
Kazuaki Ishizaki	fa244e5a90	[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB ## What changes were proposed in this pull request? This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method. ## How was this patch tested? Added new tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #13243 from kiszk/SPARK-15285.	2016-05-23 21:12:34 -07:00
gatorsmile	5afd927a47	[SPARK-15311][SQL] Disallow DML on Regular Tables when Using In-Memory Catalog #### What changes were proposed in this pull request? So far, when using In-Memory Catalog, we allow DDL operations for the tables. However, the corresponding DML operations are not supported for the tables that are neither temporary nor data source tables. For example, ```SQL CREATE TABLE tabName(i INT, j STRING) SELECT * FROM tabName INSERT OVERWRITE TABLE tabName SELECT 1, 'a' ``` In the above example, before this PR fix, we will get very confusing exception messages for either `SELECT` or `INSERT` ``` org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None; ``` This PR is to issue appropriate exceptions in this case. The message will be like ``` org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`; ``` #### How was this patch tested? Added a test case in `DDLSuite`. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13093 from gatorsmile/selectAfterCreate.	2016-05-23 18:03:45 -07:00
Xin Wu	01659bc50c	[SPARK-15431][SQL] Support LIST FILE(s)\|JAR(s) command natively ## What changes were proposed in this pull request? Currently command `ADD FILE\|JAR <filepath \| jarpath>` is supported natively in SparkSQL. However, when this command is run, the file/jar is added to the resources that can not be looked up by `LIST FILE(s)\|JAR(s)` command because the `LIST` command is passed to Hive command processor in Spark-SQL or simply not supported in Spark-shell. There is no way users can find out what files/jars are added to the spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) This PR is to support following commands: `LIST (FILE[s] [filepath ...] \| JAR[s] [jarfile ...])` ### For example: ##### LIST FILE(s) ``` scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false) +----------------------------------------------+ \|result \| +----------------------------------------------+ \|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt\| +----------------------------------------------+ scala> spark.sql("list files").show(false) +----------------------------------------------+ \|result \| +----------------------------------------------+ \|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt\| \|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt \| +----------------------------------------------+ ``` ##### LIST JAR(s) ``` scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar") res9: org.apache.spark.sql.DataFrame = [result: int] scala> spark.sql("list jar TestUDTF.jar").show(false) +---------------------------------------------+ \|result \| +---------------------------------------------+ 
\|spark://192.168.1.234:50131/jars/TestUDTF.jar\| +---------------------------------------------+ scala> spark.sql("list jars").show(false) +---------------------------------------------+ \|result \| +---------------------------------------------+ \|spark://192.168.1.234:50131/jars/TestUDTF.jar\| +---------------------------------------------+ ``` ## How was this patch tested? New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path. Author: Xin Wu <xinwu@us.ibm.com> Author: xin Wu <xinwu@us.ibm.com> Closes #13212 from xwu0226/list_command.	2016-05-23 17:32:01 -07:00
Dongjoon Hyun	37c617e4f5	[MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions ## What changes were proposed in this pull request? Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that. ## How was this patch tested? It's only about docs. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13087 from dongjoon-hyun/SPARK-15282.	2016-05-23 14:19:25 -07:00
Andrew Or	2585d2b322	[SPARK-15279][SQL] Catch conflicting SerDe when creating table ## What changes were proposed in this pull request? The user may do something like: ``` CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde' CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde' ``` None of these should be allowed because the SerDes conflict. As of this patch: - `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE` - `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE` ## How was this patch tested? New tests in `DDLCommandSuite`. Author: Andrew Or <andrew@databricks.com> Closes #13068 from andrewor14/row-format-conflict.	2016-05-23 11:55:03 -07:00
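The two compatibility rules listed in SPARK-15279 can be read as a small lookup table. The following is a minimal Python sketch of that reading, not Spark's parser code: the names `ALLOWED_FILE_FORMATS` and `check_row_format` are hypothetical, and treating an omitted clause as "no conflict" is an assumption made for illustration.

```python
from typing import Optional

# Compatibility rules as listed in the commit message above.
ALLOWED_FILE_FORMATS = {
    "delimited": {"textfile"},
    "serde": {"textfile", "rcfile", "sequencefile"},
}


def check_row_format(row_format: Optional[str], file_format: Optional[str]) -> None:
    """Raise ValueError when ROW FORMAT and STORED AS conflict.

    Hypothetical helper: `row_format` is "delimited" or "serde",
    `file_format` is the STORED AS name. If either clause is missing,
    this sketch assumes there is nothing to conflict with.
    """
    if row_format is None or file_format is None:
        return
    allowed = ALLOWED_FILE_FORMATS[row_format.lower()]
    if file_format.lower() not in allowed:
        raise ValueError(
            f"ROW FORMAT {row_format.upper()} is incompatible with "
            f"STORED AS {file_format.upper()}"
        )
```

Under these assumptions, `check_row_format("serde", "parquet")` and `check_row_format("delimited", "orc")` raise, while `check_row_format("serde", "rcfile")` passes silently.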
Wenchen Fan	07c36a2f07	[SPARK-15471][SQL] ScalaReflection cleanup ## What changes were proposed in this pull request? 1. simplify the logic of deserializing option type. 2. simplify the logic of serializing array type, and remove silentSchemaFor 3. remove some unnecessary code. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #13250 from cloud-fan/encoder.	2016-05-23 11:13:27 -07:00
wangyang	fc44b694bf	[SPARK-15379][SQL] check special invalid date ## What changes were proposed in this pull request? When an invalid date string like "2015-02-29 00:00:00" was cast as date or timestamp using Spark SQL, it did not return null but another valid date (2015-03-01 in this case). In this PR, invalid date strings like "2015-02-29" and "2016-04-31" are returned as null when cast as date or timestamp. ## How was this patch tested? Unit tests are added. Author: wangyang <wangyang@haizhi.com> Closes #13169 from wangyang1992/invalid_date.	2016-05-22 19:30:14 -07:00
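The behaviour described in SPARK-15379 (return null rather than roll over to the next valid date) can be illustrated outside Spark. The following is a minimal Python sketch, not Spark's implementation: the helper name `to_date_or_none` is hypothetical, and it relies on the standard library's `datetime.date` constructor rejecting impossible calendar dates.

```python
from datetime import date
from typing import Optional


def to_date_or_none(text: str) -> Optional[date]:
    """Parse 'YYYY-MM-DD[ hh:mm:ss]' and return None for impossible dates.

    Hypothetical helper for illustration only; it is not Spark code.
    """
    # Keep only the date part and ignore any trailing time component.
    day_part = text.strip().split(" ")[0]
    try:
        year, month, day = (int(p) for p in day_part.split("-"))
        # date() raises ValueError for dates such as February 29
        # in a non-leap year or April 31.
        return date(year, month, day)
    except ValueError:
        return None
```

In this sketch, `to_date_or_none("2015-02-29 00:00:00")` and `to_date_or_none("2016-04-31")` both yield `None` instead of silently becoming a neighbouring valid date.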
Bo Meng	72288fd67e	[SPARK-15468][SQL] fix some typos ## What changes were proposed in this pull request? Fix some typos found while browsing the code. ## How was this patch tested? None and obvious. Author: Bo Meng <mengbo@hotmail.com> Author: bomeng <bmeng@us.ibm.com> Closes #13246 from bomeng/typo.	2016-05-22 08:10:54 -05:00
Tathagata Das	1ffa608ba5	[SPARK-15428][SQL] Disable multiple streaming aggregations ## What changes were proposed in this pull request? Incrementalizing plans with multiple streaming aggregations is tricky, and we don't have the necessary support for "delta" to implement it correctly. This PR therefore disables support for multiple streaming aggregations. ## How was this patch tested? Additional unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13210 from tdas/SPARK-15428.	2016-05-22 02:08:18 -07:00
Reynold Xin	845e447fa0	[SPARK-15459][SQL] Make Range logical and physical explain consistent ## What changes were proposed in this pull request? This patch simplifies the implementation of the Range operator and makes the explain string consistent between the logical plan and the physical plan. To do this, I changed RangeExec to embed a Range logical plan in it. Before this patch (note that the logical Range and physical Range actually output different information): ``` == Optimized Logical Plan == Range 0, 100, 2, 2, [id#8L] == Physical Plan == Range 0, 2, 2, 50, [id#8L] ``` After this patch: If step size is 1: ``` == Optimized Logical Plan == Range(0, 100, splits=2) == Physical Plan == Range(0, 100, splits=2) ``` If step size is not 1: ``` == Optimized Logical Plan == Range (0, 100, step=2, splits=2) == Physical Plan == *Range (0, 100, step=2, splits=2) ``` ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13239 from rxin/SPARK-15459.	2016-05-22 00:03:37 -07:00
gatorsmile	a11175eeca	[SPARK-15312][SQL] Detect Duplicate Key in Partition Spec and Table Properties #### What changes were proposed in this pull request? When there are duplicate keys in the partition specs or table properties, we always use the last value and ignore all the previous values. This is caused by the function call `toMap`. Partition specs and table properties are widely used in multiple DDL statements. This PR detects the duplicates and issues an exception if any are found. #### How was this patch tested? Added test cases in DDLSuite Author: gatorsmile <gatorsmile@gmail.com> Closes #13095 from gatorsmile/detectDuplicate.	2016-05-21 23:56:10 -07:00
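The "last value wins" behaviour that SPARK-15312 attributes to `toMap` has a direct analogue in Python's `dict`, which makes the problem easy to reproduce. The following is a minimal sketch of a duplicate check, not the code from the PR: `to_map_checked` and its error message are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def to_map_checked(pairs: List[Tuple[str, str]], what: str) -> Dict[str, str]:
    """Build a dict from key/value pairs, failing on repeated keys.

    Hypothetical helper; `what` only labels the error message
    (for example "partition spec" or "table properties").
    """
    counts = Counter(key for key, _ in pairs)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Found duplicate keys in {what}: {', '.join(duplicates)}")
    return dict(pairs)


# Without the check, earlier values are silently dropped:
silently_merged = dict([("ds", "2016-05-20"), ("ds", "2016-05-21")])
```

Here `silently_merged` keeps only the second `ds` value, whereas `to_map_checked` on the same input raises instead of discarding data.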
Reynold Xin	6d0bfb9601	Small documentation and style fix.	2016-05-21 23:12:56 -07:00
Jurriaan Pruis	223f633908	[SPARK-15415][SQL] Fix BroadcastHint when autoBroadcastJoinThreshold is 0 or -1 ## What changes were proposed in this pull request? This PR makes BroadcastHint more deterministic by using a special isBroadcastable property instead of setting the sizeInBytes to 1. See https://issues.apache.org/jira/browse/SPARK-15415 ## How was this patch tested? Added testcases to test if the broadcast hash join is included in the plan when the BroadcastHint is supplied and also tests for propagation of the joins. Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13244 from jurriaan/broadcast-hint.	2016-05-21 23:01:14 -07:00
gatorsmile	8f0a3d5bcb	[SPARK-15330][SQL] Implement Reset Command #### What changes were proposed in this pull request? Like the `Set` command, `Reset` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202 This PR implements such a command for resetting the SQL-related configuration to the default values. One of the use cases shown in HIVE-3202 is listed below: > For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query. #### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13121 from gatorsmile/resetCommand.	2016-05-21 20:07:34 -07:00
Dilip Biswal	5e1ee28984	[SPARK-15114][SQL] Column name generated by typed aggregate is super verbose ## What changes were proposed in this pull request? Generate a shorter default alias for `AggregateExpression `, In this PR, aggregate function name along with a index is used for generating the alias name. ```SQL val ds = Seq(1, 3, 2, 5).toDS() ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)).show() ``` Output before change. ```SQL +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+ \|typedsumdouble(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), upcast(value))\|typedaverage(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), newInstance(class scala.Tuple2))\| +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+ \| 11.0\| 2.75\| +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+ ``` Output after change: ```SQL +-----------------+---------------+ \|typedsumdouble_c1\|typedaverage_c2\| +-----------------+---------------+ \| 11.0\| 2.75\| +-----------------+---------------+ ``` Note: There is one test in ParquetSuites.scala which shows that that the system picked alias name is not usable and is rejected. 
[test](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L672-#L687) ## How was this patch tested? A new test was added in DataSetAggregatorSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #13045 from dilipbiswal/spark-15114.	2016-05-21 08:36:08 -07:00
Dongjoon Hyun	f39621c998	[SPARK-15462][SQL][TEST] `unresolved === false` is enough in testcases. ## What changes were proposed in this pull request? In the `catalyst` module alone, there are 8 evaluation test cases on unresolved expressions. In real-world situations, however, those cases don't happen, since such expressions raise exceptions before evaluation. ```scala scala> sql("select format_number(null, 3)") res0: org.apache.spark.sql.DataFrame = [format_number(CAST(NULL AS DOUBLE), 3): string] scala> sql("select format_number(cast(null as NULL), 3)") org.apache.spark.sql.catalyst.parser.ParseException: DataType null() is not supported.(line 1, pos 34) ``` This PR makes those testcases more realistic. ```scala - checkEvaluation(FormatNumber(Literal.create(null, NullType), Literal(3)), null) + assert(FormatNumber(Literal.create(null, NullType), Literal(3)).resolved === false) ``` This PR also removes the redundant `resolved` checking in the `FoldablePropagation` optimizer. ## How was this patch tested? Pass the modified Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13241 from dongjoon-hyun/SPARK-15462.	2016-05-21 08:11:14 -07:00
Sandeep Singh	666bf2e835	[SPARK-15445][SQL] Build fails for java 1.7 after adding java.math.BigInteger support ## What changes were proposed in this pull request? Using longValue() and then checking whether the value is in the range for a long manually. ## How was this patch tested? Existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13223 from techaddict/SPARK-15445.	2016-05-21 06:39:47 -05:00
Davies Liu	0e70fd61b4	[SPARK-15438][SQL] improve explain of whole stage codegen ## What changes were proposed in this pull request? Currently, the explain of a query with whole-stage codegen looks like this ``` >>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain() == Physical Plan == WholeStageCodegen : +- Project [id#1L] : +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None : :- Range 0, 1, 4, 1000, [id#1L] : +- INPUT +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint])) +- WholeStageCodegen : +- Range 0, 1, 4, 1000, [id#4L] ``` The problem is that the plan looks very different from the logical plan, which makes it hard to understand (especially when the logical plan is not shown alongside it). This PR changes it to: ``` >>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain() == Physical Plan == Project [id#0L] +- BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None :- Range 0, 1, 4, 1000, [id#0L] +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- Range 0, 1, 4, 1000, [id#3L] ``` A `*` before an operator means that it is part of whole-stage codegen, which is easier to understand. ## How was this patch tested? Manually ran some queries and checked the explain output. Author: Davies Liu <davies@databricks.com> Closes #13204 from davies/explain_codegen.	2016-05-20 13:21:53 -07:00
Shixiong Zhu	dfa61f7b13	[SPARK-15190][SQL] Support using SQLUserDefinedType for case classes ## What changes were proposed in this pull request? Right now inferring the schema for case classes happens before searching the SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case classes doesn't work. This PR simply changes the inferring order to resolve it. I also reenabled the java.math.BigDecimal test and added two tests for `List`. ## How was this patch tested? `encodeDecodeTest(UDTCaseClass(new java.net.URI("http://spark.apache.org/")), "udt with case class")` Author: Shixiong Zhu <shixiong@databricks.com> Closes #12965 from zsxwing/SPARK-15190.	2016-05-20 12:38:46 -07:00
Kousuke Saruta	22947cd021	[SPARK-15165] [SPARK-15205] [SQL] Introduce place holder for comments in generated code ## What changes were proposed in this pull request? This PR introduces placeholders for comments in generated code. The purpose is the same as that of #12939, but this approach is much safer. Generated code to be compiled doesn't include actual comments but includes placeholders instead. Placeholders in generated code will be replaced with actual comments only at the time of logging. Also, this PR can resolve SPARK-15205. ## How was this patch tested? Existing tests. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12979 from sarutak/SPARK-15205.	2016-05-20 10:56:35 -07:00
Takuya UESHIN	2cbe96e64d	[SPARK-15400][SQL] CreateNamedStruct and CreateNamedStructUnsafe should preserve metadata of value expressions if it is NamedExpression. ## What changes were proposed in this pull request? `CreateNamedStruct` and `CreateNamedStructUnsafe` should preserve metadata of value expressions if it is `NamedExpression` like `CreateStruct` or `CreateStructUnsafe` are doing. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13193 from ueshin/issues/SPARK-15400.	2016-05-20 09:38:34 -07:00
Takuya UESHIN	d2e1aa97ef	[SPARK-15308][SQL] RowEncoder should preserve nested column name. ## What changes were proposed in this pull request? The following code generates wrong schema: ``` val schema = new StructType().add( "struct", new StructType() .add("i", IntegerType, nullable = false) .add( "s", new StructType().add("int", IntegerType, nullable = false), nullable = false), nullable = false) val ds = sqlContext.range(10).map(l => Row(l, Row(l)))(RowEncoder(schema)) ds.printSchema() ``` This should print as follows: ``` root \|-- struct: struct (nullable = false) \| \|-- i: integer (nullable = false) \| \|-- s: struct (nullable = false) \| \| \|-- int: integer (nullable = false) ``` but the result is: ``` root \|-- struct: struct (nullable = false) \| \|-- col1: integer (nullable = false) \| \|-- col2: struct (nullable = false) \| \| \|-- col1: integer (nullable = false) ``` This PR fixes `RowEncoder` to preserve nested column name. ## How was this patch tested? Existing tests and I added a test to check if `RowEncoder` preserves nested column name. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13090 from ueshin/issues/SPARK-15308.	2016-05-20 09:34:55 -07:00
Takuya UESHIN	d5e1c5acde	[SPARK-15313][SQL] EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject. ## What changes were proposed in this pull request? The following code: ``` val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS() ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_)) ``` throws an Exception: ``` org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) ... Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417] at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) ... ``` This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`. 
The analyzed and optimized plans of the above example are as follows: ``` == Analyzed Logical Plan == _1: string Project [_1#420] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421] +- Filter <function1>.apply +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2 +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] == Optimized Logical Plan == !Project [_1#420] +- Filter <function1>.apply +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] ``` This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`. The plans after this patch are as follows: ``` == Analyzed Logical Plan == _1: string Project [_1#420] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421] +- Filter <function1>.apply +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2 +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] == Optimized Logical Plan == Project [_1#416] +- Filter <function1>.apply +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] ``` ## How was this patch tested? Existing tests and I added a test to check if `filter and then select` works. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13096 from ueshin/issues/SPARK-15313.	2016-05-19 22:55:44 -07:00
Reynold Xin	3ba34d435c	[SPARK-14990][SQL] Fix checkForSameTypeInputExpr (ignore nullability) ## What changes were proposed in this pull request? This patch fixes a bug in TypeUtils.checkForSameTypeInputExpr. Previously the code was testing on strict equality, which does not take nullability into account. This is based on https://github.com/apache/spark/pull/12768. This patch fixed a bug there (with empty expression) and added a test case. ## How was this patch tested? Added a new test suite and test case. Closes #12768. Author: Reynold Xin <rxin@databricks.com> Author: Oleg Danilov <oleg.danilov@wandisco.com> Closes #13208 from rxin/SPARK-14990.	2016-05-19 22:14:10 -07:00
Kevin Yu	17591d90e6	[SPARK-11827][SQL] Adding java.math.BigInteger support in Java type inference for POJOs and Java collections Hello: Can you help check this PR? I am adding support for java.math.BigInteger in the Java bean code path. Internally, Spark converts BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala. I use a similar approach and convert the BigInteger to BigDecimal. Author: Kevin Yu <qyu@us.ibm.com> Closes #10125 from kevinyu98/working_on_spark-11827.	2016-05-20 12:41:14 +08:00
Sumedh Mungee	d5c47f8ff8	[SPARK-15321] Fix bug where Array[Timestamp] cannot be encoded/decoded correctly ## What changes were proposed in this pull request? Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case: `encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp") ` ... you will see that (without this fix) it fails with the following output: ``` - encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde * FAILED * Exception thrown while decoding Converted: [0,1000000010,800000001,52a7ccdc36800] Schema: value#61615 root -- value: array (nullable = true) \|-- element: timestamp (containsNull = true) Encoder: class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312) ``` ## How was this patch tested? Existing tests Author: Sumedh Mungee <smungee@gmail.com> Closes #13108 from smungee/fix-itemAccessorMethod.	2016-05-20 12:30:04 +08:00
Cheng Lian	6ac1c3a040	[SPARK-14346][SQL] Lists unsupported Hive features in SHOW CREATE TABLE output ## What changes were proposed in this pull request? This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message. ## How was this patch tested? Updated existing test case to check exception message. Author: Cheng Lian <lian@databricks.com> Closes #13173 from liancheng/spark-14346-follow-up.	2016-05-19 12:02:41 -07:00
Kousuke Saruta	faafd1e9db	[SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database directory. ## What changes were proposed in this pull request? After #12871 is fixed, we are forced to make `/user/hive/warehouse` when SimpleAnalyzer is used but SimpleAnalyzer may not need the directory. ## How was this patch tested? Manual test. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #13175 from sarutak/SPARK-15387.	2016-05-19 11:51:59 -07:00
gatorsmile	ef7a5e0bca	[SPARK-14603][SQL][FOLLOWUP] Verification of Metadata Operations by Session Catalog #### What changes were proposed in this pull request? This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385 The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135 For example, in PySpark, if we input the following statement: ```python >>> l = [('Alice', 1)] >>> df = sqlContext.createDataFrame(l) >>> df.createTempView("people") >>> df.createTempView("people") ``` Before this PR, the exception we will get is like ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView self._jdf.createTempView(name) File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco return f(a, *kw) File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView. 
: org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists; at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324) at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523) at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) ``` After this PR, the exception we will get become cleaner: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView self._jdf.createTempView(name) File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;" ``` #### How was this patch tested? Fixed an existing PySpark test case Author: gatorsmile <gatorsmile@gmail.com> Closes #13126 from gatorsmile/followup-14684.	2016-05-19 11:46:11 -07:00
Dongjoon Hyun	5907ebfc11	[SPARK-14939][SQL] Add FoldablePropagation optimizer ## What changes were proposed in this pull request? This PR aims to add new FoldablePropagation optimizer that propagates foldable expressions by replacing all attributes with the aliases of original foldable expression. Other optimizations will take advantage of the propagated foldable expressions: e.g. `EliminateSorts` optimizer now can handle the following Case 2 and 3. (Case 1 is the previous implementation.) 1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()" 2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3" 3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z" This PR has been generalized based on cloud-fan 's key ideas many times; he should be credited for the work he did. Before ``` scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain == Physical Plan == WholeStageCodegen : +- Sort [1.0#5 ASC,x#0 ASC], true, 0 : +- INPUT +- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None +- WholeStageCodegen : +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0] : +- INPUT +- Scan OneRowRelation[] ``` After ``` scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain == Physical Plan == WholeStageCodegen : +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0] : +- INPUT +- Scan OneRowRelation[] ``` ## How was this patch tested? Pass the Jenkins tests including a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12719 from dongjoon-hyun/SPARK-14939.	2016-05-19 15:57:44 +08:00
Wenchen Fan	661c21049b	[SPARK-15381] [SQL] physical object operator should define reference correctly ## What changes were proposed in this pull request? Whole Stage Codegen depends on `SparkPlan.reference` to do some optimization. For physical object operators, they should be consistent with their logical version and set the `reference` correctly. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #13167 from cloud-fan/bug.	2016-05-18 21:43:07 -07:00
Wenchen Fan	ebfe3a1f2c	[SPARK-15192][SQL] null check for SparkSession.createDataFrame ## What changes were proposed in this pull request? This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema. ## How was this patch tested? new tests in `DatasetSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13008 from cloud-fan/row-encoder.	2016-05-18 18:06:38 -07:00
Dongjoon Hyun	d2f81df1ba	[MINOR][SQL] Remove unused pattern matching variables in Optimizers. ## What changes were proposed in this pull request? This PR removes unused pattern matching variable in Optimizers in order to improve readability. ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13145 from dongjoon-hyun/remove_unused_pattern_matching_variables.	2016-05-18 11:51:50 +01:00
Yin Huai	2a5db9c140	[SPARK-14346] Fix scala-2.10 build ## What changes were proposed in this pull request? Scala 2.10 build was broken by #13079. I am reverting the change of that line. Author: Yin Huai <yhuai@databricks.com> Closes #13157 from yhuai/SPARK-14346-fix-scala2.10.	2016-05-17 18:02:31 -07:00
Cheng Lian	b674e67c22	[SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views ## What changes were proposed in this pull request? This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handlers. ## How was this patch tested? New test cases are added in `ShowCreateTableSuite` to do round-trip tests. Author: Cheng Lian <lian@databricks.com> Closes #13079 from liancheng/spark-14346-show-create-table-for-hive-tables.	2016-05-17 15:56:44 -07:00
Kousuke Saruta	c0c3ec3547	[SPARK-15165] [SQL] Codegen can break because toCommentSafeString is not actually safe ## What changes were proposed in this pull request? The toCommentSafeString method replaces "\u" with "\\\\u" to avoid codegen breaking. But if an even number of "\" is put before "u", like "\\\\u", in the string literal in the query, codegen can break. The following code causes a compilation error. ``` val df = Seq(...).toDF df.select("'\\\\\\\\u002A/'").show ``` The compilation error occurs because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of comment). Due to this unsafety, arbitrary code can be injected as follows. ``` val df = Seq(...).toDF // Inject "System.exit(1)" df.select("'\\\\\\\\u002A/{System.exit(1);}/'").show ``` ## How was this patch tested? Added new test cases. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Author: sarutak <sarutak@oss.nttdata.co.jp> Closes #12939 from sarutak/SPARK-15165.	2016-05-17 10:07:01 -07:00
Wenchen Fan	c36ca651f9	[SPARK-15351][SQL] RowEncoder should support array as the external type for ArrayType ## What changes were proposed in this pull request? This PR improves `RowEncoder` and `MapObjects`, to support array as the external type for `ArrayType`. The idea is straightforward, we use `Object` as the external input type for `ArrayType`, and determine its type at runtime in `MapObjects`. ## How was this patch tested? new test in `RowEncoderSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13138 from cloud-fan/map-object.	2016-05-17 17:02:52 +08:00
Sean Owen	122302cbf5	[SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags ## What changes were proposed in this pull request? (See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was apparently the problem last time.) Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags` ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13074 from srowen/SPARK-15290.	2016-05-17 09:55:53 +01:00
Reynold Xin	e1dc853737	[SPARK-15310][SQL] Rename HiveTypeCoercion -> TypeCoercion ## What changes were proposed in this pull request? We originally designed the type coercion rules to match Hive, but over time we have diverged. It does not make sense to call it HiveTypeCoercion anymore. This patch renames it TypeCoercion. ## How was this patch tested? Updated unit tests to reflect the rename. Author: Reynold Xin <rxin@databricks.com> Closes #13091 from rxin/SPARK-15310.	2016-05-13 00:15:39 -07:00
Reynold Xin	eda2800d44	[SPARK-14541][SQL] Support IFNULL, NULLIF, NVL and NVL2 ## What changes were proposed in this pull request? This patch adds support for a few SQL functions to improve compatibility with other databases: IFNULL, NULLIF, NVL and NVL2. In order to do this, this patch introduced a RuntimeReplaceable expression trait that allows replacing an unevaluable expression in the optimizer before evaluation. Note that the semantics are not completely identical to other databases in esoteric cases. ## How was this patch tested? Added a new test suite SQLCompatibilityFunctionSuite. Closes #12373. Author: Reynold Xin <rxin@databricks.com> Closes #13084 from rxin/SPARK-14541.	2016-05-12 22:18:39 -07:00
Reynold Xin	ba169c3230	[SPARK-15306][SQL] Move object expressions into expressions.objects package ## What changes were proposed in this pull request? This patch moves all the object related expressions into expressions.objects package, for better code organization. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13085 from rxin/SPARK-15306.	2016-05-12 21:35:14 -07:00
Herman van Hovell	bb1362eb3b	[SPARK-10605][SQL] Create native collect_list/collect_set aggregates ## What changes were proposed in this pull request? We currently use the Hive implementations for the collect_list/collect_set aggregate functions. This has a few major drawbacks: the use of HiveUDAF (which has quite a bit of overhead) and the lack of support for struct datatypes. This PR adds native implementations of these functions to Spark. The size of the collected list/set may vary; this means we cannot use the fast, Tungsten, aggregation path to perform the aggregation, and that we fall back to the slower sort-based path. Another big issue with these operators is that when the size of the collected list/set grows too large, we can start experiencing large GC pauses and OOMEs. The `collect` aggregates implemented in this PR rely on the sort-based aggregate path for correctness. They maintain their own internal buffer which holds the rows for one group at a time. The sort-based aggregation path is triggered by disabling `partialAggregation` for these aggregates (which is kinda funny); this technique is also employed in `org.apache.spark.sql.hive.HiveUDAFFunction`. I have done some performance testing: ```scala import org.apache.spark.sql.{Dataset, Row} sql("create function collect_list2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'") val df = range(0, 10000000).select($"id", (rand(213123L) * 100000).cast("int").as("grp")) df.select(countDistinct($"grp")).show def benchmark(name: String, plan: Dataset[Row], maxItr: Int = 5): Unit = { // Do not measure planning. plan.queryExecution.executedPlan // Execute the plan a number of times and average the result. 
val start = System.nanoTime var i = 0 while (i < maxItr) { plan.rdd.foreach(row => Unit) i += 1 } val time = (System.nanoTime - start) / (maxItr * 1000000L) println(s"[$name] $maxItr iterations completed in an average time of $time ms.") } val plan1 = df.groupBy($"grp").agg(collect_list($"id")) val plan2 = df.groupBy($"grp").agg(callUDF("collect_list2", $"id")) benchmark("Spark collect_list", plan1) ... > [Spark collect_list] 5 iterations completed in an average time of 3371 ms. benchmark("Hive collect_list", plan2) ... > [Hive collect_list] 5 iterations completed in an average time of 9109 ms. ``` Performance is improved by a factor 2-3. ## How was this patch tested? Added tests to `DataFrameAggregateSuite`. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12874 from hvanhovell/implode.	2016-05-12 13:56:00 -07:00
gatorsmile	be617f3d06	[SPARK-14684][SPARK-15277][SQL] Partition Spec Validation in SessionCatalog and Checking Partition Spec Existence Before Dropping #### What changes were proposed in this pull request? ~~Currently, multiple partitions are allowed to drop by using a single DDL command: Alter Table Drop Partition. However, the internal implementation could break atomicity. That means, we could just drop a subset of qualified partitions, if hitting an exception when dropping one of qualified partitions~~ ~~This PR contains the following behavior changes:~~ ~~- disallow dropping multiple partitions by a single command ~~ ~~- allow users to input predicates in partition specification and issue a nicer error message if the predicate's comparison operator is not `=`.~~ ~~- verify the partition spec in SessionCatalog. This can ensure each partition spec in `Drop Partition` does not correspond to multiple partitions.~~ This PR has two major parts: - Verify the partition spec in SessionCatalog for fixing the following issue: ```scala sql(s"ALTER TABLE $externalTab DROP PARTITION (ds='2008-04-09', unknownCol='12')") ``` The above example uses an invalid partition spec. Without this PR, we will drop all the partitions. The reason is that the Hive metastore's getPartitions API returns all the partitions if we provide an invalid spec. - Re-implemented the `dropPartitions` in `HiveClientImpl`. Now, we always check if all the user-specified partition specs exist before attempting to drop the partitions. Previously, we started dropping partitions before finishing checking the existence of all the partition specs. If any failure happens after we start to drop the partitions, we will log an error message to indicate which partitions have been dropped and which partitions have not been dropped. #### How was this patch tested? Modified the existing test cases and added new test cases. 
Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12801 from gatorsmile/banDropMultiPart.	2016-05-12 11:14:40 -07:00
Liang-Chi Hsieh	470de743ec	[SPARK-15094][SPARK-14803][SQL] Remove extra Project added in EliminateSerialization ## What changes were proposed in this pull request? We eliminate the pair of `DeserializeToObject` and `SerializeFromObject` in `Optimizer` and add an extra `Project`. However, when DeserializeToObject's outputObjectType is ObjectType and its cls can't be processed by unsafe project, it will fail. To fix it, we can simply remove the extra `Project` and replace the output attribute of `DeserializeToObject` in another rule. ## How was this patch tested? `DatasetSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12926 from viirya/fix-eliminate-serialization-projection.	2016-05-12 10:11:12 -07:00
Sean Zhong	33c6eb5218	[SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView ## What changes were proposed in this pull request? Deprecates registerTempTable and adds dataset.createTempView, dataset.createOrReplaceTempView. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzhong@databricks.com> Closes #12945 from clockfly/spark-15171.	2016-05-12 15:51:53 +08:00
Wenchen Fan	46991448aa	[SPARK-15160][SQL] support data source table in InMemoryCatalog ## What changes were proposed in this pull request? This PR adds a new rule to convert `SimpleCatalogRelation` to data source table if its table property contains data source information. ## How was this patch tested? new test in SQLQuerySuite Author: Wenchen Fan <wenchen@databricks.com> Closes #12935 from cloud-fan/ds-table.	2016-05-11 23:55:42 -07:00
Cheng Lian	f036dd7ce7	[SPARK-14346] SHOW CREATE TABLE for data source tables ## What changes were proposed in this pull request? This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s). To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature. Author: Cheng Lian <lian@databricks.com> Closes #12781 from liancheng/spark-14346-show-create-table.	2016-05-11 20:44:04 -07:00
Eric Liang	6d0368ab8d	[SPARK-15259] Sort time metric should not include spill and record insertion time ## What changes were proposed in this pull request? After SPARK-14669 it seems the sort time metric includes both spill and record insertion time. This makes it not very useful since the metric becomes close to the total execution time of the node. We should track just the time spent for in-memory sort, as before. ## How was this patch tested? Verified metric in the UI, also unit test on UnsafeExternalRowSorter. cc davies Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #13035 from ericl/fix-metrics.	2016-05-11 11:25:46 -07:00
Wenchen Fan	d8935db5ec	[SPARK-15241] [SPARK-15242] [SQL] fix 2 decimal-related issues in RowEncoder ## What changes were proposed in this pull request? SPARK-15241: We now support java decimal and catalyst decimal in external row, so it makes sense to also support scala decimal. SPARK-15242: This is a long-standing bug, and is exposed after https://github.com/apache/spark/pull/12364, which eliminates the `If` expression if the field is not nullable: ``` val fieldValue = serializerFor( GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)), f.dataType) if (f.nullable) { If( Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil), Literal.create(null, f.dataType), fieldValue) } else { fieldValue } ``` Previously, we always used `DecimalType.SYSTEM_DEFAULT` as the output type of the converted decimal field, which is wrong as it doesn't match the real decimal type. However, it worked well because we always put the converted field into an `If` expression to do the null check, and `If` uses its `trueValue`'s data type as its output type. Now if we have a non-nullable decimal field, then the converted field's output type will be `DecimalType.SYSTEM_DEFAULT`, and we will write wrong data into the unsafe row. The fix is simple: just use the given decimal type as the output type of the converted decimal field. These 2 issues were found at https://github.com/apache/spark/pull/13008 ## How was this patch tested? new tests in RowEncoderSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #13019 from cloud-fan/encoder-decimal.	2016-05-11 11:16:05 -07:00
Liang-Chi Hsieh	a5f9fdbba3	[SPARK-15268][SQL] Make JavaTypeInference work with UDTRegistration ## What changes were proposed in this pull request? We have a private `UDTRegistration` API to register user defined type. Currently `JavaTypeInference` can't work with it. So `SparkSession.createDataFrame` from a bean class will not correctly infer the schema of the bean class. ## How was this patch tested? `VectorUDTSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13046 from viirya/fix-udt-registry-javatypeinference.	2016-05-11 09:31:22 -07:00
Dongjoon Hyun	6655459606	[SPARK-15265][SQL][MINOR] Fix Union query error message indentation ## What changes were proposed in this pull request? This issue fixes the error message indentation consistently with other set queries (EXCEPT/INTERSECT). Before (4 lines) ``` scala> sql("(select 1) union (select 1, 2)").head org.apache.spark.sql.AnalysisException: Unions can only be performed on tables with the same number of columns, but one table has '2' columns and another table has '1' columns; ``` After (one-line) ``` scala> sql("(select 1) union (select 1, 2)").head org.apache.spark.sql.AnalysisException: Unions can only be performed on tables with the same number of columns, but one table has '2' columns and another table has '1' columns; ``` Reference (EXCEPT / INTERSECT) ``` scala> sql("(select 1) intersect (select 1, 2)").head org.apache.spark.sql.AnalysisException: Intersect can only be performed on tables with the same number of columns, but the left table has 1 columns and the right has 2; ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13043 from dongjoon-hyun/SPARK-15265.	2016-05-10 22:27:22 -07:00
Sandeep Singh	da02d006bb	[SPARK-15249][SQL] Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource see: TODO's here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L36 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala#L42 Existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13024 from techaddict/SPARK-15249.	2016-05-10 14:22:03 -07:00
Herman van Hovell	d28c67544b	[SPARK-14986][SQL] Return correct result for empty LATERAL VIEW OUTER ## What changes were proposed in this pull request? A Generate with the `outer` flag enabled should always return one or more rows for every input row. The optimizer currently violates this by rewriting `outer` Generates that do not contain columns of the child plan into an unjoined generate, for example: ```sql select e from a lateral view outer explode(a.b) as e ``` The result of this is that `outer` Generate does not produce output at all when the Generators' input expression is empty. This PR fixes this. ## How was this patch tested? Added test case to `SQLQuerySuite`. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12906 from hvanhovell/SPARK-14986.	2016-05-10 12:47:31 -07:00
gatorsmile	5c6b085578	[SPARK-14603][SQL] Verification of Metadata Operations by Session Catalog Since we cannot really trust if the underlying external catalog can throw exceptions when there is an invalid metadata operation, let's do it in SessionCatalog. - [X] The first step is to unify the error messages issued in Hive-specific Session Catalog and general Session Catalog. - [X] The second step is to verify the inputs of metadata operations for partitioning-related operations. This is moved to a separate PR: https://github.com/apache/spark/pull/12801 - [X] The third step is to add database existence verification in `SessionCatalog` - [X] The fourth step is to add table existence verification in `SessionCatalog` - [X] The fifth step is to add function existence verification in `SessionCatalog` Add test cases and verify the error messages we issued Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12385 from gatorsmile/verifySessionAPIs.	2016-05-10 11:25:55 -07:00
Herman van Hovell	2646265368	[SPARK-14773] [SPARK-15179] [SQL] Fix SQL building and enable Hive tests ## What changes were proposed in this pull request? This PR fixes SQL building for predicate subqueries and correlated scalar subqueries. It also enables most Hive subquery tests. ## How was this patch tested? Enabled new tests in HiveComparisionSuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12988 from hvanhovell/SPARK-14773.	2016-05-10 09:56:07 -07:00
gatorsmile	5706472670	[SPARK-15215][SQL] Fix Explain Parsing and Output #### What changes were proposed in this pull request? This PR is to address a few existing issues in `EXPLAIN`: - The `EXPLAIN` options `LOGICAL \| FORMATTED \| EXTENDED \| CODEGEN` should not match zero or more times; they should match zero or one time. The parser does not allow users to use more than one option in a single command. - The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command. - The output of `EXPLAIN ` contains a weird empty line when the output of analyzed plan is empty. We should remove it. For example: ``` == Parsed Logical Plan == CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false == Analyzed Logical Plan == CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false == Optimized Logical Plan == CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false ... ``` #### How was this patch tested? Added and modified a few test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #12991 from gatorsmile/explainCreateTable.	2016-05-10 11:53:37 +02:00
gatorsmile	f45379173b	[SPARK-15187][SQL] Disallow Dropping Default Database #### What changes were proposed in this pull request? In Hive Metastore, dropping default database is not allowed. However, in `InMemoryCatalog`, this is allowed. This PR is to disallow users to drop default database. #### How was this patch tested? Previously, we already have a test case in HiveDDLSuite. Now, we also add the same one in DDLSuite Author: gatorsmile <gatorsmile@gmail.com> Closes #12962 from gatorsmile/dropDefaultDB.	2016-05-10 11:57:01 +08:00
Andrew Or	8f932fb88d	[SPARK-15234][SQL] Fix spark.catalog.listDatabases.show() ## What changes were proposed in this pull request? Before: ``` scala> spark.catalog.listDatabases.show() +--------------------+-----------+-----------+ \| name\|description\|locationUri\| +--------------------+-----------+-----------+ \|Database[name='de...\| \|Database[name='my...\| \|Database[name='so...\| +--------------------+-----------+-----------+ ``` After: ``` +-------+--------------------+--------------------+ \| name\| description\| locationUri\| +-------+--------------------+--------------------+ \|default\|Default Hive data...\|file:/user/hive/w...\| \| my_db\| This is a database\|file:/Users/andre...\| \|some_db\| \|file:/private/var...\| +-------+--------------------+--------------------+ ``` ## How was this patch tested? New test in `CatalogSuite` Author: Andrew Or <andrew@databricks.com> Closes #13015 from andrewor14/catalog-show.	2016-05-09 20:02:23 -07:00
Josh Rosen	c3350cadb8	[SPARK-14972] Improve performance of JSON schema inference's compatibleType method This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema. The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass. This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collects and performs in-place sorting. I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods. Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage. Author: Josh Rosen <joshrosen@databricks.com> Closes #12750 from JoshRosen/schema-inference-speedups.	2016-05-09 13:11:18 -07:00
Zheng RuiFeng	dfdcab00c7	[SPARK-15210][SQL] Add missing @DeveloperApi annotation in sql.types add DeveloperApi annotation for `AbstractDataType` `MapType` `UserDefinedType` local build Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12982 from zhengruifeng/types_devapi.	2016-05-09 11:21:16 -07:00
Liang-Chi Hsieh	e083db2e9e	[SPARK-15225][SQL] Replace SQLContext with SparkSession in Encoder documentation `Encoder`'s doc mentions `sqlContext.implicits._`. We should use `sparkSession.implicits._` instead now. Only doc update. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13002 from viirya/encoder-doc.	2016-05-09 11:06:08 -07:00
Wenchen Fan	beb16ec556	[SPARK-15093][SQL] create/delete/rename directory for InMemoryCatalog operations if needed ## What changes were proposed in this pull request? The following operations now perform file system operations: 1. CREATE DATABASE: create a dir 2. DROP DATABASE: delete the dir 3. CREATE TABLE: create a dir 4. DROP TABLE: delete the dir 5. RENAME TABLE: rename the dir 6. CREATE PARTITIONS: create a dir 7. RENAME PARTITIONS: rename the dir 8. DROP PARTITIONS: drop the dir ## How was this patch tested? new tests in `ExternalCatalogSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12871 from cloud-fan/catalog.	2016-05-09 10:47:45 -07:00
Ryan Blue	652bbb1bf6	[SPARK-14459][SQL] Detect relation partitioning and adjust the logical plan ## What changes were proposed in this pull request? This detects a relation's partitioning and adds checks to the analyzer. If an InsertIntoTable node has no partitioning, it is replaced by the relation's partition scheme and input columns are correctly adjusted, placing the partition columns at the end in partition order. If an InsertIntoTable node has partitioning, it is checked against the table's reported partitions. These changes required adding a PartitionedRelation trait to the catalog interface because Hive's MetastoreRelation doesn't extend CatalogRelation. This commit also includes a fix to InsertIntoTable's resolved logic, which now detects that all expected columns are present, including dynamic partition columns. Previously, the number of expected columns was not checked and resolved was true if there were missing columns. ## How was this patch tested? This adds new tests to the InsertIntoTableSuite that are fixed by this PR. Author: Ryan Blue <blue@apache.org> Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning.	2016-05-09 17:01:23 +08:00
gatorsmile	a59ab594ca	[SPARK-15184][SQL] Fix Silent Removal of An Existent Temp Table by Rename Table #### What changes were proposed in this pull request? Currently, if we rename a temp table `Tab1` to another existing temp table `Tab2`, `Tab2` will be silently removed. This PR detects this and issues an exception message. In addition, this PR also detects another issue in the rename table command. When the destination table identifier does have a database name, we should not ignore it. That might mean users could rename a regular table. #### How was this patch tested? Added two related test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #12959 from gatorsmile/rewriteTable.	2016-05-09 13:05:18 +08:00
gatorsmile	e9131ec277	[SPARK-15185][SQL] InMemoryCatalog: Silent Removal of an Existent Table/Function/Partitions by Rename #### What changes were proposed in this pull request? So far, in the implementation of InMemoryCatalog, we do not check if the new/destination table/function/partition exists or not. Thus, we just silently remove the existent table/function/partition. This PR is to detect them and issue an appropriate exception. #### How was this patch tested? Added the related test cases. They also verify if HiveExternalCatalog also detects these errors. Author: gatorsmile <gatorsmile@gmail.com> Closes #12960 from gatorsmile/renameInMemoryCatalog.	2016-05-09 12:40:30 +08:00
Herman van Hovell	df89f1d43d	[SPARK-15122] [SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out ## What changes were proposed in this pull request? The official TPC-DS 41 query currently fails because it contains a scalar subquery with a disjunctive correlated predicate (the correlated predicates were nested in ORs). This makes the `Analyzer` pull out the entire predicate which is wrong and causes the following (correct) analysis exception: `The correlated scalar subquery can only contain equality predicates` This PR fixes this by first simplifying (or normalizing) the correlated predicates before pulling them out of the subquery. ## How was this patch tested? Manual testing on TPC-DS 41, and added a test to SubquerySuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12954 from hvanhovell/SPARK-15122.	2016-05-06 21:06:03 -07:00
gatorsmile	5c8fad7b9b	[SPARK-15108][SQL] Describe Permanent UDTF #### What changes were proposed in this pull request? When describing a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry. This PR corrects it. If the function is not in the functionRegistry, we will check the catalog to collect the information of the UDTF function. #### How was this patch tested? Added test cases to verify the results Author: gatorsmile <gatorsmile@gmail.com> Closes #12885 from gatorsmile/showFunction.	2016-05-06 11:43:07 -07:00
Jacek Laskowski	bbb7773437	[SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements ## What changes were proposed in this pull request? Minor doc and code style fixes ## How was this patch tested? local build Author: Jacek Laskowski <jacek@japila.pl> Closes #12928 from jaceklaskowski/SPARK-15152.	2016-05-05 16:34:27 -07:00
Shixiong Zhu	bb9991dec5	[SPARK-15135][SQL] Make sure SparkSession thread safe ## What changes were proposed in this pull request? Went through SparkSession and its members and fixed non-thread-safe classes used by SparkSession ## How was this patch tested? Existing unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #12915 from zsxwing/spark-session-thread-safe.	2016-05-05 14:36:47 -07:00
gatorsmile	8cba57a75c	[SPARK-14124][SQL][FOLLOWUP] Implement Database-related DDL Commands #### What changes were proposed in this pull request? First, a few test cases failed on Mac OS X because the property value of `java.io.tmpdir` does not include a trailing slash on some platforms. Hive always removes the last trailing slash. For example, what I found on the web: ``` Win NT --> C:\TEMP\ Win XP --> C:\TEMP Solaris --> /var/tmp/ Linux --> /var/tmp ``` Second, a couple of test cases are added to verify that the commands work properly. #### How was this patch tested? Added a test case for it and corrected the previous test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12081 from gatorsmile/mkdir.	2016-05-05 14:34:24 -07:00
Wenchen Fan	55cc1c991a	[SPARK-14139][SQL] RowEncoder should preserve schema nullability ## What changes were proposed in this pull request? The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which loses the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info. TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in a follow-up PR. ## How was this patch tested? new tests in `RowEncoderSuite` Note that this PR takes over https://github.com/apache/spark/pull/11980, with a little simplification, so all credits should go to koertkuipers Author: Wenchen Fan <wenchen@databricks.com> Author: Koert Kuipers <koert@tresata.com> Closes #12364 from cloud-fan/nullable.	2016-05-06 01:08:04 +08:00
Kousuke Saruta	1a9b341581	[SPARK-15132][MINOR][SQL] Debug log for generated code should be printed with proper indentation ## What changes were proposed in this pull request? Similar to #11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation. ## How was this patch tested? Manually checked. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12908 from sarutak/SPARK-15132.	2016-05-04 22:18:55 -07:00
Sean Zhong	8fb1463d6a	[SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query ## What changes were proposed in this pull request? This PR supports the new SQL syntax CREATE TEMPORARY VIEW. Like: ``` CREATE TEMPORARY VIEW viewName AS SELECT * from xx CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx ``` ## How was this patch tested? Unit tests. Author: Sean Zhong <clockfly@gmail.com> Closes #12872 from clockfly/spark-6399.	2016-05-04 18:27:25 -07:00
Liang-Chi Hsieh	b85d21fb9d	[SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate ## What changes were proposed in this pull request? We can support subexpression elimination in TungstenAggregate by using the current `EquivalentExpressions`, which is already used in subexpression elimination for expression codegen. However, in wholestage codegen, we can't wrap the common expression's code in functions as before, so we simply generate the code snippets for common expressions. These code snippets are inserted before the common expressions are actually used in the generated Java code. For multiple `TypedAggregateExpression`s used in an aggregation operator, their input type should be the same, so their `inputDeserializer` will be the same too. This patch can also reduce redundant input deserialization. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.	2016-05-04 10:54:51 -07:00
Cheng Lian	f152fae306	[SPARK-14127][SQL] Native "DESC [EXTENDED \| FORMATTED] <table>" DDL command ## What changes were proposed in this pull request? This PR implements native `DESC [EXTENDED \| FORMATTED] <table>` DDL command. Sample output: ``` scala> spark.sql("desc extended src").show(100, truncate = false) +----------------------------+---------------------------------+-------+ \|col_name \|data_type \|comment\| +----------------------------+---------------------------------+-------+ \|key \|int \| \| \|value \|string \| \| \| \| \| \| \|# Detailed Table Information\|CatalogTable(`default`.`src`, ...\| \| +----------------------------+---------------------------------+-------+ scala> spark.sql("desc formatted src").show(100, truncate = false) +----------------------------+----------------------------------------------------------+-------+ \|col_name \|data_type \|comment\| +----------------------------+----------------------------------------------------------+-------+ \|key \|int \| \| \|value \|string \| \| \| \| \| \| \|# Detailed Table Information\| \| \| \|Database: \|default \| \| \|Owner: \|lian \| \| \|Create Time: \|Mon Jan 04 17:06:00 CST 2016 \| \| \|Last Access Time: \|Thu Jan 01 08:00:00 CST 1970 \| \| \|Location: \|hdfs://localhost:9000/user/hive/warehouse_hive121/src \| \| \|Table Type: \|MANAGED \| \| \|Table Parameters: \| \| \| \| transient_lastDdlTime \|1451898360 \| \| \| \| \| \| \|# Storage Information \| \| \| \|SerDe Library: \|org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe \| \| \|InputFormat: \|org.apache.hadoop.mapred.TextInputFormat \| \| \|OutputFormat: \|org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\| \| \|Num Buckets: \|-1 \| \| \|Bucket Columns: \|[] \| \| \|Sort Columns: \|[] \| \| \|Storage Desc Parameters: \| \| \| \| serialization.format \|1 \| \| +----------------------------+----------------------------------------------------------+-------+ ``` ## How was this patch tested? 
A test case is added to `HiveDDLSuite` to check command output. Author: Cheng Lian <lian@databricks.com> Closes #12844 from liancheng/spark-14127-desc-table.	2016-05-04 16:44:09 +08:00
Wenchen Fan	6c12e801e8	[SPARK-15029] improve error message for Generate ## What changes were proposed in this pull request? This PR improves the error message for `Generate` in 3 cases: 1. the generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl` 2. the generator appears more than once in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl` 3. the generator appears in an operator other than project, e.g. `SELECT * FROM tbl SORT BY explode(list)` ## How was this patch tested? New tests in `AnalysisErrorSuite`. Author: Wenchen Fan <wenchen@databricks.com> Closes #12810 from cloud-fan/bug.	2016-05-04 00:10:20 -07:00
Andrew Or	6ba17cd147	[SPARK-14414][SQL] Make DDL exceptions more consistent ## What changes were proposed in this pull request? Just a bunch of small tweaks on DDL exception messages. ## How was this patch tested? `DDLCommandSuite` et al. Author: Andrew Or <andrew@databricks.com> Closes #12853 from andrewor14/make-exceptions-consistent.	2016-05-03 18:07:53 -07:00
gatorsmile	71296c041e	[SPARK-15056][SQL] Parse Unsupported Sampling Syntax and Issue Better Exceptions #### What changes were proposed in this pull request? Compared with the current Spark parser, Hive supports two extra sampling syntaxes: - In `ON` clauses, `rand()` indicates sampling on the entire row instead of an individual column. For example, ```SQL SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; ``` - Users can specify the total length to be read. For example, ```SQL SELECT * FROM source TABLESAMPLE(100M) s; ``` Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling This PR parses and captures these two extra syntaxes and issues a better error message. #### How was this patch tested? Added test cases to verify the thrown exceptions. Author: gatorsmile <gatorsmile@gmail.com> Closes #12838 from gatorsmile/bucketOnRand.	2016-05-03 23:20:18 +02:00
Andrew Ray	d8f528ceb6	[SPARK-13749][SQL][FOLLOW-UP] Faster pivot implementation for many distinct values with two phase aggregation ## What changes were proposed in this pull request? This is a follow up PR for #11583. It makes 3 lazy vals into just vals and adds unit test coverage. ## How was this patch tested? Existing unit tests and additional unit tests. Author: Andrew Ray <ray.andrew@gmail.com> Closes #12861 from aray/fast-pivot-follow-up.	2016-05-02 22:47:32 -07:00
bomeng	0fd95be3cd	[SPARK-15062][SQL] fix list type infer serializer issue ## What changes were proposed in this pull request? Make the serializer correctly inferred when the input type is `List[_]`. `List[_]` is a subtype of `Seq[_]`, but before this change it was matched by a different case (`case t if definedByConstructorParams(t)`). ## How was this patch tested? A new test case was added. Author: bomeng <bmeng@us.ibm.com> Closes #12849 from bomeng/SPARK-15062.	2016-05-02 18:20:29 -07:00
Herman van Hovell	1c19c2769e	[SPARK-15047][SQL] Cleanup SQL Parser ## What changes were proposed in this pull request? This PR addresses a few minor issues in SQL parser: - Removes some unused rules and keywords in the grammar. - Removes code path for fallback SQL parsing (was needed for Hive native parsing). - Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`. - Adds a more generic way of creating error messages for unsupported Hive features. - Use `visitFunctionName` as much as possible. - Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12826 from hvanhovell/SPARK-15047.	2016-05-02 18:12:31 -07:00
Herman van Hovell	f362363d14	[SPARK-14785] [SQL] Support correlated scalar subqueries ## What changes were proposed in this pull request? In this PR we add support for correlated scalar subqueries. An example of such a query is: ```SQL select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key) ``` The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans. I could not find well-defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation evaluates the scalar subquery before aggregation. This means that you either have to make the scalar subquery part of the grouping expression, or aggregate it further on. I am open to suggestions on this. The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`. ## How was this patch tested? Added tests to `SubquerySuite`. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12822 from hvanhovell/SPARK-14785.	2016-05-02 16:32:31 -07:00
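To make the rewrite described in SPARK-14785 concrete, here is a minimal plain-Python sketch of evaluating the example query by first aggregating the inner table and then probing it as a `LEFT OUTER` join would. It is a toy model, not Spark code: the function names and the `(key, value)` row layout are assumptions made for illustration.

```python
def max_per_key(rows):
    """Aggregate the inner table once: key -> max(value)."""
    best = {}
    for key, value in rows:
        if key not in best or value > best[key]:
            best[key] = value
    return best


def correlated_scalar_filter(tbl1, tbl2):
    """Model of: select * from tbl1 a
       where a.value > (select max(value) from tbl2 b where b.key = a.key)"""
    aggregated = max_per_key(tbl2)
    result = []
    for key, value in tbl1:
        # None models the NULL that a LEFT OUTER join yields when no row matches.
        inner_max = aggregated.get(key)
        # A comparison against NULL is not true, so unmatched rows are dropped.
        if inner_max is not None and value > inner_max:
            result.append((key, value))
    return result
```

Because the inner query is aggregated per key, each outer row sees at most one value, which mirrors the uniqueness requirement mentioned in the commit message.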
Davies Liu	95e372141a	[SPARK-14781] [SQL] support nested predicate subquery ## What changes were proposed in this pull request? In order to support nested predicate subqueries, this PR introduces an internal join type, ExistenceJoin, which emits all the rows from the left side plus an additional column that indicates whether any rows from the right side matched (it's not null-aware right now). This additional column can be used to replace the subquery in Filter. In theory, all predicate subqueries could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subqueries (subqueries inside OR). For example, the following SQL: ```sql SELECT a FROM t WHERE EXISTS (select 0) OR EXISTS (select 1) ``` This PR also fixes a bug in pushing predicate subqueries down through joins (they should not be pushed down). Nested null-aware subqueries are still not supported, for example `a > 3 OR b NOT IN (select bb from t)`. After this, we can run TPCDS queries Q10, Q35, and Q45. ## How was this patch tested? Added unit tests. Author: Davies Liu <davies@databricks.com> Closes #12820 from davies/or_exists.	2016-05-02 12:58:59 -07:00
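The ExistenceJoin idea from SPARK-14781 can be sketched in a few lines of plain Python: every left row is kept and paired with a boolean "exists" flag, and the outer filter then combines the flags with `OR`. This is only an illustrative model (not null-aware, like the original); the function names and the `condition` callback are hypothetical.

```python
def existence_join(left, right, condition):
    """Emit every left row plus a flag telling whether any right row matches."""
    return [(row, any(condition(row, r) for r in right)) for row in left]


def filter_exists_or_exists(left, right1, right2, condition):
    """Model of: WHERE EXISTS (... right1 ...) OR EXISTS (... right2 ...)."""
    first = existence_join(left, right1, condition)
    second = existence_join(left, right2, condition)
    # The two flag columns replace the two subqueries in the filter.
    return [row for (row, e1), (_, e2) in zip(first, second) if e1 or e2]
```

A LeftSemi join could not express the `OR` here, because it drops non-matching rows before the second subquery is evaluated; keeping all rows with a flag avoids that.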
Dongjoon Hyun	6e6320122e	[SPARK-14830][SQL] Add RemoveRepetitionFromGroupExpressions optimizer. ## What changes were proposed in this pull request? This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added. Before ```scala scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain() == Physical Plan == WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5]) : +- INPUT +- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9]) : +- INPUT +- LocalTableScan [a#0], [[1],[2]] ``` After ```scala scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain() == Physical Plan == WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5]) : +- INPUT +- Exchange hashpartitioning((a#0 + 1)#6, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6]) : +- INPUT +- LocalTableScan [a#0], [[1],[2]] ``` ## How was this patch tested? Pass the Jenkins tests (with a new testcase) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12590 from dongjoon-hyun/SPARK-14830.	2016-05-02 12:40:21 -07:00
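The effect of `RemoveRepetitionFromGroupExpressions` in the plans above can be modelled as de-duplicating grouping expressions by a canonical form. The sketch below is plain Python, not the Spark rule: it assumes, for illustration only, that two expressions are equivalent when they match after lower-casing attribute names and ordering the operands of `+`; how Spark itself decides equivalence is not stated in the commit message.

```python
def canonical(expr):
    """Canonical form of a toy expression: a str attribute, a literal,
    or a tuple (operator, operand, ...)."""
    if isinstance(expr, str):
        return expr.lower()  # assumption: case-insensitive attribute names
    if isinstance(expr, tuple):
        op, *args = expr
        args = [canonical(a) for a in args]
        if op == "+":  # assumption: addition is treated as commutative
            args = sorted(args, key=repr)
        return (op, *args)
    return expr


def dedupe_group_exprs(exprs):
    """Keep the first occurrence of each semantically equal grouping expression."""
    seen = set()
    kept = []
    for expr in exprs:
        key = canonical(expr)
        if key not in seen:
            seen.add(key)
            kept.append(expr)
    return kept
```

With the four grouping expressions from the example (`a+1, 1+a, A+1, 1+A`) modelled as tuples, only the first one survives, matching the single aggregation key in the "After" plan.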
Andrew Ray	9927441868	[SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation ## What changes were proposed in this pull request? The existing implementation of pivot translates into a single aggregation with one aggregate per distinct pivot value. When the number of distinct pivot values is large (say 1000+) this can get extremely slow since each input value gets evaluated on every aggregate even though it only affects the value of one of them. I'm proposing an alternate strategy for when there are 10+ (somewhat arbitrary threshold) distinct pivot values. We do two phases of aggregation. In the first we group by the grouping columns plus the pivot column and perform the specified aggregations (one or sometimes more). In the second aggregation we group by the grouping columns and use the new (non public) PivotFirst aggregate that rearranges the outputs of the first aggregation into an array indexed by the pivot value. Finally we do a project to extract the array entries into the appropriate output column. ## How was this patch tested? Additional unit tests in DataFramePivotSuite and manual larger scale testing. Author: Andrew Ray <ray.andrew@gmail.com> Closes #11583 from aray/fast-pivot.	2016-05-02 11:12:55 -07:00
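The two-phase strategy described in SPARK-13749 can be illustrated with a small plain-Python sketch. It assumes rows of the form `(group, pivot, amount)` and a sum aggregate; these shapes and the function name are illustrative assumptions, and the real `PivotFirst` aggregate is internal to Spark.

```python
def two_phase_pivot(rows, pivot_values):
    """Pivot (group, pivot, amount) rows into one record per group."""
    # Phase 1: group by (grouping column, pivot column) and aggregate.
    partial = {}
    for group, pivot, amount in rows:
        partial[(group, pivot)] = partial.get((group, pivot), 0) + amount

    # Phase 2: group by the grouping column only and place each partial
    # result into an array slot indexed by the pivot value.
    index = {value: i for i, value in enumerate(pivot_values)}
    arrays = {}
    for (group, pivot), total in partial.items():
        slots = arrays.setdefault(group, [None] * len(pivot_values))
        if pivot in index:
            slots[index[pivot]] = total

    # Final projection: extract the array entries into named output columns.
    return {group: dict(zip(pivot_values, slots)) for group, slots in arrays.items()}
```

Each input row touches exactly one partial aggregate in phase 1, instead of being evaluated against one aggregate per distinct pivot value, which is the cost the commit message sets out to avoid.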
Wenchen Fan	0513c3ac93	[SPARK-14637][SQL] object expressions cleanup ## What changes were proposed in this pull request? Simplify and clean up some object expressions: 1. simplify the logic to handle `propagateNull` 2. add `propagateNull` parameter to `Invoke` 3. simplify the unbox logic in `Invoke` 4. other minor cleanup TODO: simplify `MapObjects` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12399 from cloud-fan/object.	2016-05-02 10:21:14 -07:00
Yin Huai	0182d9599d	[SPARK-15034][SPARK-15035][SPARK-15036][SQL] Use spark.sql.warehouse.dir as the warehouse location This PR contains three changes: 1. We will use spark.sql.warehouse.dir to set the warehouse location. We will not use hive.metastore.warehouse.dir. 2. SessionCatalog needs to set the location of the default db. Otherwise, when creating a table in SparkSession without Hive support, the default db's path will be an empty string. 3. When we create a database, we need to make the path qualified. Tested with existing tests and new tests. Author: Yin Huai <yhuai@databricks.com> Closes #12812 from yhuai/warehouse.	2016-04-30 18:04:42 -07:00
Wenchen Fan	43b149fb88	[SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT ## What changes were proposed in this pull request? This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT. ## How was this patch tested? existing tests and new test suite `UnsafeArraySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12640 from cloud-fan/ml.	2016-04-29 23:04:51 -07:00
Yin Huai	ac41fc648d	[SPARK-14591][SQL] Remove DataTypeParser and add more keywords to the nonReserved list. ## What changes were proposed in this pull request? CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser. ## How was this patch tested? Existing tests Author: Yin Huai <yhuai@databricks.com> Closes #12796 from yhuai/removeDataTypeParser.	2016-04-29 22:49:12 -07:00
Reynold Xin	7945f9f6d4	[SPARK-14757] [SQL] Fix nullability bug in EqualNullSafe codegen ## What changes were proposed in this pull request? This patch fixes a null handling bug in EqualNullSafe's code generation. ## How was this patch tested? Updated unit tests so they would fail without the fix. Closes #12628. Author: Reynold Xin <rxin@databricks.com> Author: Arash Nabili <arash@levyx.com> Closes #12799 from rxin/equalnullsafe.	2016-04-29 22:26:12 -07:00
Herman van Hovell	83061be697	[SPARK-14858] [SQL] Enable subquery pushdown The previous subquery PRs did not include support for pushing subqueries used in filters (`WHERE`/`HAVING`) down. This PR adds this support. For example: ```scala range(0, 10).registerTempTable("a") range(5, 15).registerTempTable("b") range(7, 25).registerTempTable("c") range(3, 12).registerTempTable("d") val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)") plan.explain(true) ``` Leads to the following Analyzed & Optimized plans: ``` == Parsed Logical Plan == ... == Analyzed Logical Plan == id: bigint, id: bigint, id: bigint Project [id#0L,id#4L,id#8L] +- Filter predicate-subquery#16 [(id#0L = id#12L)] : +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)] : +- Project [id#12L] : +- SubqueryAlias d : +- Range 3, 12, 1, 8, [id#12L] +- Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- SubqueryAlias a : : +- Range 0, 10, 1, 8, [id#0L] : +- SubqueryAlias b : +- Range 5, 15, 1, 8, [id#4L] +- SubqueryAlias c +- Range 7, 25, 1, 8, [id#8L] == Optimized Logical Plan == Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- Join LeftSemi, Some((id#0L = id#12L)) : : :- Range 0, 10, 1, 8, [id#0L] : : +- Range 3, 12, 1, 8, [id#12L] : +- Range 5, 15, 1, 8, [id#4L] +- Range 7, 25, 1, 8, [id#8L] == Physical Plan == ... ``` I have also taken the opportunity to move quite a bit of code around: - Rewriting subqueries and pulling out correlated predicates from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do. - I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. 
I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child). - Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution. These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere. Current tests and added test cases in FilterPushdownSuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12720 from hvanhovell/SPARK-14858.	2016-04-29 16:50:12 -07:00
Sun Rui	4ae9fe091c	[SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. ## What changes were proposed in this pull request? dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame. The function signature is: dapply(df, function(localDF) {}, schema = NULL) R function input: local data.frame from the partition on local node R function output: local data.frame Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply(). ## How was this patch tested? SparkR unit tests. Author: Sun Rui <rui.sun@intel.com> Author: Sun Rui <sunrui2016@gmail.com> Closes #12493 from sun-rui/SPARK-12919.	2016-04-29 16:41:07 -07:00
Reynold Xin	054f991c43	[SPARK-14994][SQL] Remove execution hive from HiveSessionState ## What changes were proposed in this pull request? This patch removes executionHive from HiveSessionState and HiveSharedState. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12770 from rxin/SPARK-14994.	2016-04-29 01:14:02 -07:00
gatorsmile	222dcf7937	[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join #### What changes were proposed in this pull request? Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). ```SQL SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 ``` Note: 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect. This PR also corrects the existing behavior in Spark. Before this PR, the behavior was: ```scala test("except") { val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id") val df_right = Seq(1, 3).toDF("id") checkAnswer( df_left.except(df_right), Row(2) :: Row(2) :: Row(4) :: Nil ) } ``` After this PR, the result is corrected. We strictly follow the SQL semantics of `Except Distinct`. #### How was this patch tested? Modified and added a few test cases to verify the optimization rule and the results of operators. Author: gatorsmile <gatorsmile@gmail.com> Closes #12736 from gatorsmile/exceptByAntiJoin.	2016-04-29 15:30:36 +08:00
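The rewrite in SPARK-12660 (`EXCEPT` becomes `SELECT DISTINCT ... LEFT ANTI JOIN ... ON a1<=>b1 AND a2<=>b2`) can be modelled in plain Python. This is a sketch of the semantics only, with rows as tuples and `None` standing in for SQL NULL; the function names are made up for illustration.

```python
def null_safe_eq(a, b):
    """Model of SQL's <=> operator: NULL <=> NULL is true.
    Python's None == None is already True, so plain equality suffices."""
    return a == b


def rows_match(left_row, right_row):
    return all(null_safe_eq(x, y) for x, y in zip(left_row, right_row))


def except_distinct(left, right):
    """LEFT ANTI JOIN on null-safe equality, followed by DISTINCT."""
    anti_joined = [l for l in left if not any(rows_match(l, r) for r in right)]
    seen = set()
    result = []
    for row in anti_joined:  # DISTINCT, preserving first-seen order
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result
```

Using the data from the test in the commit message, the left side `1, 2, 2, 3, 3, 4` minus the right side `1, 3` yields `2` and `4` once each, rather than the duplicated `2` that the old behavior returned.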
Reynold Xin	4607f6e7f7	[SPARK-14991][SQL] Remove HiveNativeCommand ## What changes were proposed in this pull request? This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive. ## How was this patch tested? Updated tests to reflect this. Author: Reynold Xin <rxin@databricks.com> Closes #12769 from rxin/SPARK-14991.	2016-04-28 21:58:48 -07:00
Yin Huai	9c7c42bc6a	Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local" This reverts commit `dae538a4d7`.	2016-04-28 19:57:41 -07:00
Pravin Gadakh	dae538a4d7	[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local ## What changes were proposed in this pull request? This PR adds `since` tag into the matrix and vector classes in spark-mllib-local. ## How was this patch tested? Scala-style checks passed. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #12416 from pravingadakh/SPARK-14613.	2016-04-28 15:59:18 -07:00
Gregory Hart	12c360c057	[SPARK-14965][SQL] Indicate an exception is thrown for a missing struct field ## What changes were proposed in this pull request? Fix to ScalaDoc for StructType. ## How was this patch tested? Built locally. Author: Gregory Hart <greg.hart@thinkbiganalytics.com> Closes #12758 from freastro/hotfix/SPARK-14965.	2016-04-28 11:21:43 -07:00
Liang-Chi Hsieh	7c6937a885	[SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation ## What changes were proposed in this pull request? Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes. For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty. We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation. ## How was this patch tested? `UserDefinedTypeSuite` Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12259 from viirya/improve-sql-usertype.	2016-04-28 01:14:49 -07:00
Andrew Or	37575115b9	[SPARK-14940][SQL] Move ExternalCatalog to own file ## What changes were proposed in this pull request? `interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness. ## How was this patch tested? Just moving things around. Author: Andrew Or <andrew@databricks.com> Closes #12721 from andrewor14/move-external-catalog.	2016-04-27 14:17:36 -07:00
Cheng Lian	24bea00047	[SPARK-14954] [SQL] Add PARTITION BY and BUCKET BY clause for data source CTAS syntax Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL: ``` CREATE TABLE <table-name> USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)] [PARTITIONED BY (col1, col2, ...)] [CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS] AS SELECT ... ``` Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax. Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12734 from liancheng/spark-14954.	2016-04-27 13:55:13 -07:00
Dongjoon Hyun	af92299fdb	[SPARK-14664][SQL] Implement DecimalAggregates optimization for Window queries ## What changes were proposed in this pull request? This PR aims to implement decimal aggregation optimization for window queries by improving existing `DecimalAggregates`. Historically, `DecimalAggregates` optimizer is designed to transform general `sum/avg(decimal)`, but it breaks recently added windows queries like the followings. The following queries work well without the current `DecimalAggregates` optimizer. Sum ```scala scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1) scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23] : +- INPUT +- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#21] +- Scan OneRowRelation[] ``` Average ```scala scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44] : +- INPUT +- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 
avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#42] +- Scan OneRowRelation[] ``` After this PR, those queries work fine and new optimized physical plans look like the followings. Sum ```scala scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35] : +- INPUT +- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#33] +- Scan OneRowRelation[] ``` Average ```scala scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47] : +- INPUT +- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#45] +- Scan OneRowRelation[] ``` In this PR, SUM over window pattern matching is based on the code of hvanhovell ; he should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests (with newly added testcases) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12421 from dongjoon-hyun/SPARK-14664.	2016-04-27 21:36:19 +02:00
Yin Huai	54a3eb8312	[SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands ## What changes were proposed in this pull request? This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands. ## How was this patch tested? Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite. Author: Yin Huai <yhuai@databricks.com> Closes #12714 from yhuai/banNativeCommand.	2016-04-27 00:30:54 -07:00
Andrew Or	d8a83a564f	[SPARK-13477][SQL] Expose new user-facing Catalog interface ## What changes were proposed in this pull request? #12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface. ## How was this patch tested? See `CatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #12713 from andrewor14/user-facing-catalog.	2016-04-26 21:29:25 -07:00
Dilip Biswal	d93976d866	[SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS ## What changes were proposed in this pull request? This PR adds native execution of the SHOW COLUMNS and SHOW PARTITIONS commands. Command Syntax: ``` SQL SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database] ``` ``` SQL SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)] ``` ## How was this patch tested? Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite to verify plans. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #12222 from dilipbiswal/dkb_show_columns.	2016-04-27 09:28:24 +08:00
gatorsmile	162cf02efa	[SPARK-14910][SQL] Native DDL Command Support for Describe Function in Non-identifier Format #### What changes were proposed in this pull request? The existing `Describe Function` only supports function names in `identifier` format. This differs from Hive's behavior, which is why many `udf_abc` test cases in `HiveCompatibilitySuite` are not using our native DDL support. For example, - udf_not.q - udf_bitwise_not.q This PR resolves these issues. Now, we can support the `Describe Function` command with function names in the following formats: - `qualifiedName` (e.g., `db.func1`) - `STRING` (e.g., `'func1'`) - `comparisonOperator` (e.g., `<`) - `arithmeticOperator` (e.g., `+`) - `predicateOperator` (e.g., `or`) Note that before this PR, we only had native command support when the function name was in the `qualifiedName` format. #### How was this patch tested? Added test cases in `DDLSuite.scala`. Also manually verified that all the related test cases in `HiveCompatibilitySuite` passed. Author: gatorsmile <gatorsmile@gmail.com> Closes #12679 from gatorsmile/descFunction.	2016-04-26 19:29:34 +02:00
Jacek Laskowski	b208229ba1	[MINOR][DOCS] Minor typo fixes ## What changes were proposed in this pull request? Minor typo fixes (too minor to deserve a separate JIRA) ## How was this patch tested? local build Author: Jacek Laskowski <jacek@japila.pl> Closes #12469 from jaceklaskowski/minor-typo-fixes.	2016-04-26 11:51:12 +01:00
Reynold Xin	f36c9c8379	[SPARK-14888][SQL] UnresolvedFunction should use FunctionIdentifier ## What changes were proposed in this pull request? This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction. ## How was this patch tested? Updated related unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12659 from rxin/SPARK-14888.	2016-04-25 16:20:57 -07:00
gatorsmile	0c47e274ab	[SPARK-13739][SQL] Push Predicate Through Window #### What changes were proposed in this pull request? For performance, predicates can be pushed through Window if and only if the following conditions are satisfied: 1. All the expressions are part of window partitioning key. The expressions can be compound. 2. Deterministic #### How was this patch tested? TODO: - [X] DSL needs to be modified for window - [X] more tests will be added. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11635 from gatorsmile/pushPredicateThroughWindow.	2016-04-25 22:32:34 +02:00
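The two conditions listed in SPARK-13739 can be expressed as a small check. The plain-Python sketch below simplifies things: partition keys are modelled as attribute names (the commit message notes that real keys can be compound expressions), and "part of the window partitioning key" is modelled as the predicate's references being a subset of those names. The `Predicate` class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Toy predicate: the attributes it reads and whether it is deterministic."""
    references: frozenset
    deterministic: bool = True


def can_push_through_window(predicate, partition_keys):
    """True only if the predicate is deterministic and reads nothing
    outside the window's partitioning key."""
    return predicate.deterministic and predicate.references <= frozenset(partition_keys)
```

A filter that only reads partitioning columns removes whole partitions, so evaluating it before the window cannot change the values the window function computes for the surviving rows.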
Sameer Agarwal	cbdcd4edab	[SPARK-14870] [SQL] Fix NPE in TPCDS q14a ## What changes were proposed in this pull request? This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a. ## How was this patch tested? 1. Added regression test in `DataFrameAggregateSuite`. 2. Verified that TPCDS q14a works Author: Sameer Agarwal <sameer@databricks.com> Closes #12651 from sameeragarwal/tpcds-fix.	2016-04-24 22:52:50 -07:00
jliwork	f0f1a8afde	[SPARK-14548][SQL] Support not greater than and not less than operator in Spark SQL `!<` means not less than, which is equivalent to `>=`. `!>` means not greater than, which is equivalent to `<=`. I'd like to create a PR to support these two operators. I've added new test cases in: DataFrameSuite, ExpressionParserSuite, JDBCSuite, PlanParserSuite, SQLQuerySuite dilipbiswal viirya gatorsmile Author: jliwork <jiali@us.ibm.com> Closes #12316 from jliwork/SPARK-14548.	2016-04-24 11:22:06 -07:00
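The equivalences stated in SPARK-14548 amount to a two-entry lookup table. A minimal plain-Python sketch, where the `evaluate` helper is hypothetical and only illustrates the mapping:

```python
import operator

# "!<" (not less than) behaves like ">=", "!>" (not greater than) like "<=".
EQUIVALENT_OPERATORS = {
    "!<": operator.ge,
    "!>": operator.le,
}


def evaluate(op, left, right):
    """Evaluate one of the two operators on plain Python values."""
    return EQUIVALENT_OPERATORS[op](left, right)
```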
gatorsmile	337289d712	[SPARK-14691][SQL] Simplify and Unify Error Generation for Unsupported Alter Table DDL #### What changes were proposed in this pull request? So far, we are capturing each unsupported Alter Table in separate visit functions. They should be unified and issue the same ParseException instead. This PR is to refactor the existing implementation and make error message consistent for Alter Table DDL. #### How was this patch tested? Updated the existing test cases and also added new test cases to ensure all the unsupported statements are covered. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12459 from gatorsmile/cleanAlterTable.	2016-04-24 18:53:27 +02:00
Yin Huai	1672149c26	[SPARK-14879][SQL] Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core ## What changes were proposed in this pull request? CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect are not Hive-specific. So, this PR moves them from sql/hive to sql/core. Also, I am adding `Command` suffix to these two classes. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #12645 from yhuai/moveCreateDataSource.	2016-04-23 22:29:31 -07:00
Liang-Chi Hsieh	ba5e0b87a0	[SPARK-14838] [SQL] Set default size for ObjectType to avoid failure when estimating sizeInBytes in ObjectProducer ## What changes were proposed in this pull request? We have logical plans that produce domain objects, which are of type `ObjectType`. As we can't estimate the size of `ObjectType`, we throw an `UnsupportedOperationException` when trying to do so. We should set a default size for `ObjectType` to avoid this failure. ## How was this patch tested? `DatasetSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12599 from viirya/skip-broadcast-objectproducer.	2016-04-23 21:15:31 -07:00
Dongjoon Hyun	bebb0240e6	[MINOR] [SQL] Fix error message string in nullSafeEval of TernaryExpression ## What changes were proposed in this pull request? TernaryExpressions should throw a proper error message for themselves. ```scala protected def nullSafeEval(input1: Any, input2: Any, input3: Any): Any = - sys.error(s"BinaryExpressions must override either eval or nullSafeEval") + sys.error(s"TernaryExpressions must override either eval or nullSafeEval") ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12642 from dongjoon-hyun/minor_fix_error_msg_in_ternaryexpression.	2016-04-23 16:39:35 -07:00
Reynold Xin	890abd1279	[SPARK-14869][SQL] Don't mask exceptions in ResolveRelations ## What changes were proposed in this pull request? In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence. ## How was this patch tested? I manually hacked some bugs into Spark and made sure the exceptions were being propagated up. Author: Reynold Xin <rxin@databricks.com> Closes #12634 from rxin/SPARK-14869.	2016-04-23 12:49:36 -07:00
Reynold Xin	5c8a0ec99b	[SPARK-14872][SQL] Restructure command package ## What changes were proposed in this pull request? This patch restructures sql.execution.command package to break the commands into multiple files, in some logical organization: databases, tables, views, functions. I also renamed basicOperators.scala to basicLogicalOperators.scala and basicPhysicalOperators.scala. ## How was this patch tested? N/A - all I did was moving code around. Author: Reynold Xin <rxin@databricks.com> Closes #12636 from rxin/SPARK-14872.	2016-04-23 12:44:00 -07:00
Reynold Xin	95faa731c1	[SPARK-14866][SQL] Break SQLQuerySuite out into smaller test suites ## What changes were proposed in this pull request? This patch breaks SQLQuerySuite out into smaller test suites. It was a little bit too large for debugging. ## How was this patch tested? This is a test only change. Author: Reynold Xin <rxin@databricks.com> Closes #12630 from rxin/SPARK-14866.	2016-04-22 22:50:32 -07:00
Josh Rosen	bdde010edb	[SPARK-14863][SQL] Cache TreeNode's hashCode by default Caching TreeNode's `hashCode` can lead to orders-of-magnitude performance improvement in certain optimizer rules when operating on huge/complex schemas. Author: Josh Rosen <joshrosen@databricks.com> Closes #12626 from JoshRosen/cache-treenode-hashcode.	2016-04-23 13:42:44 +08:00
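The caching idea in SPARK-14863 can be illustrated with a minimal sketch. The `Node` class, the `computations` counter, and the tuple-based hash below are hypothetical stand-ins for illustration only, not Spark's `TreeNode`; the point is just that a `lazy val` computes the hash of an immutable tree once and reuses it on later calls.

```scala
object HashCacheSketch {
  // Hypothetical counter, only here to make the caching observable.
  var computations = 0

  // Simplified immutable tree node; NOT Spark's TreeNode.
  final class Node(val name: String, val children: Seq[Node]) {
    // Evaluated at most once per node, on first use.
    private lazy val cachedHash: Int = {
      computations += 1
      (name, children.map(_.hashCode)).hashCode
    }

    override def hashCode(): Int = cachedHash
  }
}
```

Caching like this is only safe because the node is immutable; a mutable node would need to invalidate the cached value on every change.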
Reynold Xin	c06110187b	[SPARK-14842][SQL] Implement view creation in sql/core ## What changes were proposed in this pull request? This patch re-implements view creation command in sql/core, based on the pre-existing view creation command in the Hive module. This consolidates the view creation logical command and physical command into a single one, called CreateViewCommand. ## How was this patch tested? All the code should've been tested by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12615 from rxin/SPARK-14842-2.	2016-04-22 20:30:51 -07:00
Reynold Xin	d7d0cad0ad	[SPARK-14855][SQL] Add "Exec" suffix to physical operators ## What changes were proposed in this pull request? This patch adds "Exec" suffix to all physical operators. Before this patch, Spark's physical operators and logical operators are named the same (e.g. Project could be logical.Project or execution.Project), which caused small issues in code review and bigger issues in code refactoring. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #12617 from rxin/exec-node.	2016-04-22 17:43:56 -07:00
Dongjoon Hyun	3647120a5a	[SPARK-14796][SQL] Add spark.sql.optimizer.inSetConversionThreshold config option. ## What changes were proposed in this pull request? Currently, `OptimizeIn` optimizer replaces `In` expression into `InSet` expression if the size of set is greater than a constant, 10. This issue aims to make a configuration `spark.sql.optimizer.inSetConversionThreshold` for that. After this PR, `OptimizerIn` is configurable. ```scala scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain() == Physical Plan == WholeStageCodegen : +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8] : +- INPUT +- Generate explode([1,2]), false, false, [a#7] +- Scan OneRowRelation[] scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2") scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain() == Physical Plan == WholeStageCodegen : +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17] : +- INPUT +- Generate explode([1,2]), false, false, [a#16] +- Scan OneRowRelation[] ``` ## How was this patch tested? Pass the Jenkins tests (with a new testcase) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12562 from dongjoon-hyun/SPARK-14796.	2016-04-22 14:14:47 -07:00
Davies Liu	c417cec067	[SPARK-14763][SQL] fix subquery resolution ## What changes were proposed in this pull request? Currently, a column could be resolved wrongly if columns from both the outer table and the subquery have the same name; we should only resolve the attributes that can't be resolved within the subquery. They may have the same exprId as other attributes in the subquery, so we should create aliases for them. Also, the columns in an IN subquery could have the same exprId, so we should create aliases for them. ## How was this patch tested? Added regression tests. Manually tested TPCDS Q70 and Q95, which work well after this patch. Author: Davies Liu <davies@databricks.com> Closes #12539 from davies/fix_subquery.	2016-04-22 20:55:41 +02:00
Herman van Hovell	d060da098a	[SPARK-14762] [SQL] TPCDS Q90 fails to parse ### What changes were proposed in this pull request? TPCDS Q90 fails to parse because it uses a reserved keyword as an Identifier; `AT` was used as an alias for one of the subqueries. `AT` is not a reserved keyword and should have been registered in the `nonReserved` rule. In order to prevent this from happening again I have added tests for all keywords that are non-reserved in Hive. See the `nonReserved`, `sql11ReservedKeywordsUsedAsCastFunctionName` & `sql11ReservedKeywordsUsedAsIdentifier` rules in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g. ### How was this patch tested? Added tests for all Hive non-reserved keywords to `TableIdentifierParserSuite`. cc davies Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12537 from hvanhovell/SPARK-14762.	2016-04-22 11:28:46 -07:00
Joan	bf95b8da27	[SPARK-6429] Implement hashCode and equals together ## What changes were proposed in this pull request? Implement some `hashCode` and `equals` together in order to enable the scalastyle. This is a first batch, I will continue to implement them but I wanted to know your thoughts. Author: Joan <joan@goyeau.com> Closes #12157 from joan38/SPARK-6429-HashCode-Equals.	2016-04-22 12:24:12 +01:00
Liang-Chi Hsieh	e09ab5da8b	[SPARK-14609][SQL] Native support for LOAD DATA DDL command ## What changes were proposed in this pull request? Add the native support for LOAD DATA DDL command that loads data into Hive table/partition. ## How was this patch tested? `HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use `LOAD DATA` command. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12412 from viirya/ddl-load-data.	2016-04-22 18:26:28 +08:00
Reynold Xin	284b15d2fb	[SPARK-14826][SQL] Remove HiveQueryExecution ## What changes were proposed in this pull request? This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12588 from rxin/SPARK-14826.	2016-04-22 01:31:13 -07:00
Reynold Xin	3405cc7758	[SPARK-14835][SQL] Remove MetastoreRelation dependency from SQLBuilder ## What changes were proposed in this pull request? This patch removes SQLBuilder's dependency on MetastoreRelation. We should be able to move SQLBuilder into the sql/core package after this change. ## How was this patch tested? N/A - covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12594 from rxin/SPARK-14835.	2016-04-21 21:48:48 -07:00
Sameer Agarwal	b29bc3f515	[SPARK-14680] [SQL] Support all datatypes to use VectorizedHashmap in TungstenAggregate ## What changes were proposed in this pull request? This PR adds support for all primitive datatypes, decimal types and stringtypes in the VectorizedHashmap during aggregation. ## How was this patch tested? Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below). Author: Sameer Agarwal <sameer@databricks.com> Closes #12440 from sameeragarwal/all-datatypes.	2016-04-21 21:31:01 -07:00
Takuya UESHIN	f1fdb23821	[SPARK-14793] [SQL] Code generation for large complex type exceeds JVM size limit. ## What changes were proposed in this pull request? Code generation for complex type, `CreateArray`, `CreateMap`, `CreateStruct`, `CreateNamedStruct`, exceeds JVM size limit for large elements. We should split generated code into multiple `apply` functions if the complex types have large elements, like `UnsafeProjection` or others for large expressions. ## How was this patch tested? I added some tests to check if the generated codes for the expressions exceed or not. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #12559 from ueshin/issues/SPARK-14793.	2016-04-21 21:17:56 -07:00
Reynold Xin	f181aee07c	[SPARK-14821][SQL] Implement AnalyzeTable in sql/core and remove HiveSqlAstBuilder ## What changes were proposed in this pull request? This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder. In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait. ## How was this patch tested? Updated unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12584 from rxin/SPARK-14821.	2016-04-21 17:41:29 -07:00
Eric Liang	e2b5647ab9	[SPARK-14724] Use radix sort for shuffles and sort operator when possible ## What changes were proposed in this pull request? Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible. The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR. ## How was this patch tested? Unit tests, enabling radix sort on existing tests. Microbenchmark results: ``` Running benchmark: radix sort 25000000 Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic Intel(R) Core(TM) i7-4600U CPU 2.10GHz radix sort 25000000: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- reference TimSort key prefix array 15546 / 15859 1.6 621.9 1.0X reference Arrays.sort 2416 / 2446 10.3 96.6 6.4X radix sort one byte 133 / 137 188.4 5.3 117.2X radix sort two bytes 255 / 258 98.2 10.2 61.1X radix sort eight bytes 991 / 997 25.2 39.6 15.7X radix sort key prefix array 1540 / 1563 16.2 61.6 10.1X ``` I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort, however sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort. 
``` TPCDS on master: 2499s real time, 8185s executor - 1171s in TimSort, avg 267 MB/s (note the /s accounting is weird here since dataSize counts the record sizes too) TPCDS with radix enabled: 2294s real time, 7391s executor - 596s in TimSort, avg 254 MB/s - 26s in radix sort, avg 4.2 GB/s ``` cc davies rxin Author: Eric Liang <ekl@databricks.com> Closes #12490 from ericl/sort-benchmark.	2016-04-21 16:48:51 -07:00
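For readers unfamiliar with the technique named in SPARK-14724, here is a minimal sketch of a least-significant-byte-first radix sort. It is not the implementation added to Spark's unsafe sort package; the object and method names are made up for illustration, and the sketch assumes non-negative `Long` keys (negative keys would need extra handling of the sign byte).

```scala
object RadixSketch {
  /** Returns a sorted copy of `keys`; assumes every key is non-negative. */
  def radixSort(keys: Array[Long]): Array[Long] = {
    var src = keys.clone()
    var dst = new Array[Long](keys.length)
    for (byteIdx <- 0 until 8) {
      val shift = byteIdx * 8
      // counts(b + 1) first holds how many keys have byte value b.
      val counts = new Array[Int](257)
      src.foreach { k => counts(((k >>> shift) & 0xFF).toInt + 1) += 1 }
      // Prefix sums turn counts(b) into the first output slot for byte value b.
      for (i <- 1 to 256) counts(i) += counts(i - 1)
      // Stable scatter: keys with equal bytes keep their relative order.
      src.foreach { k =>
        val b = ((k >>> shift) & 0xFF).toInt
        dst(counts(b)) = k
        counts(b) += 1
      }
      val tmp = src; src = dst; dst = tmp
    }
    src // after an even number of passes the sorted data is back in `src`
  }
}
```

Each pass is linear in the number of keys, which is where the advantage over a comparison sort such as TimSort comes from when keys are fixed-width integers.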
Reynold Xin	1a95397bb6	[SPARK-14798][SQL] Move native command and script transformation parsing into SparkSqlAstBuilder ## What changes were proposed in this pull request? This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for diff. ## How was this patch tested? Updated test cases to reflect this. Author: Reynold Xin <rxin@databricks.com> Closes #12564 from rxin/SPARK-14798.	2016-04-21 15:59:37 -07:00
Wenchen Fan	7abe9a6578	[SPARK-9013][SQL] generate MutableProjection directly instead of return a function `MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections. However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function. Author: Wenchen Fan <wenchen@databricks.com> Closes #7373 from cloud-fan/project.	2016-04-20 00:44:02 -07:00
Wenchen Fan	856bc465d5	[SPARK-14600] [SQL] Push predicates through Expand ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14600 This PR makes `Expand.output` have different attributes from the grouping attributes produced by the underlying `Project`, as they have different meaning, so that we can safely push down filter through `Expand` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12496 from cloud-fan/expand.	2016-04-19 21:53:19 -07:00
Joan	3ae25f244b	[SPARK-13929] Use Scala reflection for UDTs ## What changes were proposed in this pull request? Enable ScalaReflection and User Defined Types for plain Scala classes. This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet. ## How was this patch tested? Unit test Author: Joan <joan@goyeau.com> Closes #12149 from joan38/SPARK-13929-Scala-reflection.	2016-04-19 17:36:31 -07:00
Herman van Hovell	da8859226e	[SPARK-4226] [SQL] Support IN/EXISTS Subqueries ### What changes were proposed in this pull request? This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms: - `[NOT] EXISTS(subquery)` - `[NOT] IN (subquery)` This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did. ### How was this patch tested? Modified parsing unit tests. Added tests to `org.apache.spark.sql.SQLQuerySuite` cc rxin, davies & chenghao-intel Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12306 from hvanhovell/SPARK-4226.	2016-04-19 15:16:02 -07:00
Wenchen Fan	5cb2e33609	[SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type ## What changes were proposed in this pull request? After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with the right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializers with `Invoke`. However, after https://github.com/apache/spark/pull/12338, we add the loop variable to the class members when generating code for `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by the `MapObjects` expression, we add the same loop variable to the class members twice (once each for the left and right buffer deserializers), which causes the `ClassFormatError`. This PR fixes this issue by calling `distinct` before declaring the class members. ## How was this patch tested? new regression test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12468 from cloud-fan/bug.	2016-04-19 10:51:58 -07:00
Josh Rosen	947b9020b0	[SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread. This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`. I tested this manually using `16b31c8251`, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR. /cc rxin nongli yhuai anabranch Author: Josh Rosen <joshrosen@databricks.com> Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.	2016-04-19 10:38:10 -07:00
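The catch, wrap, and re-throw pattern described in SPARK-14676 can be sketched as follows. The helper name `awaitResult`, the wrapper exception type, and the message are assumptions chosen for illustration and may differ from what the patch uses; only `Await.result` and `NonFatal` are standard Scala library calls.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.util.control.NonFatal

object AwaitSketch {
  /** Waits for `f`; any failure is re-thrown wrapped in a new exception. */
  def awaitResult[T](f: Future[T], atMost: Duration): T =
    try {
      Await.result(f, atMost)
    } catch {
      // The wrapper is created on the waiting thread, so its stack trace
      // includes the caller of awaitResult; the original stays as the cause.
      case NonFatal(t) =>
        throw new RuntimeException("Exception thrown in awaitResult", t)
    }
}
```

Because the original exception is kept as the cause, a printed stack trace shows both where the failure happened and which caller was blocked on it.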
gatorsmile	d9620e769e	[SPARK-12457] Fixed the Wrong Description and Missing Example in Collection Functions #### What changes were proposed in this pull request? https://github.com/apache/spark/pull/12185 contains the original PR I submitted in https://github.com/apache/spark/pull/10418. However, it misses one of the extended examples, and contains a wrong description and a few typos for collection functions. This PR fixes all these issues. #### How was this patch tested? The existing test cases already cover it. Author: gatorsmile <gatorsmile@gmail.com> Closes #12492 from gatorsmile/expressionUpdate.	2016-04-19 10:33:40 -07:00
Wenchen Fan	9ee95b6ecc	[SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations ## What changes were proposed in this pull request? This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer. Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there. ## How was this patch tested? existing tests and new test in `EliminateSerializationSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12260 from cloud-fan/encoder.	2016-04-19 10:00:44 -07:00
Dongjoon Hyun	3d46d796a3	[SPARK-14577][SQL] Add spark.sql.codegen.maxCaseBranches config option ## What changes were proposed in this pull request? We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf. ## How was this patch tested? Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12353 from dongjoon-hyun/SPARK-14577.	2016-04-19 21:38:15 +08:00
bomeng	74fe235ab5	[SPARK-14398][SQL] Audit non-reserved keyword list in ANTLR4 parser ## What changes were proposed in this pull request? I have compared non-reserved list in Antlr3 and Antlr4 one by one as well as all the existing keywords defined in Antlr4, added the missing keywords to the non-reserved keywords list. If we need to support more syntax, we can add more keywords by then. Any recommendation for the above is welcome. ## How was this patch tested? I manually checked the keywords one by one. Please let me know if there is a better way to test. Another thought: I suggest to put all the keywords definition and non-reserved list in order, that will be much easier to check in the future. Author: bomeng <bmeng@us.ibm.com> Closes #12191 from bomeng/SPARK-14398.	2016-04-19 09:09:58 +02:00
Sameer Agarwal	4eae1dbd7c	[SPARK-14718][SQL] Avoid mutating ExprCode in doGenCode ## What changes were proposed in this pull request? The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation. ## How was this patch tested? Existing Tests Author: Sameer Agarwal <sameer@databricks.com> Closes #12483 from sameeragarwal/new-exprcode-2.	2016-04-18 20:28:22 -07:00
Sameer Agarwal	8bd8121329	[SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect the semantics ## What changes were proposed in this pull request? Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls. ## How was this patch tested? N/A (refactoring only) Author: Sameer Agarwal <sameer@databricks.com> Closes #12475 from sameeragarwal/gencode.	2016-04-18 14:03:40 -07:00
Reynold Xin	e4ae974294	[HOTFIX] Fix Scala 2.10 compilation break.	2016-04-18 12:57:23 -07:00
Dongjoon Hyun	d280d1da1a	[SPARK-14580][SPARK-14655][SQL] Hive IfCoercion should preserve predicate. ## What changes were proposed in this pull request? Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose return-type are null. However, some UDFs need evaluations because they are designed to throw exceptions. This PR fixes that to preserve the predicates. Also, `assert_true` is implemented as Spark SQL function. Before ``` scala> sql("select if(assert_true(false),2,3)").head res2: org.apache.spark.sql.Row = [3] ``` After ``` scala> sql("select if(assert_true(false),2,3)").head ... ASSERT_TRUE ... ``` Hive ``` hive> select if(assert_true(false),2,3); OK Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed. ``` ## How was this patch tested? Pass the Jenkins tests (including a new testcase in `HivePlanTest`) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12340 from dongjoon-hyun/SPARK-14580.	2016-04-18 12:26:56 -07:00
Tathagata Das	775cf17eaa	[SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming ## What changes were proposed in this pull request? There are many operations that are currently not supported in the streaming execution. For example: - joining two streams - unioning a stream and a batch source - sorting - window functions (not time windows) - distinct aggregates Furthermore, executing a query with a stream source as a batch query should also fail. This patch adds an additional step after analysis in the QueryExecution that checks whether all the operations in the analyzed logical plan are supported. ## How was this patch tested? unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12246 from tdas/SPARK-14473.	2016-04-18 11:09:33 -07:00
Dongjoon Hyun	432d1399cb	[SPARK-14614] [SQL] Add `bround` function ## What changes were proposed in this pull request? This PR aims to add the `bround` function (aka Banker's round) by extending the current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) Hive (1.3 ~ 2.0) ``` hive> select round(2.5), bround(2.5); OK 3.0 2.0 ``` After this PR ```scala scala> sql("select round(2.5), bround(2.5)").head res0: org.apache.spark.sql.Row = [3,2] ``` ## How was this patch tested? Pass the Jenkins tests (with extended tests). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12376 from dongjoon-hyun/SPARK-14614.	2016-04-18 10:44:51 -07:00
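The difference between `round` and `bround` in SPARK-14614 comes down to how ties are broken. The sketch below uses Scala's standard `BigDecimal.setScale` with two rounding modes to show this; the object and method names are hypothetical, and this is not how the Spark expression itself is implemented.

```scala
import scala.math.BigDecimal.RoundingMode

object RoundingSketch {
  // Ties go away from zero, matching the round(2.5) result shown above.
  def halfUp(x: BigDecimal, scale: Int = 0): BigDecimal =
    x.setScale(scale, RoundingMode.HALF_UP)

  // Banker's rounding: ties go to the nearest even digit, as in bround(2.5).
  def halfEven(x: BigDecimal, scale: Int = 0): BigDecimal =
    x.setScale(scale, RoundingMode.HALF_EVEN)
}
```

Values that are not exactly halfway between two candidates round the same way under both modes; only ties such as 2.5 or 3.5 differ.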
hyukjinkwon	9f678e9754	[MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations ## What changes were proposed in this pull request? This PR removes - Inappropriate type notations For example, from ```scala words.foreachRDD { (rdd: RDD[String], time: Time) => ... ``` to ```scala words.foreachRDD { (rdd, time) => ... ``` - Extra anonymous closure within functional transformations. For example, ```scala .map(item => { ... }) ``` which can be just simply as below: ```scala .map { item => ... } ``` and corrects some obvious style nits. ## How was this patch tested? This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly. The rules applied were below: - For the first correction, ```xml <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]$\s[^,]+s=>\s\{[^\}]+\}\s$</parameter></parameters> </check> ``` ```xml <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]\s[\{\|\(]([^\n>,]+=>)?\s\{([^()]\|(?R))\}^[,]</parameter></parameters> </check> ``` - For the second correction ```xml <check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]\s[\{\|\(]\s\([^):]:R))\}^[,]</parameter></parameters> </check> ``` Those rules were not added Author: hyukjinkwon <gurwls223@gmail.com> Closes #12413 from HyukjinKwon/SPARK-style.	2016-04-16 14:56:23 +01:00
Reynold Xin	f4be0946af	[SPARK-14677][SQL] Make the max number of iterations configurable for Catalyst ## What changes were proposed in this pull request? We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization. ## How was this patch tested? Updated unit tests to reflect the change. Author: Reynold Xin <rxin@databricks.com> Closes #12434 from rxin/SPARK-14677.	2016-04-15 20:28:09 -07:00
Yin Huai	b2dfa84959	[SPARK-14668][SQL] Move CurrentDatabase to Catalyst ## What changes were proposed in this pull request? This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following. ``` scala> sqlContext.sql("describe function extended current_database").collect.foreach(println) [Function: current_database] [Class: org.apache.spark.sql.execution.command.CurrentDatabase] [Usage: current_database() - Returns the current database.] [Extended Usage: > SELECT current_database()] ``` ## How was this patch tested? Existing tests Author: Yin Huai <yhuai@databricks.com> Closes #12424 from yhuai/SPARK-14668.	2016-04-15 17:48:41 -07:00
Wenchen Fan	297ba3f1b4	[SPARK-14275][SQL] Reimplement TypedAggregateExpression to DeclarativeAggregate ## What changes were proposed in this pull request? `ExpressionEncoder` is just a container for serialization and deserialization expressions, we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient. One trick is, for each buffer serializer expression, it will reference to the result object of serialization and function call. To avoid re-calculating this result object, we can serialize the buffer object to a single struct field, so that we can use a special `Expression` to only evaluate result object once. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #12067 from cloud-fan/typed_udaf.	2016-04-15 12:10:00 +08:00
Dongjoon Hyun	d7e124edfe	[SPARK-14545][SQL] Improve `LikeSimplification` by adding `a%b` rule ## What changes were proposed in this pull request? Current `LikeSimplification` handles the following four rules. - 'a%' => expr.StartsWith("a") - '%b' => expr.EndsWith("b") - '%a%' => expr.Contains("a") - 'a' => EqualTo("a") This PR adds the following rule. - 'a%b' => expr.Length() >= 2 && expr.StartsWith("a") && expr.EndsWith("b") Here, 2 is statically calculated from "a".size + "b".size. Before ``` scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain() == Physical Plan == WholeStageCodegen : +- Filter a#5 LIKE a%c : +- INPUT +- Generate explode([abc,adc]), false, false, [a#5] +- Scan OneRowRelation[] ``` After ``` scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain() == Physical Plan == WholeStageCodegen : +- Filter ((length(a#5) >= 2) && (StartsWith(a#5, a) && EndsWith(a#5, c))) : +- INPUT +- Generate explode([abc,adc]), false, false, [a#5] +- Scan OneRowRelation[] ``` ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12312 from dongjoon-hyun/SPARK-14545.	2016-04-14 13:34:29 -07:00
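The length guard in the new `a%b` rule of SPARK-14545 is easy to overlook, so here is a minimal sketch on plain strings. The object and method names are hypothetical, and the sketch works on `String` values rather than Catalyst expressions; it only illustrates why the `startsWith` and `endsWith` checks alone are not equivalent to `LIKE 'a%b'`.

```scala
object LikeSketch {
  /** Evaluates the equivalent of `s LIKE 'prefix%suffix'` for literal prefix and suffix. */
  def matchesPrefixSuffix(s: String, prefix: String, suffix: String): Boolean =
    // Without the length check, "a" would wrongly match 'a%a': it both starts
    // and ends with "a", but the pattern needs two separate occurrences.
    s.length >= prefix.length + suffix.length &&
      s.startsWith(prefix) &&
      s.endsWith(suffix)
}
```

For example, `LikeSketch.matchesPrefixSuffix("abc", "a", "c")` holds, while `LikeSketch.matchesPrefixSuffix("a", "a", "a")` does not.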
Liang-Chi Hsieh	28efdd3fd7	[SPARK-14592][SQL] Native support for CREATE TABLE LIKE DDL command ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-14592 This patch adds native support for DDL command `CREATE TABLE LIKE`. The SQL syntax is like: CREATE TABLE table_name LIKE existing_table CREATE TABLE IF NOT EXISTS table_name LIKE existing_table ## How was this patch tested? `HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> This patch had conflicts when merged, resolved by Committer: Andrew Or <andrew@databricks.com> Closes #12362 from viirya/create-table-like.	2016-04-14 11:08:08 -07:00
Liwei Lin	3e27940a19	[SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods should have explicit return types ## What changes were proposed in this pull request? Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110): ```scala def start() // should be: def start(): Unit def stop() // should be: def stop(): Unit ``` These methods exist in core, sql, streaming; this PR fixes them. ## How was this patch tested? N/A ## Which piece of scala style rule led to the changes? the rule was added separately in https://github.com/apache/spark/pull/12396 Author: Liwei Lin <lwlin7@gmail.com> Closes #12389 from lw-lin/public-abstract-methods.	2016-04-14 10:14:38 -07:00
hyukjinkwon	6fc3dc8839	[MINOR][SQL] Remove extra anonymous closure within functional transformations ## What changes were proposed in this pull request? This PR removes extra anonymous closure within functional transformations. For example, ```scala .map(item => { ... }) ``` which can be just simply as below: ```scala .map { item => ... } ``` ## How was this patch tested? Related unit tests and `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12382 from HyukjinKwon/minor-extra-closers.	2016-04-14 09:43:41 +01:00
hyukjinkwon	b4819404a6	[SPARK-14596][SQL] Remove not used SqlNewHadoopRDD and some more unused imports ## What changes were proposed in this pull request? Old `HadoopFsRelation` API includes `buildInternalScan()` which uses `SqlNewHadoopRDD` in `ParquetRelation`. Because now the old API is removed, `SqlNewHadoopRDD` is not used anymore. So, this PR removes `SqlNewHadoopRDD` and several unused imports. This was discussed in https://github.com/apache/spark/pull/12326. ## How was this patch tested? Several related existing unit tests and `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12354 from HyukjinKwon/SPARK-14596.	2016-04-14 15:43:44 +08:00
Davies Liu	dbbe149070	[SPARK-14581] [SQL] push predicates through more logical plans ## What changes were proposed in this pull request? Right now, filter push down only works with Project, Aggregate, Generate and Join; filters can't be pushed through many other plans. This PR adds support for Union, Intersect, Except and all unary plans. ## How was this patch tested? Added tests. Author: Davies Liu <davies@databricks.com> Closes #12342 from davies/filter_hint.	2016-04-13 13:01:13 -07:00
Andrew Or	7d2ed8cc03	[SPARK-14388][SQL] Implement CREATE TABLE ## What changes were proposed in this pull request? This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive. WIP: Note that I haven't verified whether this actually works yet! But I believe it does. ## How was this patch tested? Tests will come in a future commit. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12271 from andrewor14/create-table-ddl.	2016-04-13 11:08:34 -07:00
Davies Liu	372baf0479	[SPARK-14578] [SQL] Fix codegen for CreateExternalRow with nested wide schema ## What changes were proposed in this pull request? With a wide schema, the expressions of fields are split into multiple functions, but the variable for loopVar can't be accessed in the split functions; this PR changes them to class members. ## How was this patch tested? Added regression test. Author: Davies Liu <davies@databricks.com> Closes #12338 from davies/nested_row.	2016-04-12 17:26:37 -07:00
bomeng	bcd2076274	[SPARK-14414][SQL] improve the error message class hierarchy ## What changes were proposed in this pull request? Previously we used `AnalysisException`, `ParseException`, `NoSuchFunctionException` etc. when a parsing error was encountered. I am trying to make this consistent, with minimum code impact on the current implementation, by changing the class hierarchy. 1. `NoSuchItemException` is removed, since it is an abstract class and it simply takes a message string. 2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extend `AnalysisException`, as does `ParseException`; they are all under the `AnalysisException` umbrella, but you can also determine how to use them in a granular way. ## How was this patch tested? The existing test cases should cover this patch. Author: bomeng <bmeng@us.ibm.com> Closes #12314 from bomeng/SPARK-14414.	2016-04-12 13:43:39 -07:00
Davies Liu	85e68b4bea	[SPARK-14562] [SQL] improve constraints propagation in Union ## What changes were proposed in this pull request? Currently, Union only takes the intersection of the constraints from its children and drops all others; we should try to merge them together. This PR tries to merge the constraints that have the same reference but come from different children, for example: `a > 10` and `a < 100` could be merged as `a > 10 \|\| a < 100`. ## How was this patch tested? Added more cases in existing test. Author: Davies Liu <davies@databricks.com> Closes #12328 from davies/union_const.	2016-04-12 12:29:54 -07:00
Dongjoon Hyun	b0f5497e95	[SPARK-14508][BUILD] Add a new ScalaStyle Rule `OmitBracesInCase` ## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule. ``` case: Always omit braces in case clauses. ``` This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code. ## How was this patch tested? Pass the Jenkins tests (including Scala style checking) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12280 from dongjoon-hyun/SPARK-14508.	2016-04-12 00:43:28 -07:00
Andrew Or	83fb96403b	[SPARK-14132][SPARK-14133][SQL] Alter table partition DDLs ## What changes were proposed in this pull request? This implements a few alter table partition commands using the `SessionCatalog`. In particular: ``` ALTER TABLE ... ADD PARTITION ... ALTER TABLE ... DROP PARTITION ... ALTER TABLE ... RENAME PARTITION ... TO ... ``` The following operations are not supported, and an `AnalysisException` with a helpful error message will be thrown if the user tries to use them: ``` ALTER TABLE ... EXCHANGE PARTITION ... ALTER TABLE ... ARCHIVE PARTITION ... ALTER TABLE ... UNARCHIVE PARTITION ... ALTER TABLE ... TOUCH ... ALTER TABLE ... COMPACT ... ALTER TABLE ... CONCATENATE MSCK REPAIR TABLE ... ``` ## How was this patch tested? `DDLSuite`, `DDLCommandSuite` and `HiveDDLCommandSuite` Author: Andrew Or <andrew@databricks.com> Closes #12220 from andrewor14/alter-partition-ddl.	2016-04-11 20:59:45 -07:00
Dongjoon Hyun	5de26194a3	[SPARK-14502] [SQL] Add optimization for Binary Comparison Simplification ## What changes were proposed in this pull request? We can simplifies binary comparisons with semantically-equal operands: 1. Replace '<=>' with 'true' literal. 2. Replace '=', '<=', and '>=' with 'true' literal if both operands are non-nullable. 3. Replace '<' and '>' with 'false' literal if both operands are non-nullable. For example, the following example plan ``` scala> sql("SELECT * FROM (SELECT explode(array(1,2,3)) a) T WHERE a BETWEEN a AND a+7").explain() ... : +- Filter ((a#59 >= a#59) && (a#59 <= (a#59 + 7))) ... ``` will be optimized into the following. ``` : +- Filter (a#47 <= (a#47 + 7)) ``` ## How was this patch tested? Pass the Jenkins tests including new `BinaryComparisonSimplificationSuite`. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12267 from dongjoon-hyun/SPARK-14502.	2016-04-11 09:52:50 -07:00
Davies Liu	652c470309	[SPARK-14528] [SQL] Fix same result of Union ## What changes were proposed in this pull request? This PR fixes sameResult() for Union. ## How was this patch tested? Added regression test. Author: Davies Liu <davies@databricks.com> Closes #12295 from davies/fix_sameResult.	2016-04-11 09:43:16 -07:00
gatorsmile	9f838bd242	[SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table #### What changes were proposed in this pull request? This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we still can capture the user errors if users try to drop a table using `DROP VIEW`. #### How was this patch tested? Modified the existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #12284 from gatorsmile/followupDropTable.	2016-04-10 20:46:15 -07:00
Dongjoon Hyun	a7ce473bd0	[SPARK-14415][SQL] All functions should show usages by command `DESC FUNCTION` ## What changes were proposed in this pull request? Currently, many functions do not show usages, as in the following. ``` scala> sql("desc function extended `sin`").collect().foreach(println) [Function: sin] [Class: org.apache.spark.sql.catalyst.expressions.Sin] [Usage: To be added.] [Extended Usage: To be added.] ``` This PR adds descriptions for functions and adds a testcase to prevent adding functions without usage. ``` scala> sql("desc function extended `sin`").collect().foreach(println); [Function: sin] [Class: org.apache.spark.sql.catalyst.expressions.Sin] [Usage: sin(x) - Returns the sine of x.] [Extended Usage: > SELECT sin(0); 0.0] ``` The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`. ## How was this patch tested? Pass the Jenkins tests (including new testcases.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12185 from dongjoon-hyun/SPARK-14415.	2016-04-10 11:46:45 -07:00
Yin Huai	3fb09afd5e	[SPARK-14506][SQL] HiveClientImpl's toHiveTable misses a table property for external tables ## What changes were proposed in this pull request? For an external table's metadata (in Hive's representation), its table type needs to be EXTERNAL_TABLE. Also, there needs to be a field called EXTERNAL set in the table property with a value of TRUE (for a MANAGED_TABLE it will be FALSE) based on https://github.com/apache/hive/blob/release-1.2.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1095-L1105. HiveClientImpl's toHiveTable misses to set this table property. ## How was this patch tested? Added a new test. Author: Yin Huai <yhuai@databricks.com> Closes #12275 from yhuai/SPARK-14506.	2016-04-09 23:32:17 -07:00
gatorsmile	dfce9665c4	[SPARK-14362][SPARK-14406][SQL] DDL Native Support: Drop View and Drop Table #### What changes were proposed in this pull request? This PR is to provide a native support for DDL `DROP VIEW` and `DROP TABLE`. The PR includes native parsing and native analysis. Based on the HIVE DDL document for [DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL- DropView ), `DROP VIEW` is defined as, Syntax: ```SQL DROP VIEW [IF EXISTS] [db_name.]view_name; ``` - to remove metadata for the specified view. - illegal to use DROP TABLE on a view. - illegal to use DROP VIEW on a table. - this command only works in `HiveContext`. In `SQLContext`, we will get an exception. This PR also handles `DROP TABLE`. Syntax: ```SQL DROP TABLE [IF EXISTS] table_name [PURGE]; ``` - Previously, the `DROP TABLE` command only can drop Hive tables in `HiveContext`. Now, after this PR, this command also can drop temporary table, external table, external data source table in `SQLContext`. - In `HiveContext`, we will not issue an exception if the to-be-dropped table does not exist and users did not specify `IF EXISTS`. Instead, we just log an error message. If `IF EXISTS` is specified, we will not issue any error message/exception. - In `SQLContext`, we will issue an exception if the to-be-dropped table does not exist, unless `IF EXISTS` is specified. - Data will not be deleted if the tables are `external`, unless table type is `managed_table`. #### How was this patch tested? For verifying command parsing, added test cases in `spark/sql/hive/HiveDDLCommandSuite.scala` For verifying command analysis, added test cases in `spark/sql/hive/execution/HiveDDLSuite.scala` Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12146 from gatorsmile/dropView.	2016-04-09 17:40:36 -07:00
Yong Tang	cd2fed7012	[SPARK-14335][SQL] Describe function command returns wrong output ## What changes were proposed in this pull request? …because some of built-in functions are not in function registry. This fix tries to fix issues in `describe function` command where some of the outputs still shows Hive's function because some built-in functions are not in FunctionRegistry. The following built-in functions have been added to FunctionRegistry: ``` - ! * / & % ^ + < <= <=> = == > >= \| ~ and in like not or rlike when ``` The following listed functions are not added, but hard coded in `commands.scala` (hvanhovell): ``` != <> between case ``` Below are the existing result of the above functions that have not been added: ``` spark-sql> describe function `!=`; Function: <> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual Usage: a <> b - Returns TRUE if a is not equal to b ``` ``` spark-sql> describe function `<>`; Function: <> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual Usage: a <> b - Returns TRUE if a is not equal to b ``` ``` spark-sql> describe function `between`; Function: between Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFBetween Usage: between a [NOT] BETWEEN b AND c - evaluate if a is [not] in between b and c ``` ``` spark-sql> describe function `case`; Function: case Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFCase Usage: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f ``` ## How was this patch tested? Existing tests passed. Additional test cases added. Author: Yong Tang <yong.tang.github@outlook.com> Closes #12128 from yongtang/SPARK-14335.	2016-04-09 13:54:30 -07:00
bomeng	10a95781ee	[SPARK-14496][SQL] fix some javadoc typos ## What changes were proposed in this pull request? Minor issues. Found 2 typos while browsing the code. ## How was this patch tested? None. Author: bomeng <bmeng@us.ibm.com> Closes #12264 from bomeng/SPARK-14496.	2016-04-09 22:30:54 +09:00
Jacek Laskowski	6447098013	[SPARK-14402][HOTFIX] Fix ExpressionDescription annotation ## What changes were proposed in this pull request? Fix for the error introduced in `c59abad052`: ``` /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626: error: annotation argument needs to be a constant; found: "_FUNC_(str) - ".+("Returns str, with the first letter of each word in uppercase, all other letters in ").+("lowercase. Words are delimited by white space.") "Returns str, with the first letter of each word in uppercase, all other letters in " + ^ ``` ## How was this patch tested? Local build Author: Jacek Laskowski <jacek@japila.pl> Closes #12192 from jaceklaskowski/SPARK-14402-HOTFIX.	2016-04-08 11:36:41 +01:00
Wenchen Fan	49fb237081	[SPARK-14270][SQL] whole stage codegen support for typed filter ## What changes were proposed in this pull request? We implement typed filter by `MapPartitions`, which doesn't work well with whole stage codegen. This PR uses `Filter` to implement typed filter, so we get whole stage codegen support for free. This PR also introduces `DeserializeToObject` and `SerializeFromObject` to separate serialization logic from object operators, so that it's easier to write optimization rules for adjacent object operators. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12061 from cloud-fan/whole-stage-codegen.	2016-04-07 17:23:34 -07:00
Andrew Or	ae1db91d15	[SPARK-14410][SQL] Push functions existence check into catalog ## What changes were proposed in this pull request? This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of database is now pushed into session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog. No change in functionality is expected. ## How was this patch tested? `SessionCatalogSuite`, `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #12198 from andrewor14/function-exists.	2016-04-07 16:23:17 -07:00
Davies Liu	aa852215f8	[SPARK-12740] [SPARK-13932] support grouping()/grouping_id() in having/order clause ## What changes were proposed in this pull request? This PR brings the support of using grouping()/grouping_id() in HAVING/ORDER BY clause. The resolved grouping()/grouping_id() will be replaced by unresolved "spark_grouping_id" virtual attribute, then resolved by ResolveMissingAttribute. This PR also fixes the HAVING clause that accesses a grouping column that is not present in the SELECT clause, for example: ```sql select count(1) from (select 1 as a) t group by a having a > 0 ``` ## How was this patch tested? Add new tests. Author: Davies Liu <davies@databricks.com> Closes #12235 from davies/grouping_having.	2016-04-07 11:51:34 -07:00
Reynold Xin	e11aa9ec5c	[SPARK-14452][SQL] Explicit APIs in Scala for specifying encoders ## What changes were proposed in this pull request? The Scala Dataset public API currently only allows users to specify encoders through SQLContext.implicits. This is OK but sometimes people want to explicitly get encoders without a SQLContext (e.g. Aggregator implementations). This patch adds public APIs to Encoders class for getting Scala encoders. ## How was this patch tested? None - I will update test cases once https://github.com/apache/spark/pull/12231 is merged. Author: Reynold Xin <rxin@databricks.com> Closes #12232 from rxin/SPARK-14452.	2016-04-07 00:46:57 -07:00
Marcelo Vanzin	21d5ca128b	[SPARK-14134][CORE] Change the package name used for shading classes. The current package name uses a dash, which is a little weird but seemed to work. That is, until a new test tried to mock a class that references one of those shaded types, and then things started failing. Most changes are just noise to fix the logging configs. For reference, SPARK-8815 also raised this issue, although at the time it did not cause any issues in Spark, so it was not addressed. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11941 from vanzin/SPARK-14134.	2016-04-06 19:33:51 -07:00
Herman van Hovell	d76592276f	[SPARK-12610][SQL] Left Anti Join ### What changes were proposed in this pull request? This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys. We currently add support for the following SQL join syntax: SELECT * FROM tbl1 A LEFT ANTI JOIN tbl2 B ON A.Id = B.Id Or using a dataframe: tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti") This PR serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as a good basis for implementing a more efficient `EXCEPT` operator. The PR has been (loosely) based on PR's by both davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/10563); credit should be given where credit is due. This PR adds support for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including codegeneration), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`. ### How was this patch tested? Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from https://github.com/apache/spark/pull/10563. cc davies chenghao-intel rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12214 from hvanhovell/SPARK-12610.	2016-04-06 19:25:10 -07:00
Dongjoon Hyun	d717ae1fd7	[SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ScalaDoc-style multiline comments ## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings. ``` /** In Spark, we don't use the ScalaDoc style so this * is not correct. */ ``` ## How was this patch tested? Pass the Jenkins tests (including `lint-scala`). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12221 from dongjoon-hyun/SPARK-14444.	2016-04-06 16:02:55 -07:00
Davies Liu	5a4b11a901	[SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet reader for wide table ## What changes were proposed in this pull request? 1) fix the RowEncoder for wide table (many columns) by splitting the generate code into multiple functions. 2) Separate DataSourceScan as RowDataSourceScan and BatchedDataSourceScan 3) Disable the returning columnar batch in parquet reader if there are many columns. 4) Added a internal config for maximum number of fields (nested) columns supported by whole stage codegen. Closes #12098 ## How was this patch tested? Add a tests for table with 1000 columns. Author: Davies Liu <davies@databricks.com> Closes #12047 from davies/many_columns.	2016-04-06 15:33:39 -07:00
bomeng	3c8d882165	[SPARK-14383][SQL] missing "\|" in the g4 file ## What changes were proposed in this pull request? A very trivial one. It missed "\|" between DISTRIBUTE and UNSET. ## How was this patch tested? I do not think it is really needed. Author: bomeng <bmeng@us.ibm.com> Closes #12156 from bomeng/SPARK-14383.	2016-04-06 11:12:48 -07:00
bomeng	5abd02c02b	[SPARK-14429][SQL] Improve LIKE pattern in "SHOW TABLES / FUNCTIONS LIKE <pattern>" DDL LIKE <pattern> is commonly used in SHOW TABLES / FUNCTIONS etc DDL. In the pattern, user can use `\|` or `*` as wildcards. 1. Currently, we used `replaceAll()` to replace `*` with `.*`, but the replacement was scattered in several places; I have created a utility method and use it in all the places; 2. Consistency with Hive: the pattern is case insensitive in Hive and white spaces will be trimmed, but current pattern matching does not do that. For example, suppose we have tables (t1, t2, t3), `SHOW TABLES LIKE ' T* ' ` will list all the t-tables. Please use Hive to verify it. 3. Combined with `\|`, the result will be sorted. For pattern like `' B*\|a* '`, it will list the result in a-b order. I've made some changes to the utility method to make sure we will get the same result as Hive does. A new method was created in StringUtil and test cases were added. andrewor14 Author: bomeng <bmeng@us.ibm.com> Closes #12206 from bomeng/SPARK-14429.	2016-04-06 11:06:14 -07:00
Kousuke Saruta	10494feae0	[SPARK-14426][SQL] Merge ParserUtils and ParseUtils ## What changes were proposed in this pull request? We have ParserUtils and ParseUtils which are both utility collections for use during the parsing process. Their names and purposes are very similar, so I think we can merge them. Also, the original unescapeSQLString method may have a fault. When "\u0061" style character literals are passed to the method, they are not unescaped successfully. This patch fixes the bug. ## How was this patch tested? Added a new test case. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12199 from sarutak/merge-ParseUtils-and-ParserUtils.	2016-04-06 10:57:46 -07:00
Wenchen Fan	f6456fa80b	[SPARK-14296][SQL] whole stage codegen support for Dataset.map ## What changes were proposed in this pull request? This PR adds a new operator `MapElements` for `Dataset.map`, it's a 1-1 mapping and is easier to adapt to whole stage codegen framework. ## How was this patch tested? new test in `WholeStageCodegenSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12087 from cloud-fan/map.	2016-04-06 12:09:10 +08:00
Andrew Or	45d8cdee39	[SPARK-14129][SPARK-14128][SQL] Alter table DDL commands ## What changes were proposed in this pull request? In Spark 2.0, we want to handle the most common `ALTER TABLE` commands ourselves instead of passing the entire query text to Hive. This is done using the new `SessionCatalog` API introduced recently. The commands supported in this patch include: ``` ALTER TABLE ... RENAME TO ... ALTER TABLE ... SET TBLPROPERTIES ... ALTER TABLE ... UNSET TBLPROPERTIES ... ALTER TABLE ... SET LOCATION ... ALTER TABLE ... SET SERDE ... ``` The commands we explicitly do not support are: ``` ALTER TABLE ... CLUSTERED BY ... ALTER TABLE ... SKEWED BY ... ALTER TABLE ... NOT CLUSTERED ALTER TABLE ... NOT SORTED ALTER TABLE ... NOT SKEWED ALTER TABLE ... NOT STORED AS DIRECTORIES ``` For these we throw exceptions complaining that they are not supported. ## How was this patch tested? `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #12121 from andrewor14/alter-table-ddl.	2016-04-05 14:54:07 -07:00
Dongjoon Hyun	c59abad052	[SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string ## What changes were proposed in this pull request? Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the `UTF8String.toTitleCase` implementation changes only the first letter and just copies the other letters: e.g. sParK --> SParK. This is the correct implementation of `toTitleCase`. ``` hive> select initcap('sParK'); Spark ``` ``` scala> sql("select initcap('sParK')").head res0: org.apache.spark.sql.Row = [SParK] ``` This PR updates the implementation of `initcap` using `toLowerCase` and `toTitleCase`. ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12175 from dongjoon-hyun/SPARK-14402.	2016-04-05 13:31:00 -07:00
Burak Yavuz	9ee5c25717	[SPARK-14353] Dataset Time Window `window` API for Python, and SQL ## What changes were proposed in this pull request? The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008). This PR adds the Python, and SQL, API for this function. With this PR, SQL, Java, and Scala will share the same APIs as in users can use: - `window(timeColumn, windowDuration)` - `window(timeColumn, windowDuration, slideDuration)` - `window(timeColumn, windowDuration, slideDuration, startTime)` In Python, users can access all APIs above, but in addition they can do - In Python: `window(timeColumn, windowDuration, startTime=...)` that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows. ## How was this patch tested? Unit tests + manual tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #12136 from brkyvz/python-windows.	2016-04-05 13:18:39 -07:00
Yin Huai	72544d6f2a	[SPARK-14123][SPARK-14384][SQL] Handle CreateFunction/DropFunction ## What changes were proposed in this pull request? This PR implements CreateFunction and DropFunction commands. Besides implementing these two commands, we also change how to manage functions. Here are the main changes. * `FunctionRegistry` will be a container to store all functions builders and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext. So, `HiveFunctionRegistry` is deleted. * SessionCatalog takes care the job of loading a function if this function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. For this case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, (4) register the function builder in the `FunctionRegistry`. * A `UnresolvedGenerator` is created. So, the parser will not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we will resolve `UnresolvedGenerator`. This PR is based on viirya's https://github.com/apache/spark/pull/12036/ ## How was this patch tested? Existing tests and new tests. ## TODOs [x] Self-review [x] Cleanup [x] More tests for create/drop functions (we need to more tests for permanent functions). [ ] File JIRAs for all TODOs [x] Standardize the error message when a function does not exist. Author: Yin Huai <yhuai@databricks.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12117 from yhuai/function.	2016-04-05 12:27:06 -07:00
Wenchen Fan	f77f11c671	[SPARK-14345][SQL] Decouple deserializer expression resolution from ObjectOperator ## What changes were proposed in this pull request? This PR decouples deserializer expression resolution from `ObjectOperator`, so that we can use deserializer expression in normal operators. This is needed by #12061 and #12067 , I abstracted the logic out and put them in this PR to reduce code change in the future. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12131 from cloud-fan/separate.	2016-04-05 10:53:54 -07:00
gatorsmile	7807173679	[SPARK-14349][SQL] Issue Error Messages for Unsupported Operators/DML/DDL in SQL Context. #### What changes were proposed in this pull request? Currently, the weird error messages are issued if we use Hive Context-only operations in SQL Context. For example, - When calling `Drop Table` in SQL Context, we got the following message: ``` Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be thrown, but java.lang.ClassCastException was thrown. ``` - When calling `Script Transform` in SQL Context, we got the message: ``` assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, [tKey#155,tValue#156], null +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at BeforeAndAfterAll.scala:187 ``` Updates: Based on the investigation from hvanhovell , the root cause is `visitChildren`, which is the default implementation. It always returns the result of the last defined context child. After merging the code changes from hvanhovell , it works! Thank you hvanhovell ! #### How was this patch tested? A few test cases are added. Not sure if the same issue exist for the other operators/DDL/DML. hvanhovell Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12134 from gatorsmile/hiveParserCommand.	2016-04-05 11:19:46 +02:00
Dilip Biswal	2715bc68bd	[SPARK-14348][SQL] Support native execution of SHOW TBLPROPERTIES command ## What changes were proposed in this pull request? This PR adds native execution of the SHOW TBLPROPERTIES command. Command Syntax: ``` SQL SHOW TBLPROPERTIES table_name[(property_key_literal)] ``` ## How was this patch tested? Tests added in HiveCommandSuite and DDLCommandSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #12133 from dilipbiswal/dkb_show_tblproperties.	2016-04-05 08:41:59 +02:00
Dongjoon Hyun	3f749f7ed4	[SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results ## What changes were proposed in this pull request? This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines). - Fix typos(exception/log strings, testcase name, comments) in 44 lines. - Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011) - Use diamond operators in 40 lines. (New codes after SPARK-13702) - Fix redundant semicolon in 5 lines. - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala. ## How was this patch tested? Manual and pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12139 from dongjoon-hyun/SPARK-14355.	2016-04-03 18:14:16 -07:00
bomeng	c238cd0744	[SPARK-14341][SQL] Throw exception on unsupported create / drop macro ddl ## What changes were proposed in this pull request? We throw an AnalysisException that looks like this: ``` scala> sqlContext.sql("CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x))") org.apache.spark.sql.catalyst.parser.ParseException: Unsupported SQL statement == SQL == CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x)) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:749) ... 48 elided ``` ## How was this patch tested? Add test cases in HiveQuerySuite.scala Author: bomeng <bmeng@us.ibm.com> Closes #12125 from bomeng/SPARK-14341.	2016-04-03 17:15:02 +02:00
Reynold Xin	7be4620508	[HOTFIX] Fix Scala 2.10 compilation	2016-04-02 23:05:23 -07:00
Dongjoon Hyun	4a6e78abd9	[MINOR][DOCS] Use multi-line JavaDoc comments in Scala code. ## What changes were proposed in this pull request? This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes. (All comment-only changes over 77 files: +786 lines, −747 lines) ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.	2016-04-02 17:50:40 -07:00
Dongjoon Hyun	f705037617	[SPARK-14338][SQL] Improve `SimplifyConditionals` rule to handle `null` in IF/CASEWHEN ## What changes were proposed in this pull request? Currently, `SimplifyConditionals` handles `true` and `false` to optimize branches. This PR improves `SimplifyConditionals` to take advantage of `null` conditions for `if` and `CaseWhen` expressions, too. Before ``` scala> sql("SELECT IF(null, 1, 0)").explain() == Physical Plan == WholeStageCodegen : +- Project [if (null) 1 else 0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4] : +- INPUT +- Scan OneRowRelation[] scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain() == Physical Plan == WholeStageCodegen : +- Project [CASE WHEN null THEN 1 ELSE 2 END AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#14] : +- INPUT +- Scan OneRowRelation[] ``` After ``` scala> sql("SELECT IF(null, 1, 0)").explain() == Physical Plan == WholeStageCodegen : +- Project [0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4] : +- INPUT +- Scan OneRowRelation[] scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain() == Physical Plan == WholeStageCodegen : +- Project [2 AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#4] : +- INPUT +- Scan OneRowRelation[] ``` Hive ``` hive> select if(null,1,2); OK 2 hive> select case when cast(null as boolean) then 1 else 2 end; OK 2 ``` ## How was this patch tested? Pass the Jenkins tests (including new extended test cases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12122 from dongjoon-hyun/SPARK-14338.	2016-04-02 17:48:53 -07:00
Jacek Laskowski	06694f1c68	[MINOR] Typo fixes ## What changes were proposed in this pull request? Typo fixes. No functional changes. ## How was this patch tested? Built the sources and ran with samples. Author: Jacek Laskowski <jacek@japila.pl> Closes #11802 from jaceklaskowski/typo-fixes.	2016-04-02 08:12:04 -07:00
Dongjoon Hyun	fa1af0aff7	[SPARK-14251][SQL] Add SQL command for printing out generated code for debugging ## What changes were proposed in this pull request? This PR implements the `EXPLAIN CODEGEN` SQL command, which returns the generated code in the same way as `debugCodegen`. In `spark-shell`, we no longer need to import the `debug` module. In `spark-sql`, we can use this SQL command now. Before ``` scala> import org.apache.spark.sql.execution.debug._ scala> sql("select 'a' as a group by 1").debugCodegen() Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 == ... Generated code: ... == Subtree 2 / 2 == ... Generated code: ... ``` After ``` scala> sql("explain extended codegen select 'a' as a group by 1").collect().foreach(println) [Found 2 WholeStageCodegen subtrees.] [== Subtree 1 / 2 ==] ... [] [Generated code:] ... [] [== Subtree 2 / 2 ==] ... [] [Generated code:] ... ``` ## How was this patch tested? Pass the Jenkins tests (including new testcases) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12099 from dongjoon-hyun/SPARK-14251.	2016-04-01 22:45:52 -07:00
Cheng Lian	27e71a2cd9	[SPARK-14244][SQL] Don't use SizeBasedWindowFunction.n created on executor side when evaluating window functions ## What changes were proposed in this pull request? `SizeBasedWindowFunction.n` is a global singleton attribute created for evaluating size based aggregate window functions like `CUME_DIST`. However, this attribute gets different expression IDs when created on both driver side and executor side. This PR adds `withPartitionSize` method to `SizeBasedWindowFunction` so that we can easily rewrite `SizeBasedWindowFunction.n` on executor side. ## How was this patch tested? A test case is added in `HiveSparkSubmitSuite`, which supports launching multi-process clusters. Author: Cheng Lian <lian@databricks.com> Closes #12040 from liancheng/spark-14244-fix-sized-window-function.	2016-04-01 22:00:24 -07:00
Michael Armbrust	0fc4aaa71c	[SPARK-14255][SQL] Streaming Aggregation This PR adds the ability to perform aggregations inside of a `ContinuousQuery`. In order to implement this feature, the planning of aggregation has been augmented with a new `StatefulAggregationStrategy`. Unlike batch aggregation, stateful-aggregation uses the `StateStore` (introduced in #11645) to persist the results of partial aggregation across different invocations. The resulting physical plan performs the aggregation using the following progression: - Partial Aggregation - Shuffle - Partial Merge (now there is at most 1 tuple per group) - StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous) - Partial Merge (now there is at most 1 tuple per group) - StateStoreSave (saves the tuple for the next batch) - Complete (output the current result of the aggregation) The following refactoring was also performed to allow us to plug into existing code: - The get/put implementation is taken from #12013 - The logic for breaking down and de-duping the physical execution of aggregation has been moved into a new pattern `PhysicalAggregation` - The `AttributeReference` used to identify the result of an `AggregateFunction` has been moved into the `AggregateExpression` container. This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a `Map[(AggregateFunction, Boolean), Attribute]`. Further clean up (using a different aggregation container for logical/physical plans) is deferred to a followup. - Some planning logic is moved from the `SessionState` into the `QueryExecution` to make it easier to override in the streaming case. - The ability to write a `StreamTest` that checks only the output of the last batch has been added to simulate the future addition of output modes. Author: Michael Armbrust <michael@databricks.com> Closes #12048 from marmbrus/statefulAgg.	2016-04-01 15:15:16 -07:00
Burak Yavuz	1b829ce139	[SPARK-14160] Time Windowing functions for Datasets ## What changes were proposed in this pull request? This PR adds the function `window` as a column expression. `window` can be used to bucket rows into time windows given a time column. With this expression, performing time series analysis on batch data, as well as streaming data, should become much simpler. ### Usage Assume the following schema: `sensor_id, measurement, timestamp` To average 5 minute data every 1 minute (window length of 5 minutes, slide duration of 1 minute), we will use: ```scala df.groupBy(window("timestamp", "5 minutes", "1 minute"), "sensor_id") .agg(mean("measurement").as("avg_meas")) ``` This will generate windows such as: ``` 09:00:00-09:05:00 09:01:00-09:06:00 09:02:00-09:07:00 ... ``` Intervals will start at every `slideDuration` starting at the unix epoch (1970-01-01 00:00:00 UTC). To start intervals at a different point of time, e.g. 30 seconds after a minute, the `startTime` parameter can be used. ```scala df.groupBy(window("timestamp", "5 minutes", "1 minute", "30 second"), "sensor_id") .agg(mean("measurement").as("avg_meas")) ``` This will generate windows such as: ``` 09:00:30-09:05:30 09:01:30-09:06:30 09:02:30-09:07:30 ... ``` Support for Python will be added in a follow-up PR. ## How was this patch tested? This patch has some basic unit tests for the `TimeWindow` expression testing that the parameters pass validation, and it also has some unit/integration tests testing the correctness of the windowing and usability in complex operations (multi-column grouping, multi-column projections, joins). Author: Burak Yavuz <brkyvz@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #12008 from brkyvz/df-time-window.	2016-04-01 13:19:24 -07:00
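The window alignment described in SPARK-14160 above (intervals starting every `slideDuration` from the epoch, shifted by `startTime`) can be sketched in plain Scala. This is an illustration under stated assumptions, not the `TimeWindow` implementation: `windowsFor` and its parameters are hypothetical names, times are plain seconds since the epoch, and windows are assumed to be half-open `[start, end)`.

```scala
object TimeWindowSketch {
  /**
   * All windows (start, end), in seconds, that contain `ts`.
   * Assumes half-open windows [start, end) whose starts are aligned to
   * startSec + k * slideSec; this helper is hypothetical, not Spark's API.
   */
  def windowsFor(ts: Long, windowSec: Long, slideSec: Long, startSec: Long = 0L): Seq[(Long, Long)] = {
    require(windowSec > 0 && slideSec > 0 && slideSec <= windowSec)
    // Latest aligned window start that is <= ts.
    val lastStart = ts - Math.floorMod(ts - startSec, slideSec)
    // Walk backwards one slide at a time while the window still covers ts.
    Iterator.iterate(lastStart)(_ - slideSec)
      .takeWhile(s => s + windowSec > ts)
      .map(s => (s, s + windowSec))
      .toList
      .reverse
  }
}
```

For a 5 minute window sliding every 1 minute, each timestamp falls into five overlapping windows, which is why a single row contributes to several groups in the `groupBy(window(...))` examples.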
Liang-Chi Hsieh	a884daad80	[SPARK-14191][SQL] Remove invalid Expand operator constraints The `Expand` operator currently uses its child plan's constraints as its valid constraints (i.e., the base of constraints). This is not correct because `Expand` will set its group by attributes to null values. So the nullability of these attributes should be true. E.g., for an `Expand` operator like: val input = LocalRelation('a.int, 'b.int, 'c.int).where('c.attr > 10 && 'a.attr < 5 && 'b.attr > 2) Expand( Seq( Seq('c, Literal.create(null, StringType), 1), Seq('c, 'a, 2)), Seq('c, 'a, 'gid.int), Project(Seq('a, 'c), input)) The `Project` operator has the constraints `IsNotNull('a)`, `IsNotNull('b)` and `IsNotNull('c)`. But the `Expand` should not have `IsNotNull('a)` in its constraints. This PR is the first step for this issue and removes the invalid constraints of the `Expand` operator. A test is added to `ConstraintPropagationSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Michael Armbrust <michael@databricks.com> Closes #11995 from viirya/fix-expand-constraints.	2016-04-01 13:08:09 -07:00
Liang-Chi Hsieh	df68beb85d	[SPARK-13995][SQL] Extract correct IsNotNull constraints for Expression ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-13995 We currently infer the relevant `IsNotNull` constraints from a logical plan's expressions in `constructIsNotNullConstraints`. However, we don't consider the case of (nested) `Cast`. For example: val tr = LocalRelation('a.int, 'b.long) val plan = tr.where('a.attr === 'b.attr).analyze Then, the plan's constraints will have `IsNotNull(Cast(resolveColumn(tr, "a"), LongType))`, instead of `IsNotNull(resolveColumn(tr, "a"))`. This PR fixes it. Besides, as `IsNotNull` constraints are most useful for `Attribute`, we should recurse through any `Expression` that is null intolerant and construct `IsNotNull` constraints for all `Attribute`s under these Expressions. For example, consider the following constraints: val df = Seq((1,2,3)).toDF("a", "b", "c") df.where("a + b = c").queryExecution.analyzed.constraints The inferred isnotnull constraints should be isnotnull(a), isnotnull(b), isnotnull(c), instead of isnotnull(a + b) and isnotnull(c). ## How was this patch tested? Test is added into `ConstraintPropagationSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #11809 from viirya/constraint-cast.	2016-04-01 13:00:55 -07:00
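The recursion through null-intolerant expressions described in SPARK-13995 above can be sketched on a toy tree. Everything here is an assumption made for illustration: `Expr`, `Attr`, `Add`, `EqualTo`, `Cast`, `Coalesce` and `notNullAttrs` are hypothetical stand-ins, not Catalyst's expression classes or `constructIsNotNullConstraints`.

```scala
// Toy model: every node lists its children and says whether a null input
// always yields a null output (null intolerant). Names are illustrative only.
object IsNotNullSketch {
  sealed abstract class Expr(val children: Seq[Expr], val nullIntolerant: Boolean)
  final case class Attr(name: String) extends Expr(Nil, true)
  final case class Add(l: Expr, r: Expr) extends Expr(Seq(l, r), true)
  final case class EqualTo(l: Expr, r: Expr) extends Expr(Seq(l, r), true)
  final case class Cast(child: Expr, to: String) extends Expr(Seq(child), true)
  // Coalesce can return a non-null value even if some inputs are null.
  final case class Coalesce(args: Seq[Expr]) extends Expr(args, false)

  /** Attributes that must be non-null given that `e` itself is non-null. */
  def notNullAttrs(e: Expr): Set[String] = e match {
    case Attr(n) => Set(n)
    case _ if e.nullIntolerant => e.children.flatMap(notNullAttrs).toSet
    case _ => Set.empty
  }
}
```

For the predicate `a + b = c` this yields `a`, `b` and `c`, and for `cast(a as long) = b` it yields `a` and `b` rather than a constraint on the cast itself, which matches the behavior the commit message asks for.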
sureshthalamati	a471c7f9ea	[SPARK-14133][SQL] Throws exception for unsupported create/drop/alter index, and lock/unlock operations. ## What changes were proposed in this pull request? This PR throws an Unsupported Operation exception for the create index, drop index, alter index, lock table, lock database, unlock table, and unlock database operations, which are not supported in Spark SQL. Currently these operations are executed by Hive. Error: spark-sql> drop index my_index on my_table; Error in query: Unsupported operation: drop index(line 1, pos 0) ## How was this patch tested? Added test cases to HiveQuerySuite yhuai hvanhovell andrewor14 Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #12069 from sureshthalamati/unsupported_ddl_spark-14133.	2016-04-01 18:33:31 +02:00
Dilip Biswal	0b04f8fdf1	[SPARK-14184][SQL] Support native execution of SHOW DATABASE command and fix SHOW TABLE to use table identifier pattern ## What changes were proposed in this pull request? This PR addresses the following 1. Supports native execution of SHOW DATABASES command 2. Fixes SHOW TABLES to apply the identifier_with_wildcards pattern if supplied. SHOW TABLE syntax ``` SHOW TABLES [IN database_name] ['identifier_with_wildcards']; ``` SHOW DATABASES syntax ``` SHOW (DATABASES\|SCHEMAS) [LIKE 'identifier_with_wildcards']; ``` ## How was this patch tested? Tests added in SQLQuerySuite (both hive and sql contexts) and DDLCommandSuite Note: Since the table name pattern was not working , tests are added in both SQLQuerySuite to verify the application of the table pattern. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11991 from dilipbiswal/dkb_show_database.	2016-04-01 18:27:11 +02:00
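The commit above does not spell out how `identifier_with_wildcards` is matched, so the following is only a sketch under an assumed, Hive-like convention: `*` matches any run of characters, `|` separates alternatives, and matching is case-insensitive. `filterNames` is a hypothetical helper, not a Spark API, and it does not escape other regex metacharacters.

```scala
object WildcardSketch {
  /**
   * Keeps the names matching a wildcard pattern such as "show*|tmp_*".
   * Assumed semantics (not confirmed by the commit): '*' = any characters,
   * '|' = alternatives, case-insensitive full match.
   */
  def filterNames(names: Seq[String], pattern: String): Seq[String] = {
    val regexes = pattern.trim.split("\\|").map { p =>
      ("(?i)" + p.trim.replace("*", ".*")).r
    }
    names.filter(n => regexes.exists(_.pattern.matcher(n).matches()))
  }
}
```

Under those assumptions, `SHOW TABLES 'show1*|other'` would correspond to `filterNames(tables, "show1*|other")`.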
gatorsmile	446c45bd87	[SPARK-14182][SQL] Parse DDL Command: Alter View This PR provides native parsing support for the DDL command `Alter View`. Since its AST trees are highly similar to those of `Alter Table`, both implementations are integrated into the same one. Based on the Hive DDL document: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and https://cwiki.apache.org/confluence/display/Hive/PartitionedViews Syntax: ```SQL ALTER VIEW view_name RENAME TO new_view_name ``` - to change the name of a view to a different name Syntax: ```SQL ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment); ``` - to add metadata to a view Syntax: ```SQL ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key') ``` - to remove metadata from a view Syntax: ```SQL ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...] ``` - to add the partitioning metadata for a view. - the syntax of partition spec in `ALTER VIEW` is identical to `ALTER TABLE`, EXCEPT that it is ILLEGAL to specify a `LOCATION` clause. Syntax: ```SQL ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...] ``` - to drop the related partition metadata for a view. Added the related test cases to `DDLCommandSuite` Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11987 from gatorsmile/parseAlterView.	2016-03-31 12:04:03 -07:00
Herman van Hovell	a9b93e0739	[SPARK-14211][SQL] Remove ANTLR3 based parser ### What changes were proposed in this pull request? This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`. ### How was this patch tested? Existing unit tests. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12071 from hvanhovell/SPARK-14211.	2016-03-31 09:25:09 -07:00
Dongjoon Hyun	258a243419	[SPARK-14282][SQL] CodeFormatter should handle oneline comment with /* / properly ## What changes were proposed in this pull request? This PR improves `CodeFormatter` to fix the following malformed indentations. ```java / 019 / public java.lang.Object apply(java.lang.Object _i) { / 020 / InternalRow i = (InternalRow) _i; / 021 / / createexternalrow(if (isnull(input[0, double])) null else input[0, double], if (isnull(input[1, int])) null else input[1, int], ... / / 022 / boolean isNull = false; / 023 / final Object[] values = new Object[2]; / 024 / / if (isnull(input[0, double])) null else input[0, double] / / 025 / / isnull(input[0, double]) / ... / 053 / if (!false && false) { / 054 / / null / / 055 / final int value9 = -1; / 056 / isNull6 = true; / 057 / value6 = value9; / 058 / } else { ... / 077 / return mutableRow; / 078 / } / 079 / } / 080 / ``` After this PR, the code will be formatted like the following. ```java / 019 / public java.lang.Object apply(java.lang.Object _i) { / 020 / InternalRow i = (InternalRow) _i; / 021 / / createexternalrow(if (isnull(input[0, double])) null else input[0, double], if (isnull(input[1, int])) null else input[1, int], ... / / 022 / boolean isNull = false; / 023 / final Object[] values = new Object[2]; / 024 / / if (isnull(input[0, double])) null else input[0, double] / / 025 / / isnull(input[0, double]) / ... / 053 / if (!false && false) { / 054 / / null / / 055 / final int value9 = -1; / 056 / isNull6 = true; / 057 / value6 = value9; / 058 / } else { ... / 077 / return mutableRow; / 078 / } / 079 / } / 080 / ``` Also, this issue fixes the following too. 
(Similar with [SPARK-14185](https://issues.apache.org/jira/browse/SPARK-14185)) ```java 16/03/30 12:39:24 DEBUG WholeStageCodegen: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 / } ``` ```java 16/03/30 12:46:32 DEBUG WholeStageCodegen: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 */ } ``` ## How was this patch tested? Pass the Jenkins tests (including new CodeFormatterSuite testcases.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12072 from dongjoon-hyun/SPARK-14282.	2016-03-30 16:15:37 -07:00
Wenchen Fan	d46c71b39d	[SPARK-14268][SQL] rename toRowExpressions and fromRowExpression to serializer and deserializer in ExpressionEncoder ## What changes were proposed in this pull request? In `ExpressionEncoder`, we use `constructorFor` to build `fromRowExpression` as the `deserializer` in `ObjectOperator`. It's kind of confusing, we should make the name consistent. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12058 from cloud-fan/rename.	2016-03-30 11:03:15 -07:00
gatorsmile	b66b97cd04	[SPARK-14124][SQL] Implement Database-related DDL Commands #### What changes were proposed in this pull request? This PR implements the following four Database-related DDL commands: - `CREATE DATABASE\|SCHEMA [IF NOT EXISTS] database_name` - `DROP DATABASE [IF EXISTS] database_name [RESTRICT\|CASCADE]` - `DESCRIBE DATABASE [EXTENDED] db_name` - `ALTER (DATABASE\|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)` Another PR will be submitted to handle the unsupported commands. In the Database-related DDL commands, we will issue an error exception for `ALTER (DATABASE\|SCHEMA) database_name SET OWNER [USER\|ROLE] user_or_role`. cc yhuai andrewor14 rxin Could you review the changes? Is it in the right direction? Thanks! #### How was this patch tested? Added a few test cases in `command/DDLSuite.scala` for testing DDL command execution in `SQLContext`. Since `HiveContext` also shares the same implementation, the existing test cases in `\hive` also verify the correctness of these commands. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12009 from gatorsmile/dbDDL.	2016-03-29 17:39:52 -07:00
Sameer Agarwal	366cac6fb0	[SPARK-14225][SQL] Cap the length of toCommentSafeString at 128 chars ## What changes were proposed in this pull request? Builds on https://github.com/apache/spark/pull/12022 and (a) appends "..." to truncated comment strings and (b) fixes indentation in lines after the commented strings if they happen to have a `(`, `{`, `)` or `}` ## How was this patch tested? Manually examined the generated code. Author: Sameer Agarwal <sameer@databricks.com> Closes #12044 from sameeragarwal/comment.	2016-03-29 16:46:45 -07:00
Dongjoon Hyun	d612228eff	[MINOR][SQL] Fix typos by replacing 'much' with 'match'. ## What changes were proposed in this pull request? This PR fixes two trivial typos: 'does not much' --> 'does not match'. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12042 from dongjoon-hyun/fix_typo_by_replacing_much_with_match.	2016-03-29 12:45:43 -07:00
Herman van Hovell	27d4ef0c61	[SPARK-14213][SQL] Migrate HiveQl parsing to ANTLR4 parser ### What changes were proposed in this pull request? This PR migrates all HiveQl parsing to the new ANTLR4 parser. This PR is built on top of https://github.com/apache/spark/pull/12011, and we should wait with merging until that one is in (hence the WIP tag). As soon as this PR is merged we can start removing much of the old parser infrastructure. ### How was this patch tested? Existing Hive unit tests. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12015 from hvanhovell/SPARK-14213.	2016-03-28 20:19:21 -07:00
Andrew Or	27aab80695	[SPARK-14013][SQL] Proper temp function support in catalog ## What changes were proposed in this pull request? Session catalog was added in #11750. However, it doesn't really support temporary functions properly; right now we only store the metadata in the form of `CatalogFunction`, but this doesn't make sense for temporary functions because there is no class name. This patch moves the `FunctionRegistry` into the `SessionCatalog`. With this, the user can call `catalog.createTempFunction` and `catalog.lookupFunction` to use the function they registered previously. This is currently still dead code, however. ## How was this patch tested? `SessionCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11972 from andrewor14/temp-functions.	2016-03-28 16:45:02 -07:00
Reynold Xin	b7836492bb	[SPARK-14155][SQL] Hide UserDefinedType interface in Spark 2.0 ## What changes were proposed in this pull request? UserDefinedType is a developer API in Spark 1.x. With very high probability we will create a new API for user-defined type that also works well with column batches as well as encoders (datasets). In Spark 2.0, let's make `UserDefinedType` `private[spark]` first. ## How was this patch tested? Existing unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #11955 from rxin/SPARK-14155.	2016-03-28 16:26:32 -07:00
Andrew Or	eebc8c1c95	[SPARK-13923][SPARK-14014][SQL] Session catalog follow-ups ## What changes were proposed in this pull request? This patch addresses the remaining comments left in #11750 and #11918 after they are merged. For a full list of changes in this patch, just trace the commits. ## How was this patch tested? `SessionCatalogSuite` and `CatalogTestCases` Author: Andrew Or <andrew@databricks.com> Closes #12006 from andrewor14/session-catalog-followup.	2016-03-28 16:25:15 -07:00
Yin Huai	7007f72ba7	[SPARK-13713][SQL][TEST-MAVEN] Add Antlr4 maven plugin. Seems `600c0b69ca` is missing the antlr4 maven plugin. This pr adds it. Author: Yin Huai <yhuai@databricks.com> Closes #12010 from yhuai/mavenAntlr4.	2016-03-28 13:50:42 -07:00
Herman van Hovell	600c0b69ca	[SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 ### What changes were proposed in this pull request? The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4. This parser is based on [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality are currently missing; the plan is to add them in follow-up PRs. This PR is a work in progress, and work needs to be done in the following areas: - [x] Error handling should be improved. - [x] Documentation should be improved. - [x] Multi-Insert needs to be tested. - [ ] Naming and package locations. ### How was this patch tested? Catalyst and SQL unit tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11557 from hvanhovell/ngParser.	2016-03-28 12:31:12 -07:00
Kazuaki Ishizaki	4a7636f2da	[SPARK-13844] [SQL] Generate better code for filters with a non-nullable column ## What changes were proposed in this pull request? This PR simplifies generated code with a non-nullable column. This PR addresses three items: 1. Generate simplified code for and / or 2. Generate better code for divide and remainder with non-zero dividend 3. Pass nullable information into BoundReference at WholeStageCodegen I have attached the generated code with and without this PR ## How was this patch tested? Tested by existing test suites in sql/core Here is a motivating example ```` (0 to 6).map(i => (i.toString, i.toInt)).toDF("k", "v") .filter("v % 2 == 0").filter("v <= 4").filter("v > 1").show() ```` Generated code without this PR ````java /* 032 / protected void processNext() throws java.io.IOException { / 033 / /** PRODUCE: Project [_1#0 AS k#3,_2#1 AS v#4] / / 034 / / 035 / /** PRODUCE: Filter ((isnotnull((_2#1 % 2)) && ((_2#1 % 2) = 0)) && ((_2#1 <= 4) && (_2#1 > 1))) / / 036 / / 037 / /** PRODUCE: INPUT / / 038 / / 039 / while (!shouldStop() && inputadapter_input.hasNext()) { / 040 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 041 / /** CONSUME: Filter ((isnotnull((_2#1 % 2)) && ((_2#1 % 2) = 0)) && ((_2#1 <= 4) && (_2#1 > 1))) / / 042 / / input[1, int] / / 043 / int filter_value1 = inputadapter_row.getInt(1); / 044 / / 045 / / isnotnull((input[1, int] % 2)) / / 046 / / (input[1, int] % 2) / / 047 / boolean filter_isNull3 = false; / 048 / int filter_value3 = -1; / 049 / if (false \|\| 2 == 0) { / 050 / filter_isNull3 = true; / 051 / } else { / 052 / if (false) { / 053 / filter_isNull3 = true; / 054 / } else { / 055 / filter_value3 = (int)(filter_value1 % 2); / 056 / } / 057 / } / 058 / if (!(!(filter_isNull3))) continue; / 059 / / 060 / / ((input[1, int] % 2) = 0) / / 061 / boolean filter_isNull6 = true; / 062 / boolean filter_value6 = false; / 063 / / (input[1, int] % 2) / / 064 / boolean filter_isNull7 = false; / 
065 / int filter_value7 = -1; / 066 / if (false \|\| 2 == 0) { / 067 / filter_isNull7 = true; / 068 / } else { / 069 / if (false) { / 070 / filter_isNull7 = true; / 071 / } else { / 072 / filter_value7 = (int)(filter_value1 % 2); / 073 / } / 074 / } / 075 / if (!filter_isNull7) { / 076 / filter_isNull6 = false; // resultCode could change nullability. / 077 / filter_value6 = filter_value7 == 0; / 078 / / 079 / } / 080 / if (filter_isNull6 \|\| !filter_value6) continue; / 081 / / 082 / / (input[1, int] <= 4) / / 083 / boolean filter_value11 = false; / 084 / filter_value11 = filter_value1 <= 4; / 085 / if (!filter_value11) continue; / 086 / / 087 / / (input[1, int] > 1) / / 088 / boolean filter_value14 = false; / 089 / filter_value14 = filter_value1 > 1; / 090 / if (!filter_value14) continue; / 091 / / 092 / filter_metricValue.add(1); / 093 / / 094 / /** CONSUME: Project [_1#0 AS k#3,_2#1 AS v#4] / / 095 / / 096 / / input[0, string] / / 097 / / input[0, string] / / 098 / boolean filter_isNull = inputadapter_row.isNullAt(0); / 099 / UTF8String filter_value = filter_isNull ? 
null : (inputadapter_row.getUTF8String(0)); / 100 / project_holder.reset(); / 101 / / 102 / project_rowWriter.zeroOutNullBytes(); / 103 / / 104 / if (filter_isNull) { / 105 / project_rowWriter.setNullAt(0); / 106 / } else { / 107 / project_rowWriter.write(0, filter_value); / 108 / } / 109 / / 110 / project_rowWriter.write(1, filter_value1); / 111 / project_result.setTotalSize(project_holder.totalSize()); / 112 / append(project_result.copy()); / 113 / } / 114 / } / 115 / } ```` Generated code with this PR ````java / 032 / protected void processNext() throws java.io.IOException { / 033 / /** PRODUCE: Project [_1#0 AS k#3,_2#1 AS v#4] / / 034 / / 035 / /** PRODUCE: Filter (((_2#1 % 2) = 0) && ((_2#1 <= 5) && (_2#1 > 1))) / / 036 / / 037 / /** PRODUCE: INPUT / / 038 / / 039 / while (!shouldStop() && inputadapter_input.hasNext()) { / 040 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 041 / /** CONSUME: Filter (((_2#1 % 2) = 0) && ((_2#1 <= 5) && (_2#1 > 1))) / / 042 / / input[1, int] / / 043 / int filter_value1 = inputadapter_row.getInt(1); / 044 / / 045 / / ((input[1, int] % 2) = 0) / / 046 / / (input[1, int] % 2) / / 047 / int filter_value3 = (int)(filter_value1 % 2); / 048 / / 049 / boolean filter_value2 = false; / 050 / filter_value2 = filter_value3 == 0; / 051 / if (!filter_value2) continue; / 052 / / 053 / / (input[1, int] <= 5) / / 054 / boolean filter_value7 = false; / 055 / filter_value7 = filter_value1 <= 5; / 056 / if (!filter_value7) continue; / 057 / / 058 / / (input[1, int] > 1) / / 059 / boolean filter_value10 = false; / 060 / filter_value10 = filter_value1 > 1; / 061 / if (!filter_value10) continue; / 062 / / 063 / filter_metricValue.add(1); / 064 / / 065 / /** CONSUME: Project [_1#0 AS k#3,_2#1 AS v#4] / / 066 / / 067 / / input[0, string] / / 068 / / input[0, string] / / 069 / boolean filter_isNull = inputadapter_row.isNullAt(0); / 070 / UTF8String filter_value = filter_isNull ? 
null : (inputadapter_row.getUTF8String(0)); / 071 / project_holder.reset(); / 072 / / 073 / project_rowWriter.zeroOutNullBytes(); / 074 / / 075 / if (filter_isNull) { / 076 / project_rowWriter.setNullAt(0); / 077 / } else { / 078 / project_rowWriter.write(0, filter_value); / 079 / } / 080 / / 081 / project_rowWriter.write(1, filter_value1); / 082 / project_result.setTotalSize(project_holder.totalSize()); / 083 / append(project_result.copy()); / 084 / } / 085 / } / 086 */ } ```` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #11684 from kiszk/SPARK-13844.	2016-03-28 10:35:48 -07:00
Kousuke Saruta	aac13fb48c	[SPARK-14185][SQL][MINOR] Make indentation of debug log for generated code proper ## What changes were proposed in this pull request? The indentation of the debug log output by `CodeGenerator` is off. The first line of the generated code should start on the line after the first line of the log message. ``` 16/03/28 11:10:24 DEBUG CodeGenerator: /* 001 */ /* 002 */ public java.lang.Object generate(Object[] references) { /* 003 */ return new SpecificSafeProjection(references); ... ``` After this patch is applied, the debug log looks like the following. ``` 16/03/28 10:45:50 DEBUG CodeGenerator: /* 001 */ /* 002 */ public java.lang.Object generate(Object[] references) { /* 003 */ return new SpecificSafeProjection(references); ... ``` ## How was this patch tested? Ran some jobs and checked debug logs. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #11990 from sarutak/fix-debuglog-indentation.	2016-03-27 23:50:23 -07:00
Dongjoon Hyun	1808465855	[MINOR] Fix newly added java-lint errors ## What changes were proposed in this pull request? This PR fixes some newly added java-lint errors (unused-imports, line-length). ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11968 from dongjoon-hyun/SPARK-14167.	2016-03-26 11:55:49 +00:00
Sameer Agarwal	afd0debe07	[SPARK-14137] [SPARK-14150] [SQL] Infer IsNotNull constraints from non-nullable attributes ## What changes were proposed in this pull request? This PR adds support for automatically inferring `IsNotNull` constraints from any non-nullable attributes that are part of an operator's output. This also fixes the issue that causes the optimizer to hit the maximum number of iterations for certain queries in https://github.com/apache/spark/pull/11828. ## How was this patch tested? Unit test in `ConstraintPropagationSuite` Author: Sameer Agarwal <sameer@databricks.com> Closes #11953 from sameeragarwal/infer-isnotnull.	2016-03-25 12:57:26 -07:00
Liang-Chi Hsieh	ca003354da	[SPARK-12443][SQL] encoderFor should support Decimal ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-12443 `constructorFor` will call `dataTypeFor` to determine if a type is `ObjectType` or not. If there is no case for `Decimal`, it will be recognized as `ObjectType`, which causes the bug. ## How was this patch tested? Test is added into `ExpressionEncoderSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10399 from viirya/fix-encoder-decimal.	2016-03-25 12:07:56 -07:00
Wenchen Fan	43b15e01c4	[SPARK-14061][SQL] implement CreateMap ## What changes were proposed in this pull request? As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, and the DataFrame API, and python API. ## How was this patch tested? various new tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11879 from cloud-fan/create_map.	2016-03-25 09:50:06 -07:00
Davies Liu	6603d9f7e2	[SPARK-13919] [SQL] fix column pruning through filter ## What changes were proposed in this pull request? This PR fixes the conflict between ColumnPruning and PushPredicatesThroughProject: ColumnPruning tries to insert a Project before the Filter, but PushPredicatesThroughProject moves the Filter before the Project. This is fixed by removing the Project before the Filter if the Project only does column pruning. The RuleExecutor will fail the test if it reaches max iterations. Closes #11745 ## How was this patch tested? Existing tests. One test case is still failing and is disabled for now; it will be fixed by https://issues.apache.org/jira/browse/SPARK-14137 Author: Davies Liu <davies@databricks.com> Closes #11828 from davies/fail_rule.	2016-03-25 09:05:23 -07:00
Wenchen Fan	e9b6e7d857	[SPARK-13456][SQL][FOLLOW-UP] lazily generate the outer pointer for case class defined in REPL ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/11410, we missed a corner case: defining the inner class and using it in a `Dataset` at the same time by using paste mode. In this case, the inner class and the `Dataset` are inside the same line object. When we build the `Dataset`, we try to get the outer pointer from the line object, and it fails because the line object is not initialized yet. https://issues.apache.org/jira/browse/SPARK-13456?focusedCommentId=15209174&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209174 is an example of this corner case. This PR makes the process of getting the outer pointer from the line object lazy, so that we can successfully build the `Dataset` and finish initializing the line object. ## How was this patch tested? new test in repl suite. Author: Wenchen Fan <wenchen@databricks.com> Closes #11931 from cloud-fan/repl.	2016-03-25 20:19:04 +08:00
Andrew Or	20ddf5fddf	[SPARK-14014][SQL] Integrate session catalog (attempt #2 ) ## What changes were proposed in this pull request? This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests. ## How was this patch tested? See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11938 from andrewor14/session-catalog-again.	2016-03-24 22:59:35 -07:00
Reynold Xin	3619fec1ec	[SPARK-14142][SQL] Replace internal use of unionAll with union ## What changes were proposed in this pull request? unionAll has been deprecated in SPARK-14088. ## How was this patch tested? Should be covered by all existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #11946 from rxin/SPARK-14142.	2016-03-24 22:34:55 -07:00
gatorsmile	05f652d6c2	[SPARK-13957][SQL] Support Group By Ordinal in SQL #### What changes were proposed in this pull request? This PR is to support group by position in SQL. For example, when users input the following query ```SQL select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4 ``` The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to ```SQL select c1, c2, c3, sum(*) from tbl group by c1, c3, c4 ``` This is controlled by the config option `spark.sql.groupByOrdinal`. - When true, the ordinal numbers in group by clauses are treated as the position in the select list. - When false, the ordinal numbers are ignored. - Only integer literals are converted (not foldable expressions). Foldable expressions, if found, are ignored. - When the positions specified in the group by clauses correspond to aggregate functions in the select list, an exception message is output. - Star is not allowed in the select list when users specify ordinals in group by. Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few test cases for both positive and negative test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11846 from gatorsmile/groupByOrdinal.	2016-03-25 12:55:58 +08:00
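The ordinal handling described in the SPARK-13957 entry above can be illustrated with a small sketch. This is not Spark's `Analyzer` code: the function name `resolve_group_by_ordinals` and the `(expression, is_aggregate)` representation of select-list items are hypothetical, chosen only to show the rule that an integer literal in `group by` is replaced by the select-list expression at that position, and that a position pointing at an aggregate is rejected.

```python
from typing import List, Tuple, Union

# Hypothetical representation of a select-list item: (expression text, is_aggregate).
SelectItem = Tuple[str, bool]


def resolve_group_by_ordinals(
    select_list: List[SelectItem],
    group_by: List[Union[int, str]],
) -> List[str]:
    """Replace 1-based integer ordinals in group_by with select-list expressions."""
    resolved = []
    for key in group_by:
        if isinstance(key, int):
            # Ordinals are 1-based positions in the select list.
            if not 1 <= key <= len(select_list):
                raise ValueError(f"GROUP BY position {key} is out of range")
            expr, is_aggregate = select_list[key - 1]
            if is_aggregate:
                raise ValueError(
                    f"GROUP BY position {key} refers to an aggregate: {expr}"
                )
            resolved.append(expr)
        else:
            # Non-ordinal keys (plain column names here) pass through unchanged.
            resolved.append(key)
    return resolved


# Mirrors the query in the commit message: group by 1, 3, c4.
select_list = [("c1", False), ("c2", False), ("c3", False), ("sum(*)", True)]
print(resolve_group_by_ordinals(select_list, [1, 3, "c4"]))
```

With the select list from the message, `group by 1, 3, c4` resolves to `c1, c3, c4`, while `group by 4` raises an error because position 4 is the aggregate.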
Andrew Or	c44d140cae	Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog" This reverts commit `5dfc01976b`.	2016-03-23 22:21:15 -07:00
gatorsmile	f42eaf42bd	[SPARK-14085][SQL] Star Expansion for Hash #### What changes were proposed in this pull request? This PR is to support star expansion in hash. For example, ```SQL val structDf = testData2.select("a", "b").as("record") structDf.select(hash($"*")) ``` In addition, it refactors the code for the rule `ResolveStar` and fixes a regression for star expansion in group by when using SQL API. For example, ```SQL SELECT * FROM testData2 group by a, b ``` cc cloud-fan Now, the code for star resolution is much cleaner. The coverage is better. Could you check if this refactoring is good? Thanks! #### How was this patch tested? Added a few test cases to cover it. Author: gatorsmile <gatorsmile@gmail.com> Closes #11904 from gatorsmile/starResolution.	2016-03-24 11:13:36 +08:00
Andrew Or	5dfc01976b	[SPARK-14014][SQL] Replace existing catalog with SessionCatalog ## What changes were proposed in this pull request? `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely: - SPARK-14013: Properly implement temporary functions in `SessionCatalog` - SPARK-13879: Decide which DDL/DML commands to support natively in Spark - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`. - SPARK-?????: Merge SQL/HiveContext ## How was this patch tested? This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #11836 from andrewor14/use-session-catalog.	2016-03-23 13:34:22 -07:00
Herman van Hovell	919bf32198	[SPARK-13325][SQL] Create a 64-bit hashcode expression This PR introduces a 64-bit hashcode expression. Such an expression is especially useful for HyperLogLog++ and other probabilistic data structures. I have implemented xxHash64 which is a 64-bit hashing algorithm created by Yann Collet and Mathias Westerdahl. This is a high speed (C implementation runs at memory bandwidth) and high quality hashcode. It exploits both Instruction Level Parallelism (for speed) and the multiplication and rotation techniques (for quality) like MurMurHash does. The initial results are promising. I have added a CG'ed test to the `HashBenchmark`, which gives the following results (running from SBT): Running benchmark: Hash For simple Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For simple: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- interpreted version 1011 / 1016 132.8 7.5 1.0X codegen version 1864 / 1869 72.0 13.9 0.5X codegen version 64-bit 1614 / 1644 83.2 12.0 0.6X Running benchmark: Hash For normal Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For normal: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- interpreted version 2467 / 2475 0.9 1176.1 1.0X codegen version 2008 / 2115 1.0 957.5 1.2X codegen version 64-bit 728 / 758 2.9 347.0 3.4X Running benchmark: Hash For array Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative 
------------------------------------------------------------------------------------------- interpreted version 1544 / 1707 0.1 11779.6 1.0X codegen version 2728 / 2745 0.0 20815.5 0.6X codegen version 64-bit 2508 / 2549 0.1 19132.8 0.6X Running benchmark: Hash For map Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For map: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- interpreted version 1819 / 1826 0.0 444014.3 1.0X codegen version 183 / 194 0.0 44642.9 9.9X codegen version 64-bit 173 / 174 0.0 42120.9 10.5X This shows that the algorithm is consistently faster than MurMurHash32 in all cases and up to 3x (!) in the normal case. I have also added this to HyperLogLog++ and it cuts the processing time of the following code in half: val df = sqlContext.range(1<<25).agg(approxCountDistinct("id")) df.explain() val t = System.nanoTime() df.show() val ns = System.nanoTime() - t // Before ns: Long = 5821524302 // After ns: Long = 2836418963 cc cloud-fan (you have been working on hashcodes) / rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11209 from hvanhovell/xxHash.	2016-03-23 20:51:01 +01:00
Josh Rosen	3de24ae2ed	[SPARK-14075] Refactor MemoryStore to be testable independent of BlockManager This patch refactors the `MemoryStore` so that it can be tested without needing to construct / mock an entire `BlockManager`. - The block manager's serialization- and compression-related methods have been moved from `BlockManager` to `SerializerManager`. - `BlockInfoManager` is now passed directly to classes that need it, rather than being passed via the `BlockManager`. - The `MemoryStore` now calls `dropFromMemory` via a new `BlockEvictionHandler` interface rather than directly calling the `BlockManager`. This change helps to enforce a narrow interface between the `MemoryStore` and `BlockManager` functionality and makes this interface easier to mock in tests. - Several of the block unrolling tests have been moved from `BlockManagerSuite` into a new `MemoryStoreSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #11899 from JoshRosen/reduce-memorystore-blockmanager-coupling.	2016-03-23 10:15:23 -07:00
gatorsmile	6ce008ba46	[SPARK-13549][SQL] Refactor the Optimizer Rule CollapseProject #### What changes were proposed in this pull request? The PR https://github.com/apache/spark/pull/10541 changed the rule `CollapseProject` by enabling collapsing `Project` into `Aggregate`. It leaves a to-do item to remove the duplicate code. This PR is to finish this to-do item. Also added a test case for covering this change. #### How was this patch tested? Added a new test case. liancheng Could you check if the code refactoring is fine? Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #11427 from gatorsmile/collapseProjectRefactor.	2016-03-24 00:51:31 +08:00
Dongjoon Hyun	1a22cf1e9b	[MINOR][SQL][DOCS] Update `sql/README.md` and remove some unused imports in `sql` module. ## What changes were proposed in this pull request? This PR updates `sql/README.md` according to the latest console output and removes some unused imports in the `sql` module. This was done manually, so there is no guarantee that all unused imports were removed. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11907 from dongjoon-hyun/update_sql_module.	2016-03-22 23:07:49 -07:00
Davies Liu	4700adb98e	[SPARK-13806] [SQL] fix rounding mode of negative float/double ## What changes were proposed in this pull request? Round() in databases usually rounds ties up (away from zero), which differs from Math.round() in Java. For example: ``` scala> java.lang.Math.round(-3.5) res3: Long = -3 ``` In a database, we should return -4.0 in this case. This PR removes the buggy special case for scale=0. ## How was this patch tested? Added tests for negative values with ties. Author: Davies Liu <davies@databricks.com> Closes #11894 from davies/fix_round.	2016-03-22 16:45:20 -07:00
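The difference described in the SPARK-13806 entry above can be reproduced outside Spark. The sketch below is illustrative only: `java_math_round` models Java's `Math.round(double)` as `floor(x + 0.5)` for ordinary finite values, and `sql_round` is a hypothetical helper that assumes the half-away-from-zero tie-breaking the message describes, implemented with Python's `decimal` module.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def java_math_round(x: float) -> int:
    """Model of java.lang.Math.round(double): ties go toward positive infinity."""
    return math.floor(x + 0.5)


def sql_round(x: float, scale: int = 0) -> Decimal:
    """Hypothetical database-style round: ties go away from zero (HALF_UP)."""
    # Decimal(1).scaleb(-scale) is the quantum, e.g. 1 for scale=0, 0.1 for scale=1.
    quantum = Decimal(1).scaleb(-scale)
    # Going through str(x) avoids carrying binary floating-point noise into Decimal.
    return Decimal(str(x)).quantize(quantum, rounding=ROUND_HALF_UP)


print(java_math_round(-3.5))  # tie rounded toward positive infinity
print(sql_round(-3.5))        # tie rounded away from zero
```

For positive ties the two agree (both give 3 for 2.5); they only diverge for negative ties such as -3.5, which is the case the commit addresses.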
Dongjoon Hyun	c632bdc01f	[SPARK-14029][SQL] Improve BooleanSimplification optimization by implementing `Not` canonicalization. ## What changes were proposed in this pull request? Currently, BooleanSimplification optimization can handle the following cases. * a && (!a || b ) ==> a && b * a && (b || !a ) ==> a && b However, it cannot handle the following cases since those equations fail at the comparisons between their canonicalized forms. * a < 1 && (!(a < 1) || b) ==> (a < 1) && b * a <= 1 && (!(a <= 1) || b) ==> (a <= 1) && b * a > 1 && (!(a > 1) || b) ==> (a > 1) && b * a >= 1 && (!(a >= 1) || b) ==> (a >= 1) && b This PR implements the above cases and also the following ones. * a < 1 && ((a >= 1) || b ) ==> (a < 1) && b * a <= 1 && ((a > 1) || b ) ==> (a <= 1) && b * a > 1 && ((a <= 1) || b) ==> (a > 1) && b * a >= 1 && ((a < 1) || b) ==> (a >= 1) && b ## How was this patch tested? Pass the Jenkins tests including new test cases in BooleanSimplificationSuite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11851 from dongjoon-hyun/SPARK-14029.	2016-03-22 10:17:08 -07:00
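The rewrites listed in the SPARK-14029 entry above can be sanity-checked by brute force. The sketch below is not the optimizer rule itself: it only evaluates both sides of each rewrite over a few sample values of `a` and both values of `b`, using plain two-valued Python booleans, so it deliberately ignores SQL NULL semantics. The helper name `holds_for_all` and the sample values are assumptions made for illustration.

```python
from itertools import product


def holds_for_all(lhs, rhs, a_values, b_values=(False, True)) -> bool:
    """Return True if lhs(a, b) == rhs(a, b) for every sampled (a, b) pair."""
    return all(lhs(a, b) == rhs(a, b) for a, b in product(a_values, b_values))


# Sample points below, at, and above the literal 1 used in the commit message.
a_values = [0, 0.5, 1, 1.5, 2]

# (original predicate, simplified predicate) pairs taken from the message.
rewrites = [
    (lambda a, b: a < 1 and (not (a < 1) or b), lambda a, b: a < 1 and b),
    (lambda a, b: a <= 1 and (not (a <= 1) or b), lambda a, b: a <= 1 and b),
    (lambda a, b: a > 1 and (not (a > 1) or b), lambda a, b: a > 1 and b),
    (lambda a, b: a >= 1 and (not (a >= 1) or b), lambda a, b: a >= 1 and b),
    (lambda a, b: a < 1 and ((a >= 1) or b), lambda a, b: a < 1 and b),
    (lambda a, b: a <= 1 and ((a > 1) or b), lambda a, b: a <= 1 and b),
    (lambda a, b: a > 1 and ((a <= 1) or b), lambda a, b: a > 1 and b),
    (lambda a, b: a >= 1 and ((a < 1) or b), lambda a, b: a >= 1 and b),
]

results = [holds_for_all(lhs, rhs, a_values) for lhs, rhs in rewrites]
print(results)
```

The last four pairs rely on the fact that `a >= 1` is the negation of `a < 1` (and likewise for the other comparisons), which is what `Not` canonicalization lets the rule recognize.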
Cheng Lian	f2e855fba8	[SPARK-13473][SQL] Simplifies PushPredicateThroughProject ## What changes were proposed in this pull request? This is a follow-up of PR #11348. After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, it's impossible that the candidate filter condition can reference any non-deterministic projected fields, and related logic can be safely cleaned up. To be more specific, the following optimization is allowed: ```scala // From: df.select('a, 'b).filter('c > rand(42)) // To: df.filter('c > rand(42)).select('a, 'b) ``` while this isn't: ```scala // From: df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb) // To: df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c) ``` ## How was this patch tested? Existing test cases should do the work. Author: Cheng Lian <lian@databricks.com> Closes #11864 from liancheng/spark-13473-cleanup.	2016-03-22 19:20:56 +08:00
gatorsmile	3f49e0766f	[SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star This PR resolves two issues: First, expanding * inside aggregate functions of structs when using Dataframe/Dataset APIs. For example, ```scala structDf.groupBy($"a").agg(min(struct($"record.*"))) ``` Second, it improves the error messages for invalid star usage when using Dataframe/Dataset APIs. For example, ```scala pagecounts4PartitionsDS .map(line => (line._1, line._3)) .toDF() .groupBy($"_1") .agg(sum("*") as "sumOccurances") ``` Before the fix, the invalid usage will issue a confusing error message, like: ``` org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2; ``` After the fix, the message is like: ``` org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum' ``` cc: rxin nongli cloud-fan Author: gatorsmile <gatorsmile@gmail.com> Closes #11208 from gatorsmile/sumDataSetResolution.	2016-03-22 08:21:02 +08:00
Wenchen Fan	f3717fc7c9	[SPARK-14004][FOLLOW-UP] Implementations of NonSQLExpression should not override sql method ## What changes were proposed in this pull request? There is only one exception: `PythonUDF`. However, I don't think the `PythonUDF#` prefix is useful, as we can only create a Python UDF in a Python context. This PR removes the `PythonUDF#` prefix from `PythonUDF.toString`, so that it doesn't need to override `sql`. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11859 from cloud-fan/tmp.	2016-03-21 15:24:18 -07:00
Cheng Lian	5d8de16e71	[SPARK-14004][SQL] NamedExpressions should have at most one qualifier ## What changes were proposed in this pull request? This is a more aggressive version of PR #11820, which not only fixes the original problem, but also does the following updates to enforce the at-most-one-qualifier constraint: - Renames `NamedExpression.qualifiers` to `NamedExpression.qualifier` - Uses `Option[String]` rather than `Seq[String]` for `NamedExpression.qualifier` Quoted PR description of #11820 here: > Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`. ## How was this patch tested? Existing tests should be enough. Author: Cheng Lian <lian@databricks.com> Closes #11822 from liancheng/spark-14004-aggressive.	2016-03-21 11:00:09 -07:00
Wenchen Fan	43ebf7a9cb	[SPARK-13456][SQL] fix creating encoders for case classes defined in Spark shell ## What changes were proposed in this pull request? Case classes defined in the REPL are wrapped by line classes, and we have a trick for the Scala 2.10 REPL to automatically register the wrapper classes to `OuterScope` so that we can use them when creating encoders. However, this trick stopped working after we upgraded to Scala 2.11, and unfortunately the tests exist only for Scala 2.10, which kept this bug hidden until now. This PR moves the encoder tests to the Scala 2.11 `ReplSuite`, and fixes this bug with another approach (the previous trick can't be ported to the Scala 2.11 REPL): make `OuterScope` smarter so that it can detect classes defined in the REPL and load the singleton of line wrapper classes automatically. ## How was this patch tested? The migrated encoder tests in `ReplSuite`. Author: Wenchen Fan <wenchen@databricks.com> Closes #11410 from cloud-fan/repl.	2016-03-21 10:37:24 -07:00
Wenchen Fan	17a3f00676	[SPARK-14000][SQL] case class with a tuple field can't work in Dataset ## What changes were proposed in this pull request? When we validate an encoder, we may call `dataType` on unresolved expressions. This PR fixes the validation so that we resolve attributes first. ## How was this patch tested? A new test in `DatasetSuite`. Author: Wenchen Fan <wenchen@databricks.com> Closes #11816 from cloud-fan/encoder.	2016-03-21 22:22:15 +08:00
gatorsmile	2c5b18fb0f	[SPARK-12789][SQL] Support Order By Ordinal in SQL #### What changes were proposed in this pull request? This PR is to support order by position in SQL, e.g. ```SQL select c1, c2, c3 from tbl order by 1 desc, 3 ``` should be equivalent to ```SQL select c1, c2, c3 from tbl order by c1 desc, c3 asc ``` This is controlled by the config option `spark.sql.orderByOrdinal`. - When true, the ordinal numbers are treated as the position in the select list. - When false, the ordinal numbers in order/sort by clauses are ignored. - Only integer literals are converted (not foldable expressions). Foldable expressions, if found, are ignored. - This also works with select *. Question: Do we still need to sort by columns that contain zero references? In this case, it will have no impact on the sorting results. IMO, we should not allow users to do it. rxin cloud-fan marmbrus yhuai hvanhovell -- Update: in these cases, they are ignored. Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few test cases for both positive and negative test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #11815 from gatorsmile/orderByPosition.	2016-03-21 18:08:41 +08:00
Dongjoon Hyun	20fd254101	[SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle rule ## What changes were proposed in this pull request? [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has a 100-character limit on lines, but it has been disabled for Java since 11/09/15. This PR enables the LineLength checkstyle rule again. To help with that, it also introduces RedundantImport and RedundantModifier. The following is the diff on `checkstyle.xml`. ```xml - <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places --> - <!-- <module name="LineLength"> <property name="max" value="100"/> <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/> </module> - --> <module name="NoLineWrap"/> <module name="EmptyBlock"> <property name="option" value="TEXT"/> @@ -167,5 +164,7 @@ </module> <module name="CommentsIndentation"/> <module name="UnusedImports"/> + <module name="RedundantImport"/> + <module name="RedundantModifier"/> ``` ## How was this patch tested? Currently, `lint-java` is disabled in Jenkins. It needs a manual test. After passing the Jenkins tests, `dev/lint-java` should pass locally. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11831 from dongjoon-hyun/SPARK-14011.	2016-03-21 07:58:57 +00:00

2002 commits