ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yin Huai	238fb485be	[SPARK-13972][SQL][FOLLOW-UP] When creating the query execution for a converted SQL query, we eagerly trigger analysis ## What changes were proposed in this pull request? As part of testing generating SQL query from a analyzed SQL plan, we run the generated SQL for tests in HiveComparisonTest. This PR makes the generated SQL get eagerly analyzed. So, when a generated SQL has any analysis error, we can see the error message created by ``` case NonFatal(e) => fail( s"""Failed to analyze the converted SQL string: \| \|# Original HiveQL query string: \|$queryString \| \|# Resolved query plan: \|${originalQuery.analyzed.treeString} \| \|# Converted SQL query string: \|$convertedSQL """.stripMargin, e) ``` Right now, if we can parse a generated SQL but fail to analyze it, we will see error message generated by the following code (it only mentions that we cannot execute the original query, i.e. `queryString`). ``` case e: Throwable => val errorMessage = s""" \|Failed to execute query using catalyst: \|Error: ${e.getMessage} \|${stackTraceToString(e)} \|$queryString \|$query \|== HIVE - ${hive.size} row(s) == \|${hive.mkString("\n")} """.stripMargin ``` ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #11825 from yhuai/SPARK-13972-follow-up.	2016-03-18 13:40:53 -07:00
Davies Liu	9c23c818ca	[SPARK-13977] [SQL] Brings back Shuffled hash join ## What changes were proposed in this pull request? ShuffledHashJoin (also outer join) is removed in 1.6, in favor of SortMergeJoin, which is more robust and also fast. ShuffledHashJoin is still useful in this case: 1) one table is much smaller than the other one, then cost to build a hash table on smaller table is smaller than sorting the larger table 2) any partition of the small table could fit in memory. This PR brings back ShuffledHashJoin, basically revert #9645, and fix the conflict. Also merging outer join and left-semi join into the same class. This PR does not implement full outer join, because it's not implemented efficiently (requiring build hash table on both side). A simple benchmark (one table is 5x smaller than other one) show that ShuffledHashJoin could be 2X faster than SortMergeJoin. ## How was this patch tested? Added new unit tests for ShuffledHashJoin. Author: Davies Liu <davies@databricks.com> Closes #11788 from davies/shuffle_join.	2016-03-18 10:32:53 -07:00
Wenchen Fan	0acb32a3f1	[SPARK-13972][SQ] hive tests should fail if SQL generation failed ## What changes were proposed in this pull request? Now we should be able to convert all logical plans to SQL string, if they are parsed from hive query. This PR changes the error handling to throw exceptions instead of just log. We will send new PRs for spotted bugs, and merge this one after all bugs are fixed. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11782 from cloud-fan/test.	2016-03-18 23:16:14 +08:00
Wenchen Fan	0f1015ffdd	[SPARK-14001][SQL] support multi-children Union in SQLBuilder ## What changes were proposed in this pull request? The fix is simple, use the existing `CombineUnions` rule to combine adjacent Unions before build SQL string. ## How was this patch tested? The re-enabled test Author: Wenchen Fan <wenchen@databricks.com> Closes #11818 from cloud-fan/bug-fix.	2016-03-18 19:42:33 +08:00
tedyu	90a1d8db70	[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10 PR #11696 introduced a complex pattern match that broke Scala 2.10 match unreachability check and caused build failure. This PR fixes this issue by expanding this pattern match into several simpler ones. Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case. Compilation against Scala 2.10 Author: tedyu <yuzhihong@gmail.com> Closes #11798 from yy2016/master.	2016-03-18 12:11:32 +08:00
Wenchen Fan	6037ed0a1d	[SPARK-13976][SQL] do not remove sub-queries added by user when generate SQL ## What changes were proposed in this pull request? We haven't figured out the corrected logical to add sub-queries yet, so we should not clear all sub-queries before generate SQL. This PR changed the logic to only remove sub-queries above table relation. an example for this bug, original SQL: `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1` before this PR, we will generate: ``` SELECT attr_1 AS a FROM SELECT attr_1 FROM ( SELECT a AS attr_1 FROM tbl ) AS sub_q0 WHERE attr_1 = 1 ``` We missed a sub-query and this SQL string is illegal. After this PR, we will generate: ``` SELECT attr_1 AS a FROM ( SELECT attr_1 FROM ( SELECT a AS attr_1 FROM tbl ) AS sub_q0 WHERE attr_1 = 1 ) AS t ``` TODO: for long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL string. ## How was this patch tested? `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11786 from cloud-fan/bug-fix.	2016-03-18 10:16:48 +08:00
Wenchen Fan	453455c479	[SPARK-13974][SQL] sub-query names do not need to be globally unique while generate SQL ## What changes were proposed in this pull request? We only need to make sub-query names unique every time we generate a SQL string, but not all the time. This PR moves the `newSubqueryName` method to `class SQLBuilder` and remove `object SQLBuilder`. also addressed 2 minor comments in https://github.com/apache/spark/pull/11696 ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11783 from cloud-fan/tmp.	2016-03-18 09:30:36 +08:00
Yin Huai	4c08e2c085	Revert "[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10" This reverts commit `3ee7996187`.	2016-03-17 11:16:03 -07:00
tedyu	3ee7996187	[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10 ## What changes were proposed in this pull request? Compilation against Scala 2.10 fails with: ``` [error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for unreachability. [error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.) [error] [warn] private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match { ``` ## How was this patch tested? Compilation against Scala 2.10 Author: tedyu <yuzhihong@gmail.com> Closes #11787 from yy2016/master.	2016-03-17 10:09:37 -07:00
Wenchen Fan	1974d1d34d	[SPARK-12719][SQL] SQL generation support for Generate ## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on https://github.com/apache/spark/pull/11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over https://github.com/apache/spark/pull/11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11696 from cloud-fan/generate.	2016-03-17 20:25:05 +08:00
Wenchen Fan	8ef3399aff	[SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.	2016-03-17 19:23:38 +08:00
Ryan Blue	5faba9facc	[SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors. This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf. I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)). Author: Ryan Blue <blue@apache.org> Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.	2016-03-16 22:57:06 -07:00
Daoyuan Wang	d1c193a2f1	[SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test ## What changes were proposed in this pull request? Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly. ## How was this patch tested? This patch will not affect the real code path. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #11758 from adrian-wang/spark12855.	2016-03-16 22:52:10 -07:00
Andrew Or	ca9ef86c84	[SPARK-13923][SQL] Implement SessionCatalog ## What changes were proposed in this pull request? As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`. A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands. ## How was this patch tested? 800+ lines of tests in `SessionCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11750 from andrewor14/temp-catalog.	2016-03-16 18:02:43 -07:00
gatorsmile	c4bd57602c	[SPARK-12721][SQL] SQL Generation for Script Transformation #### What changes were proposed in this pull request? This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`. For example, below is the SQL containing `Transform` ``` SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2 ``` Its logical plan is like ``` ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true) +- SubqueryAlias parquet_t2 +- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation ``` The generated SQL will be like ``` SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2` ``` #### How was this patch tested? Seven test cases are added to `LogicalPlanToSQLSuite`. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11503 from gatorsmile/transformToSQL.	2016-03-16 13:11:11 -07:00
Wenchen Fan	1d1de28a3c	[SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs while generate SQL string ## What changes were proposed in this pull request? This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs. To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore. For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like: ``` SELECT sq_1.key, sq_1.max FROM ( SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max FROM ( SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y ) AS sq_0 ) AS sq_1 ``` You can see, the `key` columns become ambiguous after `sq_0`. After this PR, it will generate something like: ``` SELECT attr_30 AS key, attr_37 AS max FROM ( SELECT attr_30, attr_37 FROM ( SELECT attr_30, attr_35, MAX(attr_35) AS attr_37 FROM ( SELECT attr_30, attr_35 FROM (SELECT key AS attr_30 FROM t1) AS sq_0 INNER JOIN (SELECT key AS attr_35 FROM t1) AS sq_1 ) AS sq_2 ) AS sq_3 ) AS sq_4 ``` The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore. ## How was this patch tested? existing tests and new tests in LogicalPlanToSQLSuite. Author: Wenchen Fan <wenchen@databricks.com> Closes #11658 from cloud-fan/gensql.	2016-03-16 11:57:28 -07:00
Cheng Hao	d9670f8473	[SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13894 Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`. ## How was this patch tested? No additional unit test required. Author: Cheng Hao <hao.cheng@intel.com> Closes #11730 from chenghao-intel/range.	2016-03-16 11:20:15 -07:00
Wenchen Fan	d9e8f26d03	[SPARK-13924][SQL] officially support multi-insert ## What changes were proposed in this pull request? There is a feature of hive SQL called multi-insert. For example: ``` FROM src INSERT OVERWRITE TABLE dest1 SELECT key + 1 INSERT OVERWRITE TABLE dest2 SELECT key WHERE key > 2 INSERT OVERWRITE TABLE dest3 SELECT col EXPLODE(arr) exp AS col ... ``` We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE. This PR removes these limitations and make us fully support multi-insert. ## How was this patch tested? new tests in `SQLQuerySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11754 from cloud-fan/lateral-view.	2016-03-16 10:52:36 -07:00
Sean Owen	3b461d9ecd	[SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up ## What changes were proposed in this pull request? Follow up to https://github.com/apache/spark/pull/11657 - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8` - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests) - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11725 from srowen/SPARK-13823.2.	2016-03-16 09:36:34 +00:00
Dongjoon Hyun	431a3d04b4	[SPARK-12653][SQL] Re-enable test "SPARK-8489: MissingRequirementError during reflection" ## What changes were proposed in this pull request? The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is re-enabling a regression test. Historically, the target regression test is added by [SPARK-8498](`093c34838d`), but is temporarily disabled by [SPARK-12615](`8ce645d4ee`) due to binary compatibility error. The following is the current error message at the submitting spark job with the pre-built `test.jar` file in the target regression test. ``` Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map; ``` Simple rebuilding `test.jar` can not recover the purpose of testcase since we need to support both Scala 2.10 and 2.11 for a while. For example, we will face the following Scala 2.11 error if we use `test.jar` built by Scala 2.10. ``` Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror; ``` This PR replace the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar` and improve the regression test to use the suitable jar file. ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11744 from dongjoon-hyun/SPARK-12653.	2016-03-16 09:05:53 +00:00
Davies Liu	421f6c20e8	[SPARK-13917] [SQL] generate broadcast semi join ## What changes were proposed in this pull request? This PR brings codegen support for broadcast left-semi join. ## How was this patch tested? Existing tests. Added benchmark, the result show 7X speedup. Author: Davies Liu <davies@databricks.com> Closes #11742 from davies/gen_semi.	2016-03-15 22:17:04 -07:00
Stavros Kontopoulos	50e3644d00	[SPARK-13896][SQL][STRING] Dataset.toJSON should return Dataset ## What changes were proposed in this pull request? Change the return type of toJson in Dataset class ## How was this patch tested? No additional unit test required. Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com> Closes #11732 from skonto/fix_toJson.	2016-03-15 12:18:30 -07:00
Reynold Xin	5e6f2f4563	[SPARK-13893][SQL] Remove SQLContext.catalog/analyzer (internal method) ## What changes were proposed in this pull request? Our internal code can go through SessionState.catalog and SessionState.analyzer. This brings two small benefits: 1. Reduces internal dependency on SQLContext. 2. Removes 2 public methods in Java (Java does not obey package private visibility). More importantly, according to the design in SPARK-13485, we'd need to claim this catalog function for the user-facing public functions, rather than having an internal field. ## How was this patch tested? Existing unit/integration test code. Author: Reynold Xin <rxin@databricks.com> Closes #11716 from rxin/SPARK-13893.	2016-03-15 10:12:32 -07:00
Reynold Xin	276c2d51a3	[SPARK-13890][SQL] Remove some internal classes' dependency on SQLContext ## What changes were proposed in this pull request? In general it is better for internal classes to not depend on the external class (in this case SQLContext) to reduce coupling between user-facing APIs and the internal implementations. This patch removes SQLContext dependency from some internal classes such as SparkPlanner, SparkOptimizer. As part of this patch, I also removed the following internal methods from SQLContext: ``` protected[sql] def functionRegistry: FunctionRegistry protected[sql] def optimizer: Optimizer protected[sql] def sqlParser: ParserInterface protected[sql] def planner: SparkPlanner protected[sql] def continuousQueryManager protected[sql] def prepareForExecution: RuleExecutor[SparkPlan] ``` ## How was this patch tested? Existing unit/integration tests. Author: Reynold Xin <rxin@databricks.com> Closes #11712 from rxin/sqlContext-planner.	2016-03-14 23:58:57 -07:00
Reynold Xin	e64958001c	[SPARK-13884][SQL] Remove DescribeCommand's dependency on LogicalPlan ## What changes were proposed in this pull request? This patch removes DescribeCommand's dependency on LogicalPlan. After this patch, DescribeCommand simply accepts a TableIdentifier. It minimizes the dependency, and blocks my next patch (removes SQLContext dependency from SparkPlanner). ## How was this patch tested? Should be covered by existing unit tests and Hive compatibility tests that run describe table. Author: Reynold Xin <rxin@databricks.com> Closes #11710 from rxin/SPARK-13884.	2016-03-14 23:09:10 -07:00
Michael Armbrust	17eec0a71b	[SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed. Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties: - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns in the public API of `org.apache.spark.sql.sources.FileFormat` - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf) - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning. - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm. Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API. A stub for `FileScanRDD` is also added, but most methods remain unimplemented. Other minor cleanups: - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore) - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out. - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes. Author: Michael Armbrust <michael@databricks.com> Closes #11646 from marmbrus/fileStrategy.	2016-03-14 19:21:12 -07:00
Marcelo Vanzin	8301fadd8d	[SPARK-13626][CORE] Avoid duplicate config deprecation warnings. Three different things were needed to get rid of spurious warnings: - silence deprecation warnings when cloning configuration - change the way SparkHadoopUtil instantiates SparkConf to silence warnings - avoid creating new SparkConf instances where it's not needed. On top of that, I changed the way that Logging.scala detects the repl; now it uses a method that is overridden in the repl's Main class, and the hack in Utils.scala is not needed anymore. This makes the 2.11 repl behave like the 2.10 one and set the default log level to WARN, which is a lot better. Previously, this wasn't working because the 2.11 repl triggers log initialization earlier than the 2.10 one. I also removed and simplified some other code in the 2.11 repl's Main to avoid replicating logic that already exists elsewhere in Spark. Tested the 2.11 repl in local and yarn modes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11510 from vanzin/SPARK-13626.	2016-03-14 14:27:33 -07:00
Dongjoon Hyun	acdf219703	[MINOR][DOCS] Fix more typos in comments/strings. ## What changes were proposed in this pull request? This PR fixes 135 typos over 107 files: * 121 typos in comments * 11 typos in testcase name * 3 typos in log messages ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11689 from dongjoon-hyun/fix_more_typos.	2016-03-14 09:07:39 +00:00
Sean Owen	1840852841	[SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items) ## What changes were proposed in this pull request? - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8 - Same for `InputStreamReader` and `OutputStreamWriter` constructors - Standardizes on UTF-8 everywhere - Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`) - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit `1deecd8d9c` ) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11657 from srowen/SPARK-13823.	2016-03-13 21:03:49 -07:00
Davies Liu	ba8c86d06f	[SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources ## What changes were proposed in this pull request? This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them. Also fix the problem for sameResult() on two DataSourceScan. Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad). ## How was this patch tested? Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan). Author: Davies Liu <davies@databricks.com> Closes #11514 from davies/existing_rdd.	2016-03-12 00:48:36 -08:00
Andrew Or	66d9d0edfe	[SPARK-13139][SQL] Parse Hive DDL commands ourselves ## What changes were proposed in this pull request? This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`. Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog. ## How was this patch tested? Jenkins, new `DDLCommandSuite`, which comprises of about 40% of the changes here. Author: Andrew Or <andrew@databricks.com> Closes #11573 from andrewor14/parser-plus-plus.	2016-03-11 15:13:48 -08:00
Cheng Lian	6d37e1eb90	[SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame ## What changes were proposed in this pull request? PR #11443 temporarily disabled MiMA check, this PR re-enables it. One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes. ## How was this patch tested? Tested by MiMA check triggered by Jenkins. Author: Cheng Lian <lian@databricks.com> Closes #11656 from liancheng/re-enable-mima.	2016-03-11 22:17:50 +08:00
Wenchen Fan	74c4e2651f	[HOT-FIX] fix compile Fix the compilation failure introduced by https://github.com/apache/spark/pull/11555 because of a merge conflict. Author: Wenchen Fan <wenchen@databricks.com> Closes #11648 from cloud-fan/hotbug.	2016-03-11 13:52:11 +08:00
Wenchen Fan	6871cc8f3e	[SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions ## What changes were proposed in this pull request? Add SQL generation support for window functions. The idea is simple, just treat `Window` operator like `Project`, i.e. add subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, implement `sql` method for window related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc. This PR also fixed SPARK-13720 by improving the process of adding extra `SubqueryAlias`(the `RecoverScopingInfo` rule). Before this PR, we update the qualifiers in project list while adding the subquery. However, this is incomplete as we need to update qualifiers in all ancestors that refer attributes here. In this PR, we split `RecoverScopingInfo` into 2 rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only add subquery if necessary, and `UpdateQualifier` will re-propagate and update qualifiers bottom up. Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here. Many thanks to gatorsmile for the initial discussion and test cases! ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11555 from cloud-fan/window.	2016-03-11 13:22:34 +08:00
Cheng Lian	1d542785b9	[SPARK-13244][SQL] Migrates DataFrame to Dataset ## What changes were proposed in this pull request? This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`. Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`). There are several noticeable API changes related to those returning arrays: 1. `collect`/`take` - Old APIs in class `DataFrame`: ```scala def collect(): Array[Row] def take(n: Int): Array[Row] ``` - New APIs in class `Dataset[T]`: ```scala def collect(): Array[T] def take(n: Int): Array[T] def collectRows(): Array[Row] def takeRows(n: Int): Array[Row] ``` Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side. Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here). 1. `randomSplit` - Old APIs in class `DataFrame`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] def randomSplit(weights: Array[Double]): Array[DataFrame] ``` - New APIs in class `Dataset[T]`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]] def randomSplit(weights: Array[Double]): Array[Dataset[T]] ``` Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one. 1. `groupBy` Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`. Other noticeable changes: 1. Dataset always do eager analysis now We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`. ## How was this patch tested? Existing tests do the work. ## TODO - [ ] Fix all tests - [ ] Re-enable MiMA check - [ ] Update ScalaDoc (`since`, `group`, and example code) Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Closes #11443 from liancheng/ds-to-df.	2016-03-10 17:00:17 -08:00
Dongjoon Hyun	91fed8e9c5	[SPARK-3854][BUILD] Scala style: require spaces before `{`. ## What changes were proposed in this pull request? Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time. ``` // Correct: if (true) { println("Wow!") } // Incorrect: if (true){ println("Wow!") } ``` IntelliJ also shows new warnings based on this. ## How was this patch tested? Pass the Jenkins ScalaStyle test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11637 from dongjoon-hyun/SPARK-3854.	2016-03-10 15:57:22 -08:00
Sameer Agarwal	19f4ac6dc7	[SPARK-13759][SQL] Add IsNotNull constraints for expressions with an inequality ## What changes were proposed in this pull request? This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null. ## How was this patch tested? 1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality. 2. Added a test in `NullFilteringSuite` for making sure an Inner join with a "non-equal" condition appropriately filters out null from their input. cc nongli Author: Sameer Agarwal <sameer@databricks.com> Closes #11594 from sameeragarwal/isnotequal-constraints.	2016-03-10 12:16:46 -08:00
hyukjinkwon	aa0eba2c35	[SPARK-13766][SQL] Consistent file extensions for files written by internal data sources ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13766 This PR makes the file extensions (written by internal datasource) consistent. Before - TEXT, CSV and JSON ``` [.COMPRESSION_CODEC_NAME] ``` - Parquet ``` [.COMPRESSION_CODEC_NAME].parquet ``` - ORC ``` .orc ``` After - TEXT, CSV and JSON ``` .txt[.COMPRESSION_CODEC_NAME] .csv[.COMPRESSION_CODEC_NAME] .json[.COMPRESSION_CODEC_NAME] ``` - Parquet ``` [.COMPRESSION_CODEC_NAME].parquet ``` - ORC ``` [.COMPRESSION_CODEC_NAME].orc ``` When the compression codec is set, - For Parquet and ORC, each still stays in Parquet and ORC format but just have compressed data internally. So, I think it is okay to name `.parquet` and `.orc` at the end. - For Text, CSV and JSON, each does not stays in each format but it has different data format according to compression codec. So, each has the names `.json`, `.csv` and `.txt` before the compression extension. ## How was this patch tested? Unit tests are used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11604 from HyukjinKwon/SPARK-13766.	2016-03-09 19:12:46 -08:00
hyukjinkwon	cad29a40b2	[SPARK-13728][SQL] Fix ORC PPD test so that pushed filters can be checked. ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13728 https://github.com/apache/spark/pull/11509 makes the output only single ORC file. It was 10 files but this PR writes only single file. So, this could not skip stripes in ORC by the pushed down filters. So, this PR simply repartitions data into 10 so that the test could pass. ## How was this patch tested? unittest and `./dev/run_tests` for code style test. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11593 from HyukjinKwon/SPARK-13728.	2016-03-09 10:48:53 -08:00
Dilip Biswal	53ba6d6e59	[SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Generate ## What changes were proposed in this pull request? Analysis exception occurs while running the following query. ``` SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints` ``` ``` Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7 'Project ['ints] +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8] +- SubqueryAlias nestedarray +- LocalRelation [a#0], [[[[1,2,3]]]] ``` ## How was this patch tested? Added new unit tests in SQLQuerySuite and HiveQlSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11538 from dilipbiswal/SPARK-13698.	2016-03-09 21:49:37 +08:00
Dongjoon Hyun	c3689bc24e	[SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. ## What changes were proposed in this pull request? In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator. ``` - final ArrayList<Product2<Object, Object>> dataToWrite = - new ArrayList<Product2<Object, Object>>(); + final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>(); ``` Java 7 or higher supports diamond operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this. ## How was this patch tested? Manual. Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11541 from dongjoon-hyun/SPARK-13702.	2016-03-09 10:31:26 +00:00
Jakob Odersky	035d3acdf3	[SPARK-7286][SQL] Deprecate !== in favour of =!= This PR replaces #9925 which had issues with CI. Please see the original PR for any previous discussions. ## What changes were proposed in this pull request? Deprecate the SparkSQL column operator !== and use =!= as an alternative. Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===). ## How was this patch tested? All currently existing tests. Author: Jakob Odersky <jodersky@gmail.com> Closes #11588 from jodersky/SPARK-7286.	2016-03-08 18:11:09 -08:00
Michael Armbrust	1e28840594	[SPARK-13738][SQL] Cleanup Data Source resolution Follow-up to #11509, that simply refactors the interface that we use when resolving a pluggable `DataSource`. - Multiple functions share the same set of arguments so we make this a case class, called `DataSource`. Actual resolution is now done by calling a function on this class. - Instead of having multiple methods named `apply` (some of which do writing some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`. - Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places. Author: Michael Armbrust <michael@databricks.com> Closes #11572 from marmbrus/dataSourceResolution.	2016-03-08 15:19:26 -08:00
Dongjoon Hyun	076009b949	[SPARK-13400] Stop using deprecated Octal escape literals ## What changes were proposed in this pull request? This removes the remaining deprecated Octal escape literals. The followings are the warnings on those two lines. ``` LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead. HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead. ``` ## How was this patch tested? Manual. During building, there should be no warning on `Octal escape literals`. ``` mvn -DskipTests clean install ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11584 from dongjoon-hyun/SPARK-13400.	2016-03-08 15:00:26 -08:00
Wenchen Fan	46881b4ea2	[SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct ## What changes were proposed in this pull request? This PR add SQL generation support for aggregate with multi-distinct, by simply moving the `DistinctAggregationRewriter` rule to optimizer. More discussions are needed as this breaks an import contract: analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rules, because strictly speaking they don't belong to analysis and is coupled with the physical plan implementation. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #11579 from cloud-fan/distinct.	2016-03-08 11:45:08 -08:00
Andrew Or	da7bfac488	[SPARK-13689][SQL] Move helper things in CatalystQl to new utils object ## What changes were proposed in this pull request? When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object. This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller. ## How was this patch tested? No change in functionality, so just Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #11529 from andrewor14/parser-utils.	2016-03-07 18:01:27 -08:00
Tim Preece	46f25c2413	[SPARK-13648] Add Hive Cli to classes for isolated classloader ## What changes were proposed in this pull request? Adding the hive-cli classes to the classloader ## How was this patch tested? The hive Versionssuite tests were run This is my original work and I license the work to the project under the project's open source license. Author: Tim Preece <tim.preece.in.oz@gmail.com> Closes #11495 from preecet/master.	2016-03-07 15:23:07 -08:00
Michael Armbrust	e720dda42e	[SPARK-13665][SQL] Separate the concerns of HadoopFsRelation `HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0. ### HadoopFsRelation A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers. ```scala case class HadoopFsRelation( sqlContext: SQLContext, location: FileCatalog, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String]) extends BaseRelation ``` ### FileFormat The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files. ```scala trait FileFormat { def inferSchema( sqlContext: SQLContext, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] def prepareWrite( sqlContext: SQLContext, job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory def buildInternalScan( sqlContext: SQLContext, dataSchema: StructType, requiredColumns: Array[String], filters: Array[Filter], bucketSet: Option[BitSet], inputFiles: Array[FileStatus], broadcastedConf: Broadcast[SerializableConfiguration], options: Map[String, String]): RDD[InternalRow] } ``` The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file. ### FileCatalog This interface is used to list the files that make up a given relation, as well as handle directory based partitioning. ```scala trait FileCatalog { def paths: Seq[Path] def partitionSpec(schema: Option[StructType]): PartitionSpec def allFiles(): Seq[FileStatus] def getStatus(path: Path): Array[FileStatus] def refresh(): Unit } ``` Currently there are two implementations: - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore. ### ResolvedDataSource Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore): - `paths: Seq[String] = Nil` - `userSpecifiedSchema: Option[StructType] = None` - `partitionColumns: Array[String] = Array.empty` - `bucketSpec: Option[BucketSpec] = None` - `provider: String` - `options: Map[String, String]` This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here. ### DataSourceAnalysis / DataSourceStrategy Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including: - pruning the files from partitions that will be read based on filters. - appending partition columns* - applying additional filters when a data source can not evaluate them internally. - constructing an RDD that is bucketed correctly when required* - sanity checking schema match-up and other analysis when writing. *In the future we should do that following: - Break out file handling into its own Strategy as its sufficiently complex / isolated. - Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization. - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2` Author: Michael Armbrust <michael@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #11509 from marmbrus/fileDataSource.	2016-03-07 15:15:10 -08:00
Sameer Agarwal	ef77003178	[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints ## What changes were proposed in this pull request? This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters is the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints. Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins. ## How was this patch tested? 1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules. 2. Test generated ExpressionTrees via `OrcFilterSuite` 3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite` cc yhuai nongli Author: Sameer Agarwal <sameer@databricks.com> Closes #11372 from sameeragarwal/gen-isnotnull.	2016-03-07 12:04:59 -08:00
Dilip Biswal	d7eac9d795	[SPARK-13651] Generator outputs are not resolved correctly resulting in run time error ## What changes were proposed in this pull request? ``` Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src") sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value") ``` Results in following logical plan ``` Project [key#2,value#3] +- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3] +- SubqueryAlias src +- Project [_1#0 AS key#2,_2#1 AS value#3] +- LocalRelation [_1#0,_2#1], [[id1,value1]] ``` The above query fails with following runtime error. ``` java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42) at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98) at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) <stack-trace omitted.....> ``` In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548 ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Added unit tests in hive/SQLQuerySuite and AnalysisSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11497 from dilipbiswal/spark-13651.	2016-03-07 09:46:28 -08:00

1 2 3 4 5 ...

1003 commits