## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/22275 introduced a performance improvement where we send partitions out of order to Python and then, as a last step, send the partition order as well.
However, if there are no partitions, we never send the partition order, and the Python side fails with an `EOFError`.
This PR fixes this by also sending the partition order when no partitions are present.
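A minimal sketch of the idea, assuming a made-up stream layout (not the actual `ArrowConverters` code): write batches in completion order, and always write the partition order as a trailer, even for an empty result.
```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

def serveBatches(batches: Seq[(Int, Array[Byte])]): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new DataOutputStream(bytes)
  val arrivalOrder = ArrayBuffer.empty[Int]
  batches.foreach { case (partitionId, payload) =>
    out.writeInt(payload.length)
    out.write(payload)
    arrivalOrder += partitionId
  }
  // End-of-data marker followed by the partition order; this is written unconditionally,
  // so the reading side never hits EOF waiting for it when there are no partitions.
  out.writeInt(0)
  arrivalOrder.foreach(out.writeInt)
  out.flush()
  bytes.toByteArray
}
```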
## How was this patch tested?
New unit test added.
Closes#24650 from dvogelbacher/dv/fixNoPartitionArrowConversion.
Authored-by: David Vogelbacher <dvogelbacher@palantir.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
When converting a Dataset to another Dataset, Spark up-casts the fields of the original Dataset to the types of the corresponding fields in the target Dataset.
However, the current up-cast behavior is inconsistent: we don't allow up-casting from string to numeric types, but we do allow non-numeric target types such as boolean and date.
As a result, `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` passes analysis and throws an NPE during execution.
Since the motivation of the up-cast check is to prevent runtime failures such as NPEs, it's more reasonable to make the up-cast stricter.
This PR does 2 things:
1. rename `Cast.canSafeCast` to `Cast.canUpcast`, and support complex types
2. remove `Cast.mayTruncate` and replace it with `!Cast.canUpcast`
Note that, the up cast change also affects persistent view resolution. But since we don't support changing column types of an existing table, there is no behavior change here.
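An illustration of the stricter behavior described above (assumes a SparkSession named `spark` with its implicits imported):
```scala
import spark.implicits._

Seq("str").toDS.as[Int]      // already failed before this change: String cannot be up-cast to Int
Seq("str").toDS.as[Boolean]  // previously passed analysis and threw an NPE at runtime;
                             // with this change it also fails at analysis time
```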
## How was this patch tested?
new tests
Closes#21586 from cloud-fan/cast.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
To return accurate pushed filters in Parquet file scan (https://github.com/apache/spark/pull/24327#pullrequestreview-234775673), we can process the original data source filters in the following way:
1. For "And" operators, split the conjunctive predicates and try converting each of them. After that:
1.1 if partial predicate push-down is allowed, return the convertible results;
1.2 otherwise, return the whole predicate if it is convertible, or an empty result if it is not.
2. For "Or" operators, if both children can be pushed down, the predicate is partially or totally convertible; otherwise, return an empty result.
3. For other operators, they cannot be partially pushed down:
3.1 if the entire predicate is convertible, return it as-is;
3.2 otherwise, return an empty result.
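A toy sketch of this strategy over a simplified predicate ADT (the names are illustrative, not the actual `ParquetFilters` API):
```scala
// Simplified predicate ADT standing in for data source filters.
sealed trait Pred
case class Leaf(name: String, convertible: Boolean) extends Pred
case class And(left: Pred, right: Pred) extends Pred
case class Or(left: Pred, right: Pred) extends Pred

def convert(p: Pred, allowPartial: Boolean): Option[Pred] = p match {
  case And(l, r) =>
    (convert(l, allowPartial), convert(r, allowPartial)) match {
      case (Some(cl), Some(cr)) => Some(And(cl, cr))          // whole conjunction converts
      case (some @ Some(_), None) if allowPartial => some      // 1.1: keep the convertible half
      case (None, some @ Some(_)) if allowPartial => some
      case _ => None                                           // 1.2: all or nothing
    }
  case Or(l, r) =>
    // 2: Or is only convertible if both children convert.
    for { cl <- convert(l, allowPartial); cr <- convert(r, allowPartial) } yield Or(cl, cr)
  case leaf @ Leaf(_, ok) =>
    if (ok) Some(leaf) else None                               // 3.1 / 3.2: no partial push-down
}
```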
This PR also contains code refactoring. Currently `ParquetFilters.createFilter` accepts a `schema: MessageType` parameter and builds a field mapping for every input filter. We can instead make the mapping a class member and avoid re-creating `nameToParquetField` for every input filter.
## How was this patch tested?
Unit test
Closes#24597 from gengliangwang/refactorParquetFilters.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
If we refresh a cached table, the table cache will first be uncached and then recached (lazily). Currently, this logic is embedded in the `CatalogImpl.refreshTable` method.
The current implementation does not preserve the cache name and storage level. As a result, the cache name and storage level could change after a REFRESH, which is not what a user would expect.
I would like to fix this behavior by first saving the cache name and storage level, and then reusing them when recaching the table.
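A rough sketch of the intended logic, assuming the internal `CacheManager` API (`lookupCachedData`, `uncacheQuery`, `cacheQuery`) and the `cacheBuilder` fields of `InMemoryRelation`; this is not the exact `CatalogImpl` code and touches `private[sql]` members, so it only illustrates the idea:
```scala
import org.apache.spark.sql.SparkSession

def refreshAndRecache(spark: SparkSession, tableName: String): Unit = {
  val table = spark.table(tableName)
  val cacheManager = spark.sharedState.cacheManager
  cacheManager.lookupCachedData(table).foreach { cached =>
    // Remember the original cache name and storage level before uncaching ...
    val cacheName = cached.cachedRepresentation.cacheBuilder.tableName
    val storageLevel = cached.cachedRepresentation.cacheBuilder.storageLevel
    cacheManager.uncacheQuery(table, cascade = true)
    // ... and reuse them when (lazily) re-caching the refreshed table.
    cacheManager.cacheQuery(spark.table(tableName), cacheName, storageLevel)
  }
}
```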
Two unit tests are added to make sure the cache name is unchanged upon table refresh. Before applying this patch, the test created for the qualified table name case would fail.
Closes#24221 from William1104/feature/SPARK-27248.
Lead-authored-by: williamwong <william1104@gmail.com>
Co-authored-by: William Wong <william1104@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Because a temporary view is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Right now the explain result of a dataset may not be consistent with its collected result, because we use the pre-analyzed logical plan of the dataset in the explain command. The explain command analyzes the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by the explain command aren't the same as the plan of the dataset.
```scala
scala> spark.range(10).createOrReplaceTempView("test")
scala> spark.range(5).createOrReplaceTempView("test2")
scala> spark.sql("select * from test").createOrReplaceTempView("tmp001")
scala> val df = spark.sql("select * from tmp001")
scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001")
scala> df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
scala> df.explain(true)
```
Before:
```scala
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`
== Analyzed Logical Plan ==
id: bigint
Project [id#2L]
+- SubqueryAlias `tmp001`
+- Project [id#2L]
+- SubqueryAlias `test2`
+- Range (0, 5, step=1, splits=Some(12))
== Optimized Logical Plan ==
Range (0, 5, step=1, splits=Some(12))
== Physical Plan ==
*(1) Range (0, 5, step=1, splits=12)
```
After:
```scala
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- SubqueryAlias `tmp001`
+- Project [id#0L]
+- SubqueryAlias `test`
+- Range (0, 10, step=1, splits=Some(12))
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12))
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```
The previous PR for this issue had a regression when explaining an explain statement, such as `sql("explain select 1").explain(true)`. This new fix follows hvanhovell's advice at https://github.com/apache/spark/pull/24464#issuecomment-494165538.
Explain an explain:
```scala
scala> sql("explain select 1").explain(true)
== Parsed Logical Plan ==
ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false
== Analyzed Logical Plan ==
plan: string
ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false
== Optimized Logical Plan ==
ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false
== Physical Plan ==
Execute ExplainCommand
+- ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false
```
Btw, I found there is a regression after applying hvanhovell's advice:
```scala
spark.readStream
.format("org.apache.spark.sql.streaming.test")
.load()
.explain(true)
```
```scala
== Parsed Logical Plan ==
StreamingRelation DataSource(org.apache.spark.sql.test.TestSparkSession3e8c7175,org.apache.spark.sql.streaming.test,List(),None,List(),None,Map(),None
), dummySource, [a#559]
== Analyzed Logical Plan ==
a: int
StreamingRelation DataSource(org.apache.spark.sql.test.TestSparkSession3e8c7175,org.apache.spark.sql.streaming.test,List(),None,List(),None,Map(),Non$
), dummySource, [a#559]
== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
dummySource
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
dummySource
```
So I made a change to fix that as well.
## How was this patch tested?
Added test and manually test.
Closes#24654 from viirya/SPARK-27439-3.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Update the docs to reflect the changes made by https://github.com/apache/spark/pull/24129
## How was this patch tested?
N/A
Closes#24658 from cloud-fan/comment.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
When running a custom build of Spark which shades `commons-codec`, the `Sha1` expression generates code which fails to compile:
```
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 93: A method named "sha1Hex" is not declared in any enclosing class nor any supertype, nor through a static import
```
This is caused by an interaction between Spark's code generator and the shading: the current codegen template includes the string `org.apache.commons.codec.digest.DigestUtils.sha1Hex` as part of a larger string literal, preventing JarJarLinks from being able to replace the class name with the shaded class's name. As a result, the generated code still references the original unshaded class name, triggering an error in case the original unshaded dependency isn't on the classpath.
This problem impacts the `Sha1`, `Md5`, and `Base64` expressions.
To fix this problem and allow for proper shading, this PR updates the codegen templates to replace the hardcoded class names with `${classOf[<name>].getName}` calls.
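A simplified before/after of the template change (illustrative only; the real templates live in the `Sha1`, `Md5`, and `Base64` expressions):
```scala
import org.apache.commons.codec.digest.DigestUtils

def sha1Code(childVar: String): String =
  // Before (problematic): the fully-qualified name sits inside the string literal,
  // so a shading tool cannot relocate it:
  //   s"UTF8String.fromString(org.apache.commons.codec.digest.DigestUtils.sha1Hex($childVar))"
  // After: the name is resolved from the class reference at Spark's own compile time,
  // so the shaded (relocated) name ends up in the generated code:
  s"UTF8String.fromString(${classOf[DigestUtils].getName}.sha1Hex($childVar))"
```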
## How was this patch tested?
Existing tests.
To ensure that I found all occurrences of this problem, I used IntelliJ's "Find in Path" to search for lines matching the regex `^(?!import|package).*(org|com|net|io)\.(?!apache\.spark)` and then filtered matches to inspect only non-test "Usage in string constants" cases. This isn't _perfect_ but I think it'll catch most cases.
Closes#24655 from JoshRosen/fix-shaded-apache-commons.
Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in https://github.com/apache/spark/pull/24389
This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied.
The planner leverages this feature to put the logical plan into the physical plan.
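A toy sketch of the idea (illustrative, not the actual `TreeNode` code): a node carries a mutable map of extra information, and copies can propagate that map.
```scala
import scala.collection.mutable

// A named, typed key for a piece of extra information.
case class Tag[T](name: String)

class Node {
  private val tags = mutable.Map.empty[Tag[_], Any]

  def setTag[T](tag: Tag[T], value: T): Unit = tags(tag) = value
  def getTag[T](tag: Tag[T]): Option[T] = tags.get(tag).map(_.asInstanceOf[T])

  // Called when a node is copied, so the extra information survives the copy.
  def copyTagsFrom(other: Node): Unit = tags ++= other.tags
}

// The planner can then attach the originating logical plan to a physical plan node,
// e.g. physicalNode.setTag(logicalPlanTag, logicalPlan).
```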
## How was this patch tested?
a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans.
Closes#24626 from cloud-fan/link.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Peng Bo <bo.peng1019@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR fixes a `hadoop-3.2` test error:
```
- SPARK-27699 Converting disjunctions into ORC SearchArguments *** FAILED ***
Expected "...SS_THAN_EQUALS a 10)[
leaf-1 = (LESS_THAN a 1)
]expr = (or (not leaf...", but got "...SS_THAN_EQUALS a 10)[, leaf-1 = (LESS_THAN a 1), ]expr = (or (not leaf..." (HiveOrcFilterSuite.scala:445)
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105514/consoleFull
## How was this patch tested?
N/A
Closes#24639 from wangyum/SPARK-27699.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR adds support for collecting statistics in CTAS (creating a data source table using the result of a query).
## How was this patch tested?
unit tests and manual tests:
```sql
bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true -S
spark-sql> CREATE TABLE spark_27694 USING parquet AS SELECT 'a', 'b';
spark-sql> DESC FORMATTED spark_27694;
a string NULL
b string NULL
# Detailed Table Information
Database default
Table spark_27694
Owner root
Created Time Mon May 13 19:45:33 GMT-07:00 2019
Last Access Wed Dec 31 17:00:00 GMT-07:00 1969
Created By Spark 3.0.0-SNAPSHOT
Type MANAGED
Provider parquet
Statistics 561 bytes
Location file:/user/hive/warehouse/spark_27694
Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
Closes#24596 from wangyum/SPARK-27694.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Add a SQL config property for the default v2 catalog.
Existing tests for regressions.
Closes#24594 from rdblue/SPARK-27693-add-default-catalog-config.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Both look to have been added as of 2.0 (see SPARK-12541 and SPARK-12706). I referred to existing docs and examples in other API docs.
## How was this patch tested?
Manually built the documentation, and verified by running the examples and `DESCRIBE FUNCTION EXTENDED`.
Closes#24642 from HyukjinKwon/SPARK-27771.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
If we set `hive.exec.stagingdir=.test-staging\tmp`, the staging directory is still `.hive-staging` on Windows.
Reason for the failure:
Test code:
Test code:
```
val path = new Path("C:\\test\\hivetable")
println("path.toString: " + path.toString)
println("path.toUri.getPath: " + path.toUri.getPath)
```
Output:
```
path.toString: C:/test/hivetable
path.toUri.getPath: /C:/test/hivetable
```
We can see that `path.toUri.getPath` has one more leading separator than `path.toString`, and the separator is `/`, not `\`.
So `stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")` returns false on Windows.
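A sketch of a prefix check that behaves the same on Windows and Unix, assuming both paths are normalized through `toUri.getPath` before comparison (illustrative, not the exact staging-directory code):
```scala
import org.apache.hadoop.fs.Path

// Returns true when the staging dir is a dot-prefixed child of the input path,
// regardless of the OS path separator.
def isHiddenUnder(staging: Path, input: Path): Boolean = {
  val stagingName = staging.toUri.getPath   // e.g. /C:/test/hivetable/.test-staging on Windows
  val inputName   = input.toUri.getPath     // e.g. /C:/test/hivetable
  // Strip the input prefix and the URI separator ("/", never "\"), then test for the dot prefix.
  stagingName.stripPrefix(inputName).stripPrefix("/").startsWith(".")
}
```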
## How was this patch tested?
1. Existed tests
2. Manual testing on Windows OS
Closes#24446 from 10110346/stagingdir.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Currently, in `ParquetFilters` and `OrcFilters`, if the child predicate of `Or` operator can't be entirely pushed down, the predicates will be thrown away.
In fact, the conjunctive predicates under `Or` operators can be partially pushed down.
For example, say `a` and `b` are convertible while `c` can't be pushed down; then the predicate
`a or (b and c)`
can be converted as
`(a or b) and (a or c)`
We can still push down `(a or b)`.
A disjunctive predicate can't be pushed down only when one of its children is not even partially convertible.
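A sketch of the rewrite using the public `sources.Filter` API (which filters are convertible depends on the data source; the choice of `c` here is just for illustration):
```scala
import org.apache.spark.sql.sources._

val a: Filter = GreaterThan("x", 10)
val b: Filter = LessThan("y", 5)
val c: Filter = StringContains("z", "foo")   // suppose this one cannot be pushed down

// a OR (b AND c)  ==  (a OR b) AND (a OR c)
val original    = Or(a, And(b, c))
val distributed = And(Or(a, b), Or(a, c))    // push Or(a, b); evaluate Or(a, c) after the scan
```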
This PR also improves the filter push-down logic in `DataSourceV2Strategy`. With partial filter push-down in `Or` operators, the result of `pushedFilters()` might not exist in the mapping `translatedFilterToExpr`. To fix it, this PR changes `translatedFilterToExpr` to map leaf filter expressions to `sources.Filter`, and later rebuilds the whole expression with that mapping.
## How was this patch tested?
Unit test
Closes#24598 from gengliangwang/pushdownDisjunctivePredicates.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of input query is compatible with the table.
However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of input query at all.
This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.
Note that, we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.
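A minimal sketch of a table that opts out of the output-column check by reporting the capability (class and package names as of the final Spark 3.0 DSv2 API; they moved around during development):
```scala
import java.util.{Set => JSet}
import scala.collection.JavaConverters._
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// A sink table that accepts whatever schema the input query has.
class AnySchemaSink extends Table {
  override def name(): String = "any_schema_sink"
  override def schema(): StructType = new StructType()   // the input query's schema is not checked
  override def capabilities(): JSet[TableCapability] =
    Set(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA).asJava
}
```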
## How was this patch tested?
new test cases
Closes#24469 from cloud-fan/schema-check.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Some APIs in Structured Streaming require the user to specify an interval. Right now these APIs don't accept upper-case strings.
This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support parsing upper-case strings, and fixes all APIs that need to parse an interval string.
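An example of the parsing this change enables (`fromCaseInsensitiveString` lives on the internal `CalendarInterval` class; the streaming snippet in the comment assumes a streaming DataFrame `df`):
```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Upper- and lower-case interval strings now parse to the same value.
val upper = CalendarInterval.fromCaseInsensitiveString("1 DAY")
val lower = CalendarInterval.fromCaseInsensitiveString("1 day")
assert(upper == lower)

// Streaming APIs that take interval strings accept upper case as well, e.g.:
// df.writeStream.trigger(org.apache.spark.sql.streaming.Trigger.ProcessingTime("10 SECONDS"))
```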
## How was this patch tested?
The new unit test.
Closes#24619 from zsxwing/SPARK-27735.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Removed the unused `UnsafeKeyValueSorter.java` file.
## How was this patch tested?
Ran Compilation and UT locally.
Closes#24622 from shivusondur/jira27722.
Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/20365 .
#20365 fixed this problem when the hint node is a root node. This PR fixes this problem for all the cases.
## How was this patch tested?
a new test
Closes#24580 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
After upgrading the built-in Hive to 2.3.4, the current `hive-thriftserver` module is no longer compatible, due to Hive changes such as:
1. [HIVE-12442](https://issues.apache.org/jira/browse/HIVE-12442) HiveServer2: Refactor/repackage HiveServer2's Thrift code so that it can be used in the tasks
2. [HIVE-12237](https://issues.apache.org/jira/browse/HIVE-12237) Use slf4j as logging facade
3. [HIVE-13169](https://issues.apache.org/jira/browse/HIVE-13169) HiveServer2: Support delegation token based connection when using http transport
So this PR moves the incompatible code to `sql/hive-thriftserver/v1.2.1` and copies it to `sql/hive-thriftserver/v2.3.4` for the next code review.
## How was this patch tested?
manual tests:
```
diff -urNa sql/hive-thriftserver/v1.2.1 sql/hive-thriftserver/v2.3.4
```
Closes#24282 from wangyum/SPARK-27354.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In the existing code, a broadcast execution timeout for the Future only causes a query failure, but the job running with the broadcast and the computation in the Future are not canceled. This wastes resources and slows down the other jobs. This PR tries to cancel both the running job and the running hashed relation construction thread.
## How was this patch tested?
Add new test suite `BroadcastExchangeExec`
Closes#24595 from jiangxb1987/SPARK-20774.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This replaces uses of collection classes like `MutableList` and `ArrayStack` with workalikes that are available in 2.12, as they will be removed in 2.13. It also removes uses of `.to[Collection]`, as they were superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13.
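Illustrative replacements of the kind described above (these are assumed examples, not an exhaustive list of the changes in this PR):
```scala
import scala.collection.mutable

// mutable.MutableList (removed in 2.13) -> ListBuffer / ArrayBuffer
val buf = mutable.ArrayBuffer.empty[Int]
buf += 1

// mutable.ArrayStack (removed in 2.13) -> e.g. an ArrayBuffer used as a stack
val stack = mutable.ArrayBuffer.empty[Int]
stack.prepend(42)          // push; stack.remove(0) would pop

// `.to[List]`-style conversions -> plain .toList / .toSeq
val xs = (1 to 3).toList
```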
## How was this patch tested?
Existing tests
Closes#24586 from srowen/SPARK-27682.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
`RecordBinaryComparator`, `UnsafeExternalRowSorter` and `UnsafeKeyValueSorter` are currently located in catalyst; they should be moved to core, as they're used only in physical plans.
## How was this patch tested?
Existing tests.
Closes#24607 from xianyinxin/SPARK-27713.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This adds a v2 implementation for CTAS queries
* Update the SQL parser to parse CREATE queries using multi-part identifiers
* Update `CheckAnalysis` to validate partitioning references with the CTAS query schema
* Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan
* Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan
* Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan
* Add `findNestedField` to `StructType` to support reference validation
## How was this patch tested?
We have been running these changes in production for several months. Also:
* Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks
* Add a test suite for v2 SQL, `DataSourceV2SQLSuite`
* Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`)
* Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation
Closes#24570 from rdblue/SPARK-24923-add-v2-ctas.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
We should remove materialized views first; otherwise we hit the following error (note that Hive 3.1 can reproduce this issue):
```scala
Cause: org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'MV_TABLES_USED_FK2' for key (4). The statement has been rolled back.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeBatchElement(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeBatch(Unknown Source)
```
## How was this patch tested?
Existing test
Closes#24592 from wangyum/SPARK-27690.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This removes usage of `Traversable`, which is removed in Scala 2.13. This is mostly an internal change, except for the change in the `SparkConf.setAll` method. See additional comments below.
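The one user-visible signature change is `SparkConf.setAll`, which now takes an `Iterable` instead of a `Traversable`; existing call sites that pass a `Seq` or `Map` keep compiling. A sketch of such a call:
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// A Seq satisfies both the old Traversable and the new Iterable parameter type.
conf.setAll(Seq("spark.app.name" -> "demo", "spark.master" -> "local[*]"))
```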
## How was this patch tested?
Existing tests.
Closes#24584 from srowen/SPARK-27680.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The example below works with both MySQL and Hive, but not with Spark.
```
mysql> select * from date_test where date_col >= '2000-1-1';
+------------+
| date_col |
+------------+
| 2000-01-01 |
+------------+
```
The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420.
Based on some tests, the behavior of date and string comparison in Hive and MySQL is:
Hive: casts to Date; partial dates are not supported.
MySQL: casts to Date; certain "partial dates" are supported via specific date-string parsing rules. Check out str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
Since the date patterns below are supported, this PR casts the string to a date when comparing a string with a date:
```
`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*`
```
## How was this patch tested?
UT has been added
Closes#24567 from pengbo/SPARK-27638.
Authored-by: mingbo.pb <mingbo.pb@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Currently, when there is a null in a nested field of a struct, casting the struct throws an error.
```scala
scala> sql("select cast(struct(1, null) as struct<a:int,b:int>)").show
scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447)
at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)
at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
```
Similarly, an inline table, which casts nulls in nested fields under the hood, also throws an error.
```scala
scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', null)) AS tab(x, y)").show
org.apache.spark.sql.AnalysisException: failed to evaluate expression named_struct('col1', 10, 'col2', NULL): NullType (of class org.apache.spark.sql.types.NullType$); line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)
```
This fixes the issue.
## How was this patch tested?
Added tests.
Closes#24576 from viirya/cast-null.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR fixes hadoop-3.2 test issues (except for the `hive-thriftserver` module):
1. Add `hive.metastore.schema.verification` and `datanucleus.schema.autoCreateAll` to HiveConf.
2. hadoop-3.2 supports accessing Hive metastore versions from 0.12 to 2.2.
After [SPARK-27176](https://issues.apache.org/jira/browse/SPARK-27176) and this PR, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes the following issues:
- [HIVE-6727](https://issues.apache.org/jira/browse/HIVE-6727): Table level stats for external tables are set incorrectly.
- [HIVE-15653](https://issues.apache.org/jira/browse/HIVE-15653): Some ALTER TABLE commands drop table stats.
- [SPARK-12014](https://issues.apache.org/jira/browse/SPARK-12014): Spark SQL query containing semicolon is broken in Beeline.
- [SPARK-25193](https://issues.apache.org/jira/browse/SPARK-25193): insert overwrite doesn't throw exception when drop old data fails.
- [SPARK-25919](https://issues.apache.org/jira/browse/SPARK-25919): Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
- [SPARK-26332](https://issues.apache.org/jira/browse/SPARK-26332): Spark sql write orc table on viewFS throws exception.
- [SPARK-26437](https://issues.apache.org/jira/browse/SPARK-26437): Decimal data becomes bigint to query, unable to query.
## How was this patch tested?
This PR tests Spark's Hadoop 3.2 profile on Jenkins, and #24591 tests Spark's Hadoop 2.7 profile on Jenkins. This PR closes #24591.
Closes#24391 from wangyum/SPARK-27402.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR goes to add `max_by()` and `min_by()` SQL aggregate functions.
Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by)
> max_by(x, y) → [same as x]
> Returns the value of x associated with the maximum value of y over all input values.
`min_by()` works similarly.
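Example usage of the new aggregates (the data is illustrative; assumes a SparkSession named `spark`):
```scala
spark.sql("""
  SELECT max_by(name, price) AS most_expensive,
         min_by(name, price) AS cheapest
  FROM VALUES ('apple', 3), ('banana', 1), ('cherry', 5) AS products(name, price)
""").show()
// most_expensive = cherry, cheapest = banana
```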
## How was this patch tested?
Added tests.
Closes#24557 from viirya/SPARK-27653.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Currently, the thread number of the broadcast-exchange thread pool is fixed, and keepAliveSeconds is also fixed at 60s.
```
object BroadcastExchangeExec {
private[execution] val executionContext = ExecutionContext.fromExecutorService(
ThreadUtils.newDaemonCachedThreadPool("broadcast-exchange", 128))
}
/**
* Create a cached thread pool whose max number of threads is `maxThreadNumber`. Thread names
* are formatted as prefix-ID, where ID is a unique, sequentially assigned integer.
*/
def newDaemonCachedThreadPool(
prefix: String, maxThreadNumber: Int, keepAliveSeconds: Int = 60): ThreadPoolExecutor = {
val threadFactory = namedThreadFactory(prefix)
val threadPool = new ThreadPoolExecutor(
maxThreadNumber, // corePoolSize: the max number of threads to create before queuing the tasks
maxThreadNumber, // maximumPoolSize: because we use LinkedBlockingDeque, this one is not used
keepAliveSeconds,
TimeUnit.SECONDS,
new LinkedBlockingQueue[Runnable],
threadFactory)
threadPool.allowCoreThreadTimeOut(true)
threadPool
}
```
But sometimes, if the Thread objects are not GCed quickly, this may cause a driver OOM. In such cases, we need to make this thread pool configurable.
A case has described in https://issues.apache.org/jira/browse/SPARK-26601
## How was this patch tested?
UT
Closes#23670 from caneGuy/zhoukang/make-broadcat-config.
Authored-by: zhoukang <zhoukang199191@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In File source V1, the statistics of `HadoopFsRelation` is `compressionFactor * sizeInBytesOfAllFiles`.
To follow it, we can implement the interface SupportsReportStatistics in FileScan and report the same statistics.
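A rough sketch of reporting the same statistics through the DSv2 interface (class and package names as of the final Spark 3.0 API; the fields are illustrative placeholders):
```scala
import java.util.OptionalLong
import org.apache.spark.sql.connector.read.{Statistics, SupportsReportStatistics}

trait FileScanStatistics extends SupportsReportStatistics {
  def totalFileSizeInBytes: Long   // sum of the selected files' sizes
  def compressionFactor: Double    // e.g. spark.sql.sources.fileCompressionFactor

  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong =
      OptionalLong.of((totalFileSizeInBytes * compressionFactor).toLong)
    override def numRows(): OptionalLong = OptionalLong.empty()
  }
}
```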
## How was this patch tested?
Unit test
Closes#24571 from gengliangwang/stats.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
To move DS v2 API to the catalyst module, we can't refer to an internal class (`MutableColumnarRow`) in `ColumnarBatch`.
This PR creates a read-only version of `MutableColumnarRow`, and use it in `ColumnarBatch`.
close https://github.com/apache/spark/pull/24546
## How was this patch tested?
existing tests
Closes#24581 from cloud-fan/mutable-row.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
We should add since info to all expressions.
SPARK-7886 Rand / Randn
af3746ce0d RLike, Like (I manually checked that it exists from 1.0.0)
SPARK-8262 Split
SPARK-8256 RegExpReplace
SPARK-8255 RegExpExtract
9aadcffabd Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0)
SPARK-14541 IfNull / NullIf / Nvl / Nvl2
SPARK-9080 IsNaN
SPARK-9168 NaNvl
## How was this patch tested?
N/A
Closes#24579 from HyukjinKwon/SPARK-27673.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
- the accumulator warning is too verbose
- when a test fails with schema mismatch, you never see the error message / exception
Closes#24549 from ericl/test-nits.
Lead-authored-by: Eric Liang <ekl@databricks.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
If a type is annotated, `ScalaReflection` can fail when the data type is an `Option`, a `Seq`, a `Map`, or another similar type. This is because it assumes we are dealing with `TypeRef`, while types with annotations are `AnnotatedType`.
The PR handles the case when the annotation is present.
## How was this patch tested?
added UT
Closes#24564 from mgaido91/SPARK-27625.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rtBasedAggregate
Normally, the aggregate operations that are invoked for an aggregation buffer for User Defined Aggregate Functions (UDAF) follow an order like initialize(), update(), eval() or initialize(), merge(), eval(). However, after the threshold configured by `spark.sql.objectHashAggregate.sortBased.fallbackThreshold` is reached, ObjectHashAggregate falls back to SortBasedAggregator, which invokes the merge or update operation without calling initialize() on the aggregate buffer.
## What changes were proposed in this pull request?
The fix here is to initialize aggregate buffers again when fallback to SortBasedAggregate operator happens.
## How was this patch tested?
The patch was tested as part of [SPARK-24935](https://issues.apache.org/jira/browse/SPARK-24935) as documented in PR https://github.com/apache/spark/pull/23778.
Closes#24149 from pgandhi999/SPARK-27207.
Authored-by: pgandhi <pgandhi@verizonmedia.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
### Background:
The data source option `pathGlobFilter` is introduced for Binary file format: https://github.com/apache/spark/pull/24354 , which can be used for filtering file names, e.g. reading `.png` files only while there is `.json` files in the same directory.
### Proposal:
Make the option `pathGlobFilter` a general option for all file sources. The path filtering should happen during path globbing on the driver.
### Motivation:
Filtering the file path names in file scan tasks on executors is kind of ugly.
### Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.
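Usage example of the generalized option (paths are illustrative; assumes a SparkSession named `spark`):
```scala
// Read only the .png files from a directory that also contains .json files.
val images = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .load("/path/to/dir")

// Per this change, the same option works for other file sources as well.
val logs = spark.read
  .option("pathGlobFilter", "*.json")
  .json("/path/to/dir")
```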
## How was this patch tested?
Unit tests
Closes#24518 from gengliangwang/globFilter.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The method `populateStartOffsets` uses an inappropriate identifier, `secondLatestBatchId`.
Since `secondLatestBatchId = latestBatchId - 1` and `offsetLog.get(latestBatchId - 1)` is an offset,
I renamed the identifier as follows:
`secondLatestOffsets = offsetLog.get(latestBatchId - 1)`
## How was this patch tested?
Existing UTs.
Closes#24550 from beliefer/fix-inappropriate-identifier.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
To move DS v2 to the catalyst module, we can't make v2 offset rely on v1 offset, as v1 offset is in sql/core.
## How was this patch tested?
existing tests
Closes#24538 from cloud-fan/offset.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Add a legacy flag to restore the old session init behavior, where SparkConf defaults take precedence over configs in a parent session.
Closes#24540 from jose-torres/oss.
Authored-by: Jose Torres <torres.joseph.f+github@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This adds the TableCatalog API proposed in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d).
For `TableCatalog` to use `Table`, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving `TableCapability`. Most of the files touched by this PR are import changes needed by this move.
## How was this patch tested?
This adds a test implementation and contract tests.
Closes#24246 from rdblue/SPARK-24252-add-table-catalog-api.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Fixed the `spark-<version>-yarn-shuffle.jar` artifact packaging to shade the native netty libraries:
- Shade the `META-INF/native/libnetty_*` native libraries when packaging the yarn shuffle service jar. This is required as the netty library loader derives the library name from the shaded package name.
- Updated the `org/spark_project` shade package prefix to `org/sparkproject` (i.e. removed the underscore), as the former breaks the netty native lib loading.
This was causing the yarn external shuffle service to fail when `spark.shuffle.io.mode=EPOLL`.
## How was this patch tested?
Manual tests
Closes#24502 from amuraru/SPARK-27610_master.
Authored-by: Adi Muraru <amuraru@adobe.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
`BaseStreamingSource` and `BaseStreamingSink` are used to unify the v1 and v2 streaming data source APIs in some code paths.
This PR removes these 2 interfaces, and let the v1 API extend v2 API to keep API compatibility.
The motivation is https://github.com/apache/spark/pull/24416 . We want to move data source v2 to catalyst module, but `BaseStreamingSource` and `BaseStreamingSink` are in sql/core.
## How was this patch tested?
existing tests
Closes#24471 from cloud-fan/streaming.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Because a temporary view is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Right now the explain result of a dataset may not be consistent with its collected result, because we use the pre-analyzed logical plan of the dataset in the explain command. The explain command analyzes the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by the explain command aren't the same as the plan of the dataset.
```scala
scala> spark.range(10).createOrReplaceTempView("test")
scala> spark.range(5).createOrReplaceTempView("test2")
scala> spark.sql("select * from test").createOrReplaceTempView("tmp001")
scala> val df = spark.sql("select * from tmp001")
scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001")
scala> df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
scala> df.explain(true)
```
Before:
```scala
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`
== Analyzed Logical Plan ==
id: bigint
Project [id#2L]
+- SubqueryAlias `tmp001`
+- Project [id#2L]
+- SubqueryAlias `test2`
+- Range (0, 5, step=1, splits=Some(12))
== Optimized Logical Plan ==
Range (0, 5, step=1, splits=Some(12))
== Physical Plan ==
*(1) Range (0, 5, step=1, splits=12)
```
After:
```scala
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- SubqueryAlias `tmp001`
+- Project [id#0L]
+- SubqueryAlias `test`
+- Range (0, 10, step=1, splits=Some(12))
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12))
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```
To fix it, this passes the Dataset's query execution when explaining it. The query execution contains the plan that was analyzed when the Dataset was created, which is consistent with the Dataset's result.
## How was this patch tested?
Manually test and unit test.
Closes#24464 from viirya/SPARK-27439-2.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
**Description from JIRA**
For the JDBC option `query`, we generate an alias name that starts with an underscore: `s"(${subquery}) _SPARK_GEN_JDBC_SUBQUERY_NAME${curId.getAndIncrement()}"`. This is not supported by Oracle.
Oracle doesn't seem to support identifier names that start with a non-alphabetic character (unless quoted), and it has length restrictions as well. [link](https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm)
In this PR, the generated alias name `SPARK_GEN_JDBC_SUBQUERY_NAME<int value>` no longer has the leading underscore, and it is shortened so it does not exceed the identifier length limit.
## How was this patch tested?
Tests are added for MySQL, Postgres, Oracle and DB2 to ensure enough coverage.
Closes#24532 from dilipbiswal/SPARK-27596.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>