ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
John Zhuge	a44b00dfe0	[SPARK-27813][SQL] DataSourceV2: Add DropTable logical operation ## What changes were proposed in this pull request? Support DROP TABLE from V2 catalogs. Move DROP TABLE into catalyst. Move parsing tests for DROP TABLE/VIEW to PlanResolutionSuite to validate existing behavior. Add new tests for the catalyst parser suite. Separate DROP VIEW into a different code path from DROP TABLE. Move DROP VIEW into catalyst as a new operator. Add a meaningful exception to indicate that views are not currently supported in v2 catalogs. ## How was this patch tested? New unit tests. Existing unit tests in catalyst and sql core. Closes #24686 from jzhuge/SPARK-27813-pr. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-31 00:56:07 +08:00
Yuming Wang	db3e746b64	[SPARK-27875][CORE][SQL][ML][K8S] Wrap all PrintWriter with Utils.tryWithResource ## What changes were proposed in this pull request? This PR wraps all `PrintWriter` with `Utils.tryWithResource` to prevent resource leaks. ## How was this patch tested? Existing tests Closes #24739 from wangyum/SPARK-27875. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-30 19:54:32 +09:00
John Zhuge	953b8e8206	[SPARK-26946][SQL][FOLLOWUP] Require lookup function ## What changes were proposed in this pull request? Require the lookup function with interface LookupCatalog. Rationale is in the review comments below. Make `Analyzer` abstract. BaseSessionStateBuilder and HiveSessionStateBuilder implement lookupCatalog with a call to SparkSession.catalog(). Existing test cases and those that don't need catalog lookup will use a newly added `TestAnalyzer` with a default lookup function that throws `CatalogNotFoundException("No catalog lookup function")`. Rewrote the unit test for LookupCatalog to demonstrate the interface can be used anywhere, not just Analyzer. Removed Analyzer parameter `lookupCatalog` because we can override in the following manner: ``` new Analyzer() { override def lookupCatalog(name: String): CatalogPlugin = ??? } ``` ## How was this patch tested? Existing unit tests. Closes #24689 from jzhuge/SPARK-26946-follow. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-30 09:22:42 +08:00
Wenchen Fan	6506616b97	[SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/22104 , we create the python-eval nodes at the end of the optimization phase, which causes a problem. After the main optimization batch, Filter and Project nodes are usually pushed to the bottom, near the scan node. However, if we extract Python UDFs from Filter/Project, and create a python-eval node under Filter/Project, it will break column pruning/filter pushdown of the scan node. There are some hacks in the `ExtractPythonUDFs` rule, to duplicate the column pruning and filter pushdown logic. However, it has some bugs as demonstrated in the new test case (only column pruning is broken). This PR removes the hacks and re-applies the column pruning and filter pushdown rules explicitly. Before: ``` ... == Analyzed Logical Plan == a: bigint Project [a#168L] +- Filter dummyUDF(a#168L) +- Relation[a#168L,b#169L] parquet == Optimized Logical Plan == Project [a#168L] +- Project [a#168L, b#169L] +- Filter pythonUDF0#174: boolean +- BatchEvalPython [dummyUDF(a#168L)], [a#168L, b#169L, pythonUDF0#174] +- Relation[a#168L,b#169L] parquet == Physical Plan == (2) Project [a#168L] +- (2) Project [a#168L, b#169L] +- (2) Filter pythonUDF0#174: boolean +- BatchEvalPython [dummyUDF(a#168L)], [a#168L, b#169L, pythonUDF0#174] +- (1) FileScan parquet [a#168L,b#169L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/spark-798bae3c-a2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint> ``` After: ``` ... 
== Analyzed Logical Plan == a: bigint Project [a#168L] +- Filter dummyUDF(a#168L) +- Relation[a#168L,b#169L] parquet == Optimized Logical Plan == Project [a#168L] +- Filter pythonUDF0#174: boolean +- BatchEvalPython [dummyUDF(a#168L)], [pythonUDF0#174] +- Project [a#168L] +- Relation[a#168L,b#169L] parquet == Physical Plan == (2) Project [a#168L] +- (2) Filter pythonUDF0#174: boolean +- BatchEvalPython [dummyUDF(a#168L)], [pythonUDF0#174] +- *(1) FileScan parquet [a#168L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/spark-9500cafb-78..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint> ``` ## How was this patch tested? new test Closes #24675 from cloud-fan/python. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-27 21:39:59 +09:00
Yesheng Ma	5e3520f7f4	[SPARK-27809][SQL] Make optional clauses order insensitive for CREATE DATABASE/VIEW SQL statement ## What changes were proposed in this pull request? Each time I write a complex CREATE DATABASE/VIEW statement, I have to open the .g4 file to find the EXACT order of clauses in CREATE TABLE statement. When the order is not right, I get a strange, confusing error message generated by ANTLR4. The original g4 grammar for CREATE VIEW is ``` CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [db_name.]view_name [(col_name1 [COMMENT col_comment1], ...)] [COMMENT table_comment] [TBLPROPERTIES (key1=val1, key2=val2, ...)] AS select_statement ``` The proposal is to make the following clauses order insensitive. ``` [COMMENT table_comment] [TBLPROPERTIES (key1=val1, key2=val2, ...)] ``` The original g4 grammar for CREATE DATABASE is ``` CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name [COMMENT comment_text] [LOCATION path] [WITH DBPROPERTIES (key1=val1, key2=val2, ...)] ``` The proposal is to make the following clauses order insensitive. ``` [COMMENT comment_text] [LOCATION path] [WITH DBPROPERTIES (key1=val1, key2=val2, ...)] ``` ## How was this patch tested? By adding new unit tests to test duplicate clauses and modifying some existing unit tests to test whether those clauses are actually order insensitive Closes #24681 from yeshengm/create-view-parser. Authored-by: Yesheng Ma <kimi.ysma@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-05-24 15:19:14 -07:00
maryannxue	de13f70ce1	[SPARK-27824][SQL] Make rule EliminateResolvedHint idempotent ## What changes were proposed in this pull request? This fix prevents the rule EliminateResolvedHint from being applied again if it's already applied. ## How was this patch tested? Added new UT. Closes #24692 from maryannxue/eliminatehint-bug. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-05-24 11:25:22 -07:00
Ryan Blue	6b28497d6f	[SPARK-27732][SQL] Add v2 CreateTable implementation. ## What changes were proposed in this pull request? This adds a v2 implementation of create table: * `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan * `CreateTableExec` is the physical plan ## How was this patch tested? Added resolution and v2 SQL tests. Closes #24617 from rdblue/SPARK-27732-add-v2-create-table. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-24 11:13:22 +08:00
gatorsmile	f94247ec90	[SPARK-27770][SQL][PART 1] Port AGGREGATES.sql ## What changes were proposed in this pull request? This PR is to port AGGREGATES.sql from PostgreSQL regression tests. `02ddd49932/src/test/regress/sql/aggregates.sql (L1-L143)` The expected results can be found in the link: https://github.com/postgres/postgres/blob/master/src/test/regress/expected/aggregates.out When porting the test cases, we found three PostgreSQL-specific features that do not exist in Spark SQL. - https://issues.apache.org/jira/browse/SPARK-27765: Type Casts: expression::type - https://issues.apache.org/jira/browse/SPARK-27766: Data type: POINT(x, y) - https://issues.apache.org/jira/browse/SPARK-27767: Built-in function: generate_series We also found two bugs: - https://issues.apache.org/jira/browse/SPARK-27768: Infinity, -Infinity, NaN should be recognized in a case insensitive manner - https://issues.apache.org/jira/browse/SPARK-27769: Handling of sublinks within outer-level aggregates. This PR also fixes the error message when the column can't be resolved. For running the regression tests, this PR also added three tables `aggtest`, `onek` and `tenk1` from the PostgreSQL data sets: `02ddd49932/src/test/regress/data` ## How was this patch tested? N/A Closes #24640 from gatorsmile/addTestCase. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2019-05-23 16:34:37 -07:00
HyukjinKwon	c1e555711b	Revert "Revert "[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values"" This reverts commit `855399bbad`.	2019-05-24 05:36:17 +09:00
HyukjinKwon	1ba4011a7f	Revert "Revert "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…"" This reverts commit `516b0fb537`.	2019-05-24 05:36:08 +09:00
Wenchen Fan	1a68fc38f0	[SPARK-27816][SQL] make TreeNode tag type safe ## What changes were proposed in this pull request? Add type parameter to `TreeNodeTag`. ## How was this patch tested? existing tests Closes #24687 from cloud-fan/tag. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-05-23 11:53:21 -07:00
HyukjinKwon	516b0fb537	Revert "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…" This reverts commit `40668c53ed`.	2019-05-24 03:17:06 +09:00
HyukjinKwon	855399bbad	Revert "[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values" This reverts commit `42cb4a2ccd`.	2019-05-24 03:16:24 +09:00
Wenchen Fan	a590a935b1	[SPARK-27806][SQL] byName/byPosition should apply to struct fields as well ## What changes were proposed in this pull request? When writing a query to data source v2, we have 2 modes to resolve the input query's output: byName or byPosition. For byName mode, we would reorder the top level columns according to the name, and add type cast if possible. If the names don't match, we fail. For byPosition mode, we don't do the reorder, and just add type cast directly if possible. However, for struct type fields, we always apply byName mode. We should ignore the name difference if byPosition mode is used. ## How was this patch tested? new tests Closes #24678 from cloud-fan/write. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-23 10:37:45 -07:00
Liu Xiao	bf617996aa	[SPARK-27800][SQL][DOC] Fix wrong answer of example for BitwiseXor ## What changes were proposed in this pull request? Fix example for bitwise xor function. 3 ^ 5 should be 6 rather than 2. - See https://spark.apache.org/docs/latest/api/sql/index.html#_14 ## How was this patch tested? manual tests Closes #24669 from alex-lx/master. Authored-by: Liu Xiao <hhdxlx@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-21 21:52:19 -07:00
Wenchen Fan	03c9e8adee	[SPARK-24586][SQL] Upcast should not allow casting from string to other types ## What changes were proposed in this pull request? When turning a Dataset to another Dataset, Spark will up cast the fields in the original Dataset to the type of corresponding fields in the target Dataset. However, the current upcast behavior is a little weird: we don't allow up casting from string to numeric, but allow non-numeric types as the target, like boolean, date, etc. As a result, `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throws an NPE during execution. The motivation of the up cast is to prevent things like runtime NPE, so it's more reasonable to make up cast stricter. This PR does 2 things: 1. rename `Cast.canSafeCast` to `Cast.canUpcast`, and support complex types 2. remove `Cast.mayTruncate` and replace it with `!Cast.canUpcast` Note that the up cast change also affects persistent view resolution. But since we don't support changing column types of an existing table, there is no behavior change here. ## How was this patch tested? new tests Closes #21586 from cloud-fan/cast. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-22 11:35:51 +08:00
Wenchen Fan	1e0facb60d	[SQL][DOC][MINOR] update documents for Table and WriteBuilder ## What changes were proposed in this pull request? Update the docs to reflect the changes made by https://github.com/apache/spark/pull/24129 ## How was this patch tested? N/A Closes #24658 from cloud-fan/comment. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-21 09:29:06 -07:00
Josh Rosen	604aa1b045	[SPARK-27786][SQL] Fix Sha1, Md5, and Base64 codegen when commons-codec is shaded ## What changes were proposed in this pull request? When running a custom build of Spark which shades `commons-codec`, the `Sha1` expression generates code which fails to compile: ``` org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 93: A method named "sha1Hex" is not declared in any enclosing class nor any supertype, nor through a static import ``` This is caused by an interaction between Spark's code generator and the shading: the current codegen template includes the string `org.apache.commons.codec.digest.DigestUtils.sha1Hex` as part of a larger string literal, preventing JarJarLinks from being able to replace the class name with the shaded class's name. As a result, the generated code still references the original unshaded class name, triggering an error in case the original unshaded dependency isn't on the path. This problem impacts the `Sha1`, `Md5`, and `Base64` expressions. To fix this problem and allow for proper shading, this PR updates the codegen templates to replace the hardcoded class names with `${classOf[<name>].getName}` calls. ## How was this patch tested? Existing tests. To ensure that I found all occurrences of this problem, I used IntelliJ's "Find in Path" to search for lines matching the regex `^(?!import|package).*(org|com|net|io)\.(?!apache\.spark)` and then filtered matches to inspect only non-test "Usage in string constants" cases. This isn't _perfect_ but I think it'll catch most cases. Closes #24655 from JoshRosen/fix-shaded-apache-commons. Authored-by: Josh Rosen <rosenville@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-21 21:18:34 +08:00
Wenchen Fan	0e6601acdf	[SPARK-27747][SQL] add a logical plan link in the physical plan ## What changes were proposed in this pull request? It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in https://github.com/apache/spark/pull/24389 This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied. The planner leverages this feature to put the logical plan into the physical plan. ## How was this patch tested? a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans. Closes #24626 from cloud-fan/link. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Peng Bo <bo.peng1019@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-05-20 13:42:25 -07:00
Ryan Blue	bc46feaced	[SPARK-27693][SQL] Add default catalog property Add a SQL config property for the default v2 catalog. Existing tests for regressions. Closes #24594 from rdblue/SPARK-27693-add-default-catalog-config. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-19 21:30:52 -07:00
HyukjinKwon	2431ab0999	[SPARK-27771][SQL] Add SQL description for grouping functions (cube, rollup, grouping and grouping_id) ## What changes were proposed in this pull request? Both look added as of 2.0 (see SPARK-12541 and SPARK-12706). I referred to existing docs and examples in other API docs. ## How was this patch tested? Manually built the documentation and, by running examples, by running `DESCRIBE FUNCTION EXTENDED`. Closes #24642 from HyukjinKwon/SPARK-27771. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-19 19:26:20 -07:00
Wenchen Fan	fc5bd6da77	[SPARK-27576][SQL] table capability to skip the output column resolution ## What changes were proposed in this pull request? Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of input query is compatible with the table. However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of input query at all. This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write. Note that, we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon. ## How was this patch tested? new test cases Closes #24469 from cloud-fan/schema-check. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-16 16:24:53 -07:00
Shixiong Zhu	6a317c8f01	[SPARK-27735][SS] Parsing interval string should be case-insensitive in SS ## What changes were proposed in this pull request? Some APIs in Structured Streaming require the user to specify an interval. Right now these APIs don't accept upper-case strings. This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support parsing upper-case strings, and fixes all APIs that need to parse an interval string. ## How was this patch tested? The new unit test. Closes #24619 from zsxwing/SPARK-27735. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-16 13:58:27 -07:00
Wenchen Fan	3e30a98810	[SPARK-27674][SQL] the hint should not be dropped after cache lookup ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/20365 . #20365 fixed this problem when the hint node is a root node. This PR fixes this problem for all the cases. ## How was this patch tested? a new test Closes #24580 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-05-15 15:47:52 -07:00
Xingbo Jiang	0bba5cf568	[SPARK-20774][SPARK-27036][SQL] Cancel the running broadcast execution on BroadcastTimeout ## What changes were proposed in this pull request? In the existing code, a broadcast execution timeout for the Future only causes a query failure, but the job running with the broadcast and the computation in the Future are not canceled. This wastes resources and slows down the other jobs. This PR tries to cancel both the running job and the running hashed relation construction thread. ## How was this patch tested? Add new test suite `BroadcastExchangeExec` Closes #24595 from jiangxb1987/SPARK-20774. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-05-15 14:47:15 -07:00
Sean Owen	bfb3ffe9b3	[SPARK-27682][CORE][GRAPHX][MLLIB] Replace use of collections and methods that will be removed in Scala 2.13 with work-alikes ## What changes were proposed in this pull request? This replaces use of collection classes like `MutableList` and `ArrayStack` with workalikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]` as its use was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13 ## How was this patch tested? Existing tests Closes #24586 from srowen/SPARK-27682. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-05-15 09:29:12 -05:00
xy_xin	fd9acf23b0	[SPARK-27713][SQL] Move org.apache.spark.sql.execution.* in catalyst to core ## What changes were proposed in this pull request? `RecordBinaryComparator`, `UnsafeExternalRowSorter` and `UnsafeKeyValueSorter` are currently located in catalyst; they should be moved to core, as they're used only in the physical plan. ## How was this patch tested? Existing tests. Closes #24607 from xianyinxin/SPARK-27713. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-15 15:24:21 +08:00
Ryan Blue	2da5b21834	[SPARK-24923][SQL] Implement v2 CreateTableAsSelect ## What changes were proposed in this pull request? This adds a v2 implementation for CTAS queries * Update the SQL parser to parse CREATE queries using multi-part identifiers * Update `CheckAnalysis` to validate partitioning references with the CTAS query schema * Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan * Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan * Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan * Add `findNestedField` to `StructType` to support reference validation ## How was this patch tested? We have been running these changes in production for several months. Also: * Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks * Add a test suite for v2 SQL, `DataSourceV2SQLSuite` * Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`) * Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation Closes #24570 from rdblue/SPARK-24923-add-v2-ctas. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-15 11:24:03 +08:00
Sean Owen	a10608cb82	[SPARK-27680][CORE][SQL][GRAPHX] Remove usage of Traversable ## What changes were proposed in this pull request? This removes usage of `Traversable`, which is removed in Scala 2.13. This is mostly an internal change, except for the change in the `SparkConf.setAll` method. See additional comments below. ## How was this patch tested? Existing tests. Closes #24584 from srowen/SPARK-27680. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-05-14 09:14:56 -05:00
mingbo.pb	66f5a42ca5	[SPARK-27638][SQL] Cast string to date/timestamp in binary comparisons with dates/timestamps ## What changes were proposed in this pull request? The example below works with both MySQL and Hive, but not with Spark. ``` mysql> select * from date_test where date_col >= '2000-1-1'; +------------+ | date_col | +------------+ | 2000-01-01 | +------------+ ``` The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and MySQL: Hive: Cast to Date, partial date is not supported MySQL: Cast to Date, certain "partial date" is supported by defining certain date string parse rules. Check out str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c As the date patterns below have been supported, the PR is to cast string to date when comparing string and date: ``` `yyyy` `yyyy-[m]m` `yyyy-[m]m-[d]d` `yyyy-[m]m-[d]d ` `yyyy-[m]m-[d]d ` `yyyy-[m]m-[d]dT ``` ## How was this patch tested? UT has been added Closes #24567 from pengbo/SPARK-27638. Authored-by: mingbo.pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-14 17:10:36 +08:00
Liang-Chi Hsieh	8b0bdaa8e0	[SPARK-27671][SQL] Fix error when casting from a nested null in a struct ## What changes were proposed in this pull request? Currently, when a nested field in a struct is null, casting from the struct throws an error. ```scala scala> sql("select cast(struct(1, null) as struct<a:int,b:int>)").show scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635) at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603) ``` Similarly, an inline table, which casts a null in a nested field under the hood, also throws an error. ```scala scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', null)) AS tab(x, y)").show org.apache.spark.sql.AnalysisException: failed to evaluate expression named_struct('col1', 10, 'col2', NULL): NullType (of class org.apache.spark.sql.types.NullType$); line 1 pos 14 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106) ``` This fixes the issue. ## How was this patch tested? Added tests. Closes #24576 from viirya/cast-null. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-13 12:40:46 -07:00
Liang-Chi Hsieh	d169b0aac3	[SPARK-27653][SQL] Add max_by() and min_by() SQL aggregate functions ## What changes were proposed in this pull request? This PR adds the `max_by()` and `min_by()` SQL aggregate functions. Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by) > max_by(x, y) → [same as x] > Returns the value of x associated with the maximum value of y over all input values. `min_by()` works similarly. ## How was this patch tested? Added tests. Closes #24557 from viirya/SPARK-27653. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-13 22:37:34 +08:00
zhoukang	126310ca68	[SPARK-26601][SQL] Make broadcast-exchange thread pool configurable ## What changes were proposed in this pull request? Currently, the number of threads in the broadcast-exchange thread pool is fixed, and keepAliveSeconds is also fixed at 60s. ``` object BroadcastExchangeExec { private[execution] val executionContext = ExecutionContext.fromExecutorService( ThreadUtils.newDaemonCachedThreadPool("broadcast-exchange", 128)) } /** * Create a cached thread pool whose max number of threads is `maxThreadNumber`. Thread names * are formatted as prefix-ID, where ID is a unique, sequentially assigned integer. */ def newDaemonCachedThreadPool( prefix: String, maxThreadNumber: Int, keepAliveSeconds: Int = 60): ThreadPoolExecutor = { val threadFactory = namedThreadFactory(prefix) val threadPool = new ThreadPoolExecutor( maxThreadNumber, // corePoolSize: the max number of threads to create before queuing the tasks maxThreadNumber, // maximumPoolSize: because we use LinkedBlockingDeque, this one is not used keepAliveSeconds, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable], threadFactory) threadPool.allowCoreThreadTimeOut(true) threadPool } ``` But sometimes, if the Thread objects are not garbage collected quickly, this may cause an OOM on the server (driver). In such cases, we need to make this thread pool configurable. A case is described in https://issues.apache.org/jira/browse/SPARK-26601 ## How was this patch tested? UT Closes #23670 from caneGuy/zhoukang/make-broadcat-config. Authored-by: zhoukang <zhoukang199191@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-13 20:40:21 +09:00
HyukjinKwon	c71f217de1	[SPARK-27673][SQL] Add `since` info to random, regex, null expressions ## What changes were proposed in this pull request? We should add since info to all expressions. SPARK-7886 Rand / Randn `af3746ce0d` RLike, Like (I manually checked that it exists from 1.0.0) SPARK-8262 Split SPARK-8256 RegExpReplace SPARK-8255 RegExpExtract `9aadcffabd` Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0) SPARK-14541 IfNull / NullIf / Nvl / Nvl2 SPARK-9080 IsNaN SPARK-9168 NaNvl ## How was this patch tested? N/A Closes #24579 from HyukjinKwon/SPARK-27673. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-10 09:24:04 -07:00
HyukjinKwon	3442fcaa9b	[SPARK-27672][SQL] Add `since` info to string expressions ## What changes were proposed in this pull request? This PR adds since information to all the string expressions below: SPARK-8241 ConcatWs SPARK-16276 Elt SPARK-1995 Upper / Lower SPARK-20750 StringReplace SPARK-8266 StringTranslate SPARK-8244 FindInSet SPARK-8253 StringTrimLeft SPARK-8260 StringTrimRight SPARK-8267 StringTrim SPARK-8247 StringInstr SPARK-8264 SubstringIndex SPARK-8249 StringLocate SPARK-8252 StringLPad SPARK-8259 StringRPad SPARK-16281 ParseUrl SPARK-9154 FormatString SPARK-8269 Initcap SPARK-8257 StringRepeat SPARK-8261 StringSpace SPARK-8263 Substring SPARK-21007 Right SPARK-21007 Left SPARK-8248 Length SPARK-20749 BitLength SPARK-20749 OctetLength SPARK-8270 Levenshtein SPARK-8271 SoundEx SPARK-8238 Ascii SPARK-20748 Chr SPARK-8239 Base64 SPARK-8268 UnBase64 SPARK-8242 Decode SPARK-8243 Encode SPARK-8245 format_number SPARK-16285 Sentences ## How was this patch tested? N/A Closes #24578 from HyukjinKwon/SPARK-27672. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-10 09:11:12 -07:00
Marco Gaido	78748b5752	[SPARK-27625][SQL] ScalaReflection support for annotated types ## What changes were proposed in this pull request? If a type is annotated, `ScalaReflection` can fail if the datatype is an `Option`, a `Seq`, a `Map` or another similar type. This is because it assumes we are dealing with `TypeRef`, while types with annotations are `AnnotatedType`. The PR handles the case where an annotation is present. ## How was this patch tested? added UT Closes #24564 from mgaido91/SPARK-27625. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-10 22:48:36 +08:00
pgandhi	0969d7aa0c	[SPARK-27207][SQL] : Ensure aggregate buffers are initialized again for SortBasedAggregate Normally, the aggregate operations that are invoked for an aggregation buffer for User Defined Aggregate Functions (UDAF) follow the order initialize(), update(), eval() OR initialize(), merge(), eval(). However, after a certain threshold configurable by spark.sql.objectHashAggregate.sortBased.fallbackThreshold is reached, ObjectHashAggregate falls back to SortBasedAggregator, which invokes the merge or update operation without calling initialize() on the aggregate buffer. ## What changes were proposed in this pull request? The fix here is to initialize aggregate buffers again when the fallback to the SortBasedAggregate operator happens. ## How was this patch tested? The patch was tested as part of [SPARK-24935](https://issues.apache.org/jira/browse/SPARK-24935) as documented in PR https://github.com/apache/spark/pull/23778. Closes #24149 from pgandhi999/SPARK-27207. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-09 11:12:20 +08:00
Jose Torres	83f628b57d	[SPARK-27253][SQL][FOLLOW-UP] Add a legacy flag to restore old session init behavior ## What changes were proposed in this pull request? Add a legacy flag to restore the old session init behavior, where SparkConf defaults take precedence over configs in a parent session. Closes #24540 from jose-torres/oss. Authored-by: Jose Torres <torres.joseph.f+github@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-07 20:04:09 -07:00
Ryan Blue	303ee3fce0	[SPARK-24252][SQL] Add TableCatalog API ## What changes were proposed in this pull request? This adds the TableCatalog API proposed in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d). For `TableCatalog` to use `Table`, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving `TableCapability`. Most of the files touched by this PR are import changes needed by this move. ## How was this patch tested? This adds a test implementation and contract tests. Closes #24246 from rdblue/SPARK-24252-add-table-catalog-api. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-08 10:31:06 +08:00
Adi Muraru	8ef4da753d	[SPARK-27610][YARN] Shade netty native libraries ## What changes were proposed in this pull request? Fixed the `spark-<version>-yarn-shuffle.jar` artifact packaging to shade the native netty libraries: - shade the `META-INF/native/libnetty_*` native libraries when packaging the yarn shuffle service jar. This is required as the netty library loader derives the native library name from the shaded package name. - updated the `org/spark_project` shade package prefix to `org/sparkproject` (i.e. removed the underscore) as the former breaks the netty native lib loading. This was causing the yarn external shuffle service to fail when spark.shuffle.io.mode=EPOLL ## How was this patch tested? Manual tests Closes #24502 from amuraru/SPARK-27610_master. Authored-by: Adi Muraru <amuraru@adobe.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-05-07 10:47:36 -07:00
gaoweikang	3859ca37d9	[SPARK-27586][SQL] Improve binary comparison: replace Scala's for-comprehension if statements with while loop ## What changes were proposed in this pull request? This PR replaces for-comprehension if statement with while loop to gain better performance in `TypeUtils.compareBinary`. ## How was this patch tested? Add UT to test old version and new version comparison result Closes #24494 from woudygao/opt_binary_compare. Authored-by: gaoweikang <gaoweikang@bytedance.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-02 20:33:27 -07:00
Marco Gaido	7a8cc8e071	[SPARK-27607][SQL] Improve Row.toString performance ## What changes were proposed in this pull request? `Row.toString` is currently causing the useless creation of an `Array` containing all the values in the row before generating the string containing it. This operation adds a considerable overhead. The PR proposes to avoid this operation in order to get a faster implementation. ## How was this patch tested? Run ```scala test("Row toString perf test") { val n = 100000 val rows = (1 to n).map { i => Row(i, i.toDouble, i.toString, i.toShort, true, null) } // warmup (1 to 10).foreach { _ => rows.foreach(_.toString) } val times = (1 to 100).map { _ => val t0 = System.nanoTime() rows.foreach(_.toString) val t1 = System.nanoTime() t1 - t0 } // scalastyle:off println println(s"Avg time on ${times.length} iterations for $n toString:" + s" ${times.sum.toDouble / times.length / 1e6} ms") // scalastyle:on println } ``` Before the PR: ``` Avg time on 100 iterations for 100000 toString: 61.08408419 ms ``` After the PR: ``` Avg time on 100 iterations for 100000 toString: 38.16539432 ms ``` This means the new implementation is about 1.60X faster than the original one. Closes #24505 from mgaido91/SPARK-27607. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-02 07:20:33 -07:00
HyukjinKwon	df8aa7ba8a	[SPARK-27606][SQL] Deprecate 'extended' field in ExpressionDescription/ExpressionInfo ## What changes were proposed in this pull request? After we added other fields, `arguments`, `examples`, `note` and `since` at SPARK-21485 and `deprecated` at SPARK-27328, we have a nicer way to separately describe extended usages. The `extended` field and method at `ExpressionDescription`/`ExpressionInfo` is now pretty useless - it's not used on the Spark side and only exists to keep backward compatibility. This PR proposes to deprecate it. ## How was this patch tested? Manually checked that the deprecation warning is properly shown. Closes #24500 from HyukjinKwon/SPARK-27606. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-02 21:10:00 +09:00
gatorsmile	2da406cae5	[SPARK-27618][SQL][FOLLOW-UP] Unnecessary access to externalCatalog ## What changes were proposed in this pull request? This PR is to add test cases for ensuring that we do not have unnecessary access to externalCatalog. In the future, we can follow these examples to improve our test coverage in this area. ## How was this patch tested? N/A Closes #24511 from gatorsmile/addTestcaseSpark-27618. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-01 20:09:46 -07:00
HyukjinKwon	3670826af6	[SPARK-26921][R][DOCS] Document Arrow optimization and vectorized R APIs ## What changes were proposed in this pull request? This PR adds SparkR with Arrow optimization documentation. Note that the CRAN issue on the Arrow side does not look likely to be fixed soon, IMHO, even after Spark 3.0. If it happens to be fixed, I will update this doc later. Another note is that the Arrow R package itself requires R 3.5+. So, I intentionally didn't note this. ## How was this patch tested? Manually built and checked. Closes #24506 from HyukjinKwon/SPARK-26924. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-02 10:02:14 +09:00
Artem Kalchenko	a35043c9e2	[SPARK-27591][SQL] Fix UnivocityParser for UserDefinedType ## What changes were proposed in this pull request? Fix a bug in UnivocityParser: the makeConverter method didn't work correctly for UserDefinedType. ## How was this patch tested? A test suite for UnivocityParser has been extended. Closes #24496 from kalkolab/spark-27591. Authored-by: Artem Kalchenko <artem.kalchenko@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-01 08:27:51 +09:00
Xiangrui Meng	618d6bff71	[SPARK-27588] Binary file data source fails fast and doesn't attempt to read very large files ## What changes were proposed in this pull request? If a file is too big (>2GB), we should fail fast and do not try to read the file. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24483 from mengxr/SPARK-27588. Authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-29 16:24:49 -07:00
Sean Owen	8a17d26784	[SPARK-27536][CORE][ML][SQL][STREAMING] Remove most use of scala.language.existentials ## What changes were proposed in this pull request? I want to get rid of as much use of `scala.language.existentials` as possible for 3.0. It's a complicated language feature that generates warnings unless this value is imported. It might even be on the way out of Scala: https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785 For Spark, it comes up mostly where the code plays fast and loose with generic types, not the advanced situations you'll often see referenced where this feature is explained. For example, it comes up in cases where a function returns something like `(String, Class[_])`. Scala doesn't like matching this to any other instance of `(String, Class[_])` because doing so requires inferring the existence of some type that satisfies both. Seems obvious if the generic type is a wildcard, but, not technically something Scala likes to let you get away with. This is a large PR, and it only gets rid of _most_ instances of `scala.language.existentials`. The change should be all compile-time and shouldn't affect APIs or logic. Many of the changes simply touch up sloppiness about generic types, making the known correct value explicit in the code. Some fixes involve being more explicit about the existence of generic types in methods. For instance, `def foo(arg: Class[_])` seems innocent enough but should really be declared `def foo[T](arg: Class[T])` to let Scala select and fix a single type when evaluating calls to `foo`. For kind of surprising reasons, this comes up in places where code evaluates a tuple of things that involve a generic type, but is OK if the two parts of the tuple are evaluated separately. One key change was altering `Utils.classForName(...): Class[_]` to the more correct `Utils.classForName[T](...): Class[T]`. 
This caused a number of small but positive changes to callers that otherwise had to cast the result. In several tests, `Dataset[_]` was used where `DataFrame` seems to be the clear intent. Finally, in a few cases in MLlib, the return type `this.type` was used where there are no subclasses of the class that uses it. This really isn't needed and causes issues for Scala reasoning about the return type. These are just changed to be concrete classes as return types. After this change, we have only a few classes that still import `scala.language.existentials` (because modifying them would require extensive rewrites to fix) and no build warnings. ## How was this patch tested? Existing tests. Closes #24431 from srowen/SPARK-27536. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-29 11:02:01 -05:00
Jash Gala	90085a1847	[SPARK-23619][DOCS] Add output description for some generator expressions / functions ## What changes were proposed in this pull request? This PR addresses SPARK-23619: https://issues.apache.org/jira/browse/SPARK-23619 It adds additional comments indicating the default column names for the `explode` and `posexplode` functions in Spark-SQL. Functions for which comments have been updated so far: * stack * inline * explode * posexplode * explode_outer * posexplode_outer ## How was this patch tested? This is just a change in the comments. The package builds and tests successfully after the change. Closes #23748 from jashgala/SPARK-23619. Authored-by: Jash Gala <jashgala@amazon.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-27 10:30:12 +09:00
uncleGen	6328be78f9	[MINOR][TEST][DOC] Execute action miss name message ## What changes were proposed in this pull request? some minor updates: - `Execute` action miss `name` message - typo in SS document - typo in SQLConf ## How was this patch tested? N/A Closes #24466 from uncleGen/minor-fix. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-27 09:28:31 +08:00
Liang-Chi Hsieh	8b86326521	[SPARK-27551][SQL] Improve error message of mismatched types for CASE WHEN ## What changes were proposed in this pull request? When there are mismatched types among cases or else values in case when expression, current error message is hard to read to figure out what and where the mismatch is. This patch simply improves the error message for mismatched types for case when. Before: ```scala scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y")))) org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;; ``` After: ```scala scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y")))) org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE array<struct<y:bigint>> END;; ``` ## How was this patch tested? Added unit test. Closes #24453 from viirya/SPARK-27551. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-25 08:47:19 -07:00
HyukjinKwon	a30983db57	[SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility ## What changes were proposed in this pull request? The code below currently infers as decimal but previously it was inferred as string. In branch-2.4, type inference path for decimal and parsing data are different. `2a8343121e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (L153)` `c284c4e1f6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (L125)` So the code below: ```scala scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema() ``` produced string as its type. ``` root |-- _c0: string (nullable = true) ``` In the current master, it now infers decimal as below: ``` root |-- _c0: decimal(2,0) (nullable = true) ``` It happened after https://github.com/apache/spark/pull/22979 because, now after this PR, we only have one way to parse decimal: `7a83d71403/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala (L92)` After the fix: ``` root |-- _c0: string (nullable = true) ``` This PR proposes to restore the previous behaviour back in `CSVInferSchema`. ## How was this patch tested? Manually tested and unit tests were added. Closes #24437 from HyukjinKwon/SPARK-27512. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-24 16:22:07 +09:00
Gengliang Wang	00f2f311f7	[SPARK-27128][SQL] Migrate JSON to File Data Source V2 ## What changes were proposed in this pull request? Migrate JSON to File Data Source V2 ## How was this patch tested? Unit test Closes #24058 from gengliangwang/jsonV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-23 22:39:59 +08:00
pengbo	d9b2ce0f0f	[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values ## What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/24286. As gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well. ``` > select key from test; 2 NULL 1 spark-sql> desc extended test key; col_name key data_type int comment NULL min 1 max 2 num_nulls 1 distinct_count 2 ``` The distinct count should be distinct_count + 1 when the column contains null values. ## How was this patch tested? Existing tests & new UT added. Closes #24436 from pengbo/aggregation_estimation. Authored-by: pengbo <bo.peng1019@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-22 20:30:08 -07:00
Maxim Gekk	43a73e387c	[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default ## What changes were proposed in this pull request? In the PR, I propose to use the `TIMESTAMP_MICROS` logical type for timestamps written to parquet files. The type matches semantically to Catalyst's `TimestampType`, and stores microseconds since epoch in UTC time zone. This will allow to avoid conversions of microseconds to nanoseconds and to Julian calendar. Also this will reduce sizes of written parquet files. ## How was this patch tested? By existing test suites. Closes #24425 from MaxGekk/parquet-timestamp_micros. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:06:39 +09:00
Maxim Gekk	79d3bc0409	[SPARK-27438][SQL] Parse strings with timestamps by to_timestamp() in microsecond precision ## What changes were proposed in this pull request? In the PR, I propose to parse strings to timestamps in microsecond precision by the `to_timestamp()` function if the specified pattern contains a sub-pattern for seconds fractions. Closes #24342 ## How was this patch tested? By `DateFunctionsSuite` and `DateExpressionsSuite` Closes #24420 from MaxGekk/to_timestamp-microseconds3. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-22 19:41:32 +08:00
Maxim Gekk	d61b3bc875	[SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types ## What changes were proposed in this pull request? In the PR, I propose more precise description of `TimestampType` and `DateType`, how they store timestamps and dates internally. Closes #24424 from MaxGekk/timestamp-date-type-doc. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-21 16:53:11 +09:00
Yifei Huang	163a6e2982	[SPARK-27514] Skip collapsing windows with empty window expressions ## What changes were proposed in this pull request? A previous change moved the removal of empty window expressions to the RemoveNoopOperations rule, which comes after the CollapseWindow rule. Therefore, by the time we get to CollapseWindow, we aren't guaranteed that empty windows have been removed. This change checks that the window expressions are not empty, and only collapses the windows if both windows are non-empty. A lengthier description and repro steps here: https://issues.apache.org/jira/browse/SPARK-27514 ## How was this patch tested? A unit test, plus I reran the breaking case mentioned in the Jira ticket. Closes #24411 from yifeih/yh/spark-27514. Authored-by: Yifei Huang <yifeih@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-19 14:04:44 +08:00
Kris Mok	50bdc9befa	[SPARK-27423][SQL][FOLLOWUP] Minor polishes to Cast codegen templates for Date <-> Timestamp ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/24332 introduced an unnecessary `import` statement and two slight issues in the codegen templates in `Cast` for `Date` <-> `Timestamp`. This PR removes the unused import statement and fixes the slight codegen issue. The issue in those two codegen templates is this pattern: ```scala val zid = JavaCode.global( ctx.addReferenceObj("zoneId", zoneId, "java.time.ZoneId"), zoneId.getClass) ``` `zoneId` can refer to an instance of a non-public class, e.g. `java.time.ZoneRegion`, and while this code correctly puts in the 3rd argument to `ctx.addReferenceObj()`, it's still passing `zoneId.getClass` to `JavaCode.global()` which is not desirable, but doesn't cause any immediate bugs in this particular case, because `zid` is used in an expression immediately afterwards. If this `zid` ever needs to spill to any explicitly typed variables, e.g. a local variable, and if the spill handling uses the `javaType` on this `GlobalVariable`, it'd generate code that looks like: ```java java.time.ZoneRegion value1 = ((java.time.ZoneId) references[2] /* literal */); ``` which would then be a real bug: - a non-accessible type `java.time.ZoneRegion` is referenced in the generated code, and - `ZoneId` -> `ZoneRegion` requires an explicit downcast. ## How was this patch tested? Existing tests. This PR does not change behavior, and the original PR won't cause any real behavior bug to begin with. Closes #24392 from rednaxelafx/spark-27423-followup. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 14:27:33 +08:00
Dilip Biswal	e1c90d66bb	[SPARK-19712][SQL] Pushdown LeftSemi/LeftAnti below join ## What changes were proposed in this pull request? This PR adds support for pushing down LeftSemi and LeftAnti joins below the Join operator. This is a prerequisite work thats needed for the subsequent task of moving the subquery rewrites to the beginning of optimization phase. The larger PR is [here](https://github.com/apache/spark/pull/23211) . This PR addresses the comment at [link](https://github.com/apache/spark/pull/23211#issuecomment-445705922). ## How was this patch tested? Added tests under LeftSemiAntiJoinPushDownSuite. Closes #24331 from dilipbiswal/SPARK-19712-pushleftsemi-belowjoin. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 20:30:20 +08:00
pengbo	54b0d1e0ef	[SPARK-27416][SQL] UnsafeMapData & UnsafeArrayData Kryo serialization … ## What changes were proposed in this pull request? Finish the remaining work of https://github.com/apache/spark/pull/24317, https://github.com/apache/spark/pull/9030 a. Implement Kryo serialization for UnsafeArrayData b. Fix the UnsafeMapData Java/Kryo serialization issue when two machines have different Oops sizes c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? Corresponding unit tests have been added and run. Closes #24357 from pengbo/SPARK-27416_new. Authored-by: pengbo <bo.peng1019@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 13:03:00 +08:00
liwensun	26ed65f415	[SPARK-27453] Pass partitionBy as options in DataFrameWriter ## What changes were proposed in this pull request? Pass partitionBy columns as options and feature-flag this behavior. ## How was this patch tested? A new unit test. Closes #24365 from liwensun/partitionby. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2019-04-16 15:03:16 -07:00
Liang-Chi Hsieh	b404e02574	[SPARK-27476][SQL] Refactoring SchemaPruning rule to remove duplicate code ## What changes were proposed in this pull request? In SchemaPruning rule, there is duplicate code for data source v1 and v2. Their logic is the same and we can refactor the rule to remove duplicate code. ## How was this patch tested? Existing tests. Closes #24383 from viirya/SPARK-27476. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-16 14:50:37 -07:00
pengbo	c58a4fed8d	[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit… ## What changes were proposed in this pull request? The upper bound on the output row count of a group-by is the product of the distinct counts of the group-by columns. However, a column containing only null values causes the estimated output row count to be 0, which is incorrect. Ex: col1 (distinct: 2, rowCount 2) col2 (distinct: 0, rowCount 2) => group by col1, col2 Actual: output rows: 0 Expected: output rows: 2 ## How was this patch tested? A corresponding unit test has been added, and manual testing has been done in our TPC-DS benchmark environment. Closes #24286 from pengbo/master. Lead-authored-by: pengbo <bo.peng1019@gmail.com> Co-authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-15 15:37:07 -07:00
Wenchen Fan	0407070945	[SPARK-27444][SQL] multi-select can be used in subquery ## What changes were proposed in this pull request? This is a regression caused by https://github.com/apache/spark/pull/24150 `select * from (from a select * select *)` is supported in 2.4, and we should keep supporting it. This PR merges the parser rule for single and multi select statements, as they are very similar. ## How was this patch tested? a new test case Closes #24348 from cloud-fan/parser. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 20:57:34 +08:00
Kris Mok	bbbe54aa79	[SPARK-27199][SQL][FOLLOWUP] Fix bug in codegen templates in UnixTime and FromUnixTime ## What changes were proposed in this pull request? SPARK-27199 introduced the use of `ZoneId` instead of `TimeZone` in a few date/time expressions. There were 3 occurrences of `ctx.addReferenceObj("zoneId", zoneId)` in that PR, which had a bug because while the `java.time.ZoneId` base type is public, the actual concrete implementation classes are not public, so using the 2-arg version of `CodegenContext.addReferenceObj` would incorrectly generate code that reference non-public types (`java.time.ZoneRegion`, to be specific). The 3-arg version should be used, with the class name of the referenced object explicitly specified to the public base type. One of such occurrences was caught in testing in the main PR of SPARK-27199 (https://github.com/apache/spark/pull/24141), for `DateFormatClass`. But the other 2 occurrences slipped through because there were no test cases that covered them. Example of this bug in the current Apache Spark master, in a Spark Shell: ``` scala> Seq(("2016-04-08", "yyyy-MM-dd")).toDF("s", "f").repartition(1).selectExpr("to_unix_timestamp(s, f)").show ... java.lang.IllegalAccessError: tried to access class java.time.ZoneRegion from class org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1 ``` This PR fixes the codegen issues and adds the corresponding unit tests. ## How was this patch tested? Enhanced tests in `DateExpressionsSuite` for `to_unix_timestamp` and `from_unixtime`. Closes #24352 from rednaxelafx/fix-spark-27199. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 13:31:18 +08:00
maryannxue	43da473c1c	[SPARK-27225][SQL] Implement join strategy hints ## What changes were proposed in this pull request? This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better. The hinted strategy will be used for the join with which it is associated if it is applicable/doable. Conflict resolving rules in case of multiple hints: 1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint. 2. Conflicts between two sides of the join: a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```. b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size. ## How was this patch tested? Added new UTs. Closes #24164 from maryannxue/join-hints. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 00:14:37 +08:00
chakravarthiT	074533334d	[SPARK-27088][SQL] Add a configuration to set log level for each batch at RuleExecutor ## What changes were proposed in this pull request? Similar to #22406, which made the log level for plan changes by each rule configurable, this PR makes the log level for plan changes by each batch configurable, and I have reused the same configuration: "spark.sql.optimizer.planChangeLog.level". Config proposed in this PR: spark.sql.optimizer.planChangeLog.batches - enable plan change logging only for a set of specified batches, separated by commas. ## How was this patch tested? Added UT, also tested manually and attached screenshots below. 1) Setting spark.sql.optimizer.planChangeLog.level to warn. ![settingLogLevelToWarn](https://user-images.githubusercontent.com/45845595/54556730-8803dd00-49df-11e9-95ab-ebb0c8d735ef.png) 2) Setting spark.sql.optimizer.planChangeLog.batches to Resolution and Subquery. ![settingBatchestoLog](https://user-images.githubusercontent.com/45845595/54556740-8cc89100-49df-11e9-80ab-fbbbe1ff2cdf.png) 3) Plan change logging enabled only for a set of specified batches (Resolution and Subquery) ![batchloggingOp](https://user-images.githubusercontent.com/45845595/54556788-ab2e8c80-49df-11e9-9ae0-57815f552896.png) Closes #24136 from chakravarthiT/logBatches. Lead-authored-by: chakravarthiT <45845595+chakravarthiT@users.noreply.github.com> Co-authored-by: chakravarthiT <tcchakra@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-11 10:02:27 +09:00
ocaballero	181d190c60	[MINOR][SQL] Unnecessary access to externalCatalog Avoid accessing the external catalog when it is not needed ## What changes were proposed in this pull request? The existsFunction function has been changed because it unnecessarily accessed the externalCatalog to find out whether the database exists in cases where the function is in the functionRegistry. ## How was this patch tested? It has been tested through spark-shell and by checking the Hive metastore logs. Inside spark-shell we use spark.table (% tableA%). SelectExpr ("trim (% columnA%)") in the current version and it appears every time: org.apache.hadoop.hive.metastore.HiveMetaStore.audit: cmd = get_database: default Once the change is made, no record appears. Closes #24312 from OCaballero/master. Authored-by: ocaballero <oliver.caballero.alvarez@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-11 10:00:09 +09:00
Maxim Gekk	ab8710b579	[SPARK-27423][SQL] Cast DATE <-> TIMESTAMP according to the SQL standard ## What changes were proposed in this pull request? According to SQL standard, value of `DATE` type is union of year, month, dayInMonth, and it is independent from any time zones. To convert it to Catalyst's `TIMESTAMP`, `DATE` value should be "extended" by the time at midnight - `00:00:00`. The resulted local date+time should be considered as a timestamp in the session time zone, and casted to microseconds since epoch in `UTC` accordingly. The reverse casting from `TIMESTAMP` to `DATE` should be performed in the similar way. `TIMESTAMP` values should be represented as a local date+time in the session time zone. And the time component should be just removed. For example, `TIMESTAMP 2019-04-10 00:10:12` -> `DATE 2019-04-10`. The resulted date is converted to days since epoch `1970-01-01`. ## How was this patch tested? The changes were tested by existing test suites - `DateFunctionsSuite`, `DateExpressionsSuite` and `CastSuite`. Closes #24332 from MaxGekk/cast-timestamp-to-date2. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 22:41:19 +08:00
Maxim Gekk	1470f23ec9	[SPARK-27422][SQL] current_date() should return current date in the session time zone ## What changes were proposed in this pull request? In the PR, I propose to revert 2 commits `06abd06112` and `61561c1c2d`, and take current date via `LocalDate.now` in the session time zone. The result is stored as days since epoch `1970-01-01`. ## How was this patch tested? It was tested by `DateExpressionsSuite`, `DateFunctionsSuite`, `DateTimeUtilsSuite`, and `ComputeCurrentTimeSuite`. Closes #24330 from MaxGekk/current-date2. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 21:54:50 +08:00
韩田田00222924	85e5d4f141	[SPARK-24872] Replace taking the $symbol with $sqlOperator in BinaryOperator's toString method ## What changes were proposed in this pull request? For BinaryOperator's toString method, it's better to use `$sqlOperator` instead of `$symbol`. ## How was this patch tested? We can test this patch with unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #21826 from httfighter/SPARK-24872. Authored-by: 韩田田00222924 <han.tiantian@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 16:58:01 +08:00
Wenchen Fan	2e90574dd0	[SPARK-27414][SQL] make it clear that date type is timezone independent ## What changes were proposed in this pull request? In SQL standard, date type is a union of the `year`, `month` and `day` fields. It's timezone independent, which means it does not represent a specific point in the timeline. Spark SQL follows the SQL standard, this PR is to make it clear that date type is timezone independent 1. improve the doc to highlight that date is timezone independent. 2. when converting string to date, uses the java time API that can directly parse a `LocalDate` from a string, instead of converting `LocalDate` to a `Instant` at UTC first. 3. when converting date to string, uses the java time API that can directly format a `LocalDate` to a string, instead of converting `LocalDate` to a `Instant` at UTC first. 2 and 3 should not introduce any behavior changes. ## How was this patch tested? existing tests Closes #24325 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 16:39:28 +08:00
Ryan Blue	58674d54ba	[SPARK-27181][SQL] Add public transform API ## What changes were proposed in this pull request? This adds a public Expression API that can be used to pass partition transformations to data sources. ## How was this patch tested? Existing tests to validate no regressions. Added transform cases to DDL suite and v1 conversions suite. Closes #24117 from rdblue/add-public-transform-api. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 14:30:39 +08:00
Maxim Gekk	63e4bf42c2	[SPARK-27401][SQL] Refactoring conversion of Timestamp to/from java.sql.Timestamp ## What changes were proposed in this pull request? In the PR, I propose a simpler implementation of `toJavaTimestamp()`/`fromJavaTimestamp()` by reusing existing functions of `DateTimeUtils`. This allows us to: - Simplify the implementation of `toJavaTimestamp()`, and properly handle negative inputs. - Detect `Long` overflow in conversion of milliseconds (`java.sql.Timestamp`) to microseconds (Catalyst's Timestamp). ## How was this patch tested? By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `CastSuite`. And by new benchmark for export/import timestamps added to `DateTimeBenchmark`: Before: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 290 335 49 17.2 58.0 1.0X Collect longs 1234 1681 487 4.1 246.8 0.2X Collect timestamps 1718 1755 63 2.9 343.7 0.2X ``` After: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 283 301 19 17.7 56.6 1.0X Collect longs 1048 1087 36 4.8 209.6 0.3X Collect timestamps 1425 1479 56 3.5 285.1 0.2X ``` Closes #24311 from MaxGekk/conv-java-sql-date-timestamp. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-09 15:42:27 -07:00
mingbo_pb	3e4cfe9dbc	[SPARK-27406][SQL] UnsafeArrayData serialization breaks when two machines have different Oops size ## What changes were proposed in this pull request? ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize endpoints. When the UnsafeArrayData is serialized with Java serialization, the BYTE_ARRAY_OFFSET in memory can change if two machines have different pointer width (Oops in JVM). This PR fixes this issue in the same way as https://github.com/apache/spark/pull/9030 ## How was this patch tested? A manual test has been done in our tpcds environment, and a corresponding unit test case has been added as well. Closes #24317 from pengbo/SPARK-27406. Authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 15:41:42 +08:00
Hyukjin Kwon	f16dfb9129	[SPARK-27328][SQL] Add 'deprecated' in ExpressionDescription for extended usage and SQL doc ## What changes were proposed in this pull request? This PR proposes two things: 1. Add `deprecated` field to `ExpressionDescription` so that it can be shown in our SQL function documentation (https://spark.apache.org/docs/latest/api/sql/), and it can be shown via `DESCRIBE FUNCTION EXTENDED`. 2. While I am here, add some more restrictions for `note()` and `since()`. It looks like some documentation is broken due to a malformed `note`: ![Screen Shot 2019-03-31 at 3 00 53 PM](https://user-images.githubusercontent.com/6477701/55285518-a3e88500-53c8-11e9-9e99-41d857794fbe.png) It should start with 4 spaces and end with a newline. I added some asserts, and fixed the instances together while I am here. This is technically a breaking change but I think it's too trivial to note somewhere (and we're in Spark 3.0.0). This PR adds `deprecated` property into `from_utc_timestamp` and `to_utc_timestamp` (it's deprecated as of #24195) as examples of using this field. Now it shows the deprecation information as below: - SQL documentation is shown as below: ![Screen Shot 2019-03-31 at 3 07 31 PM](https://user-images.githubusercontent.com/6477701/55285537-2113fa00-53c9-11e9-9932-f5693a03332d.png) - `DESCRIBE FUNCTION EXTENDED from_utc_timestamp;`: ``` Function: from_utc_timestamp Class: org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp Usage: from_utc_timestamp(timestamp, timezone) - Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'. Extended Usage: Examples: > SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul'); 2016-08-31 09:00:00 Since: 1.5.0 Deprecated: Deprecated since 3.0.0. See SPARK-25496. ``` ## How was this patch tested? 
Manually tested via: - For documentation verification: ``` $ cd sql $ sh create-docs.sh ``` - For checking description: ``` $ ./bin/spark-sql ``` ``` spark-sql> DESCRIBE FUNCTION EXTENDED from_utc_timestamp; spark-sql> DESCRIBE FUNCTION EXTENDED to_utc_timestamp; ``` Closes #24259 from HyukjinKwon/SPARK-27328. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 13:49:42 +08:00
Gengliang Wang	d50603a37c	[SPARK-27271][SQL] Migrate Text to File Data Source V2 ## What changes were proposed in this pull request? Migrate Text source to File Data Source V2 ## How was this patch tested? Unit test Closes #24207 from gengliangwang/textV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-08 10:15:22 -07:00
Maxim Gekk	00241733a6	[SPARK-27405][SQL][TEST] Restrict the range of generated random timestamps ## What changes were proposed in this pull request? In the PR, I propose to restrict the range of random timestamp literals generated in `LiteralGenerator.timestampLiteralGen`. The generator creates instances of `java.sql.Timestamp` by passing milliseconds since epoch as `Long` type. Converting the milliseconds to microseconds can cause arithmetic overflow of Long type because Catalyst's Timestamp type stores microseconds since epoch in `Long` type internally as well. Proposed interval of random milliseconds is `[Long.MinValue / 1000, Long.MaxValue / 1000]`. For example, generated timestamp `new java.sql.Timestamp(-3948373668011580000)` causes `Long` overflow at the method: ```scala def fromJavaTimestamp(t: Timestamp): SQLTimestamp = { ... MILLISECONDS.toMicros(t.getTime()) + NANOSECONDS.toMicros(t.getNanos()) % NANOS_PER_MICROS ... } ``` because `t.getTime()` returns `-3948373668011580000` which is multiplied by `1000` at `MILLISECONDS.toMicros`, and the result `-3948373668011580000000` is less than `Long.MinValue`. ## How was this patch tested? By `DateExpressionsSuite` in the PR https://github.com/apache/spark/pull/24311 Closes #24316 from MaxGekk/random-timestamps-gen. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-08 09:53:00 -07:00
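The arithmetic behind the SPARK-27405 entry above can be checked with a short sketch. It uses `Math.multiplyExact` (a swapped-in technique, not the `TimeUnit` call quoted in the commit message) so that an out-of-range multiplication raises an exception. The object `TimestampRangeSketch` and its members are hypothetical names for illustration only.

```scala
// Hypothetical helper, not part of Spark.
object TimestampRangeSketch {
  // Bounds of the interval proposed in the commit message.
  val minMillis: Long = Long.MinValue / 1000
  val maxMillis: Long = Long.MaxValue / 1000

  // Milliseconds to microseconds; throws ArithmeticException if the result
  // does not fit in a Long.
  def millisToMicrosExact(millis: Long): Long =
    Math.multiplyExact(millis, 1000L)

  // True when converting the given milliseconds to microseconds would overflow.
  def overflows(millis: Long): Boolean =
    try { millisToMicrosExact(millis); false }
    catch { case _: ArithmeticException => true }
}
```

Under these assumptions, the value from the commit message overflows, while both ends of the proposed interval convert safely.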
Dongjoon Hyun	982c4c8e3c	[SPARK-27390][CORE][SQL][TEST] Fix package name mismatch ## What changes were proposed in this pull request? This PR aims to clean up package name mismatches. ## How was this patch tested? Pass the Jenkins. Closes #24300 from dongjoon-hyun/SPARK-27390. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-05 11:50:37 -07:00
gatorsmile	5678e687c6	[SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused ## What changes were proposed in this pull request? With this change, we can easily identify the plan difference when subquery is reused. When the reuse is enabled, the plan looks like ``` == Physical Plan == CollectLimit 1 +- (1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253] : :- Subquery subquery240 : : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L]) : : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : : +- Scan[obj#12] : +- ReusedSubquery Subquery subquery240 +- (1) SerializeFromObject +- Scan[obj#12] ``` When the reuse is disabled, the plan looks like ``` == Physical Plan == CollectLimit 1 +- (1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299] : :- Subquery subquery286 : : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L]) : : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : : +- Scan[obj#12] : +- Subquery subquery287 : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298]) : +- Exchange SinglePartition : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L]) : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : +- 
Scan[obj#12] +- *(1) SerializeFromObject +- Scan[obj#12] ``` ## How was this patch tested? Modified the existing test. Closes #24258 from gatorsmile/followupSPARK-27279. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-05 08:31:41 -07:00
Aayushmaan Jain	04e53d2e3c	[SPARK-27342][SQL] Optimize Limit 0 queries ## What changes were proposed in this pull request? With this change, unnecessary file scans are avoided in case of Limit 0 queries. I added a case (rule) to `PropagateEmptyRelation` to replace `GlobalLimit 0` and `LocalLimit 0` nodes with an empty `LocalRelation`. This prunes the subtree under the Limit 0 node and further allows other rules of `PropagateEmptyRelation` to optimize the Logical Plan - while remaining semantically consistent with the Limit 0 query. For instance: Query: `SELECT * FROM table1 INNER JOIN (SELECT * FROM table2 LIMIT 0) AS table2 ON table1.id = table2.id` Optimized Plan without fix: ``` Join Inner, (id#79 = id#87) :- Filter isnotnull(id#79) : +- Relation[id#79,num1#80] parquet +- Filter isnotnull(id#87) +- GlobalLimit 0 +- LocalLimit 0 +- Relation[id#87,num2#88] parquet ``` Optimized Plan with fix: `LocalRelation <empty>, [id#75, num1#76, id#77, num2#78]` ## How was this patch tested? Added unit tests to verify Limit 0 optimization for: - Simple query containing Limit 0 - Inner Join, Left Outer Join, Right Outer Join, Full Outer Join queries containing Limit 0 as one of their children - Nested Inner Joins between 3 tables with one of them having a Limit 0 clause. - Intersect query wherein one of the subqueries was a Limit 0 query. Closes #24271 from aayushmaanjain/optimize-limit0. Authored-by: Aayushmaan Jain <aayushmaan.jain42@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-04 21:19:40 -07:00
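The rewrite described in the Limit 0 entry above can be illustrated on a toy plan tree. This is only a sketch of the idea: the `Plan` hierarchy and `optimize` below are hypothetical stand-ins, not Catalyst classes, and unlike the real `LocalRelation` the toy `Empty` node does not carry an output schema.

```scala
// Toy logical plan, unrelated to Spark's actual classes.
object LimitZeroSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan
  case class Limit(n: Int, child: Plan) extends Plan
  case class InnerJoin(left: Plan, right: Plan) extends Plan
  case object Empty extends Plan

  // Bottom-up rewrite: a Limit 0 subtree becomes Empty, and emptiness then
  // propagates through filters, limits, and inner joins.
  def optimize(plan: Plan): Plan = plan match {
    case Limit(0, _) => Empty
    case Limit(n, child) =>
      optimize(child) match {
        case Empty => Empty
        case c     => Limit(n, c)
      }
    case Filter(cond, child) =>
      optimize(child) match {
        case Empty => Empty
        case c     => Filter(cond, c)
      }
    case InnerJoin(left, right) =>
      (optimize(left), optimize(right)) match {
        case (Empty, _) | (_, Empty) => Empty
        case (l, r)                  => InnerJoin(l, r)
      }
    case other => other
  }
}
```

Applied to a join shaped like the one in the commit message, the whole toy plan collapses to `Empty`, so the subtree under the Limit 0 node never needs to be scanned.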
Ruben Fiszel	0e44a51f2e	[SPARK-24345][SQL] Improve ParseError stop location when offending symbol is a token In the case where the offending symbol is a CommonToken, this PR increases the accuracy of the start and stop origin by leveraging the start and stop index information from CommonToken. Closes #21334 from rubenfiszel/patch-1. Lead-authored-by: Ruben Fiszel <rubenfiszel@gmail.com> Co-authored-by: rubenfiszel <rfiszel@palantir.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-04 18:20:34 -05:00
Dongjoon Hyun	b51763612a	Revert "[SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not" This reverts commit `5888b15d9c`.	2019-04-03 09:41:13 -07:00
Wenchen Fan	ffb362a705	[SPARK-19712][SQL][FOLLOW-UP] reduce code duplication ## What changes were proposed in this pull request? abstract some common code into a method. ## How was this patch tested? existing tests Closes #24281 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 00:37:57 +08:00
Maxim Gekk	1bc672366d	[SPARK-27344][SQL][TEST] Support the LocalDate and Instant classes in Java Bean encoders ## What changes were proposed in this pull request? - Added new test for Java Bean encoder of the classes: `java.time.LocalDate` and `java.time.Instant`. - Updated comment for `Encoders.bean` - New Row getters: `getLocalDate` and `getInstant` - Extended `inferDataType` to infer types for `java.time.LocalDate` -> `DateType` and `java.time.Instant` -> `TimestampType`. ## How was this patch tested? By `JavaBeanDeserializationSuite` Closes #24273 from MaxGekk/bean-instant-localdate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 17:45:59 +08:00
Dilip Biswal	3286bff942	[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator. ## What changes were proposed in this pull request? In the PR, we raise an AnalysisError when we detect the presence of aggregate expressions in where clause. Here is the problem description from the JIRA. Aggregate functions should not be allowed in WHERE clause. But Spark SQL throws an exception when generating codes. It is supposed to throw an exception during parsing or analyzing. Here is an example: ``` val df = spark.sql("select * from t where sum(ta) > 0") df.explain(true) df.show() ``` Resulting exception: ``` Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint)) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) at scala.Option.getOrElse(Option.scala:138) ``` Checked the behaviour of other databases, and all of them return an error: PostgreSQL ``` select * from foo where max(c1) > 0; Error ERROR: aggregate functions are not allowed in WHERE Position: 25 ``` DB2 ``` db2 => select * from foo where max(c1) > 0; SQL0120N Invalid use of an aggregate function or OLAP function. ``` Oracle ``` select * from foo where max(c1) > 0; ORA-00934: group function is not allowed here ``` MySQL ``` select * from foo where max(c1) > 0; Invalid use of group function ``` Update This PR has been enhanced to report error when expressions such as Aggregate, Window, Generate are hosted by operators where they are invalid. ## How was this patch tested? Added tests in AnalysisErrorSuite and group-by.sql Closes #24209 from dilipbiswal/SPARK-27255. 
Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 13:05:06 +08:00
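The kind of analysis-time check described in the SPARK-27255 entry above can be sketched on a toy expression tree. Everything here (`Expr`, `containsAggregate`, `checkWhere`, and the error text) is a hypothetical illustration rather than Spark's analyzer code.

```scala
// Toy expression tree, unrelated to Catalyst's Expression classes.
object WhereClauseCheckSketch {
  sealed trait Expr
  case class Column(name: String) extends Expr
  case class Literal(value: Int) extends Expr
  case class Sum(child: Expr) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  // True if an aggregate appears anywhere in the expression.
  def containsAggregate(e: Expr): Boolean = e match {
    case Sum(_)            => true
    case GreaterThan(l, r) => containsAggregate(l) || containsAggregate(r)
    case _                 => false
  }

  // Reject a WHERE condition up front instead of failing later in code generation.
  def checkWhere(condition: Expr): Either[String, Expr] =
    if (containsAggregate(condition)) Left("aggregate functions are not allowed in WHERE")
    else Right(condition)
}
```

Returning `Either` keeps the sketch free of exceptions; a real analyzer would more likely raise an analysis error at this point.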
Maxim Gekk	1d20d13149	[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception. One of the reasons for deprecation is that the functions violate the semantics of `TimestampType`, which is the number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 10:55:56 +08:00
Dilip Biswal	b8b5acdd41	[SPARK-19712][SQL][FOLLOW-UP] Don't do partial pushdown when pushing down LeftAnti joins below Aggregate or Window operators. ## What changes were proposed in this pull request? After [23750](https://github.com/apache/spark/pull/23750), we may pushdown left anti joins below aggregate and window operators with a partial join condition. This is not correct and was pointed out by hvanhovell and cloud-fan [here](https://github.com/apache/spark/pull/23750#discussion_r270017097). This pr addresses their comments. ## How was this patch tested? Added two new tests to verify the behaviour. Closes #24253 from dilipbiswal/SPARK-19712-followup. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 09:56:27 +08:00
Hyukjin Kwon	949d712839	[SPARK-27346][SQL] Loosen the newline assert condition on 'examples' field in ExpressionInfo ## What changes were proposed in this pull request? I haven't tested this myself on Windows and I am not 100% sure if this is going to cause an actual problem. However, this one line: `827383a97c/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java (L82)` made me investigate a lot today. Given my speculation, if Spark is built in Linux and it's executed on Windows, it looks possible for multiline strings, like, `5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (L146-L150)` to throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`. I think this has not been found yet because this particular code is not released yet (see SPARK-26426). It looks better to just loosen the condition and forget about this stuff. This should be backported into branch-2.4 as well. ## How was this patch tested? N/A Closes #24274 from HyukjinKwon/SPARK-27346. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-03 08:27:41 +09:00
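The mismatch speculated about in the SPARK-27346 entry above can be reproduced without a Windows machine by passing the separator in as a parameter. The sketch below is an assumption-laden illustration: `strictStart`, `looseStart`, and the `"    Examples:"` prefix are hypothetical and are not claimed to match the actual check in `ExpressionInfo.java`.

```scala
// Hypothetical checks, not the code from ExpressionInfo.java.
object NewlineCheckSketch {
  // Strict: the text must begin with the given platform separator.
  def strictStart(text: String, separator: String): Boolean =
    text.startsWith(separator + "    Examples:")

  // Loosened: accept either a Unix or a Windows newline.
  def looseStart(text: String): Boolean =
    text.startsWith("\n    Examples:") || text.startsWith("\r\n    Examples:")
}
```

A string compiled on Linux starts with `\n`, so the strict check fails when the runtime separator is `\r\n`, while the loosened check accepts it on either platform.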
Sean Owen	d4420b455a	[SPARK-27323][CORE][SQL][STREAMING] Use Single-Abstract-Method support in Scala 2.12 to simplify code ## What changes were proposed in this pull request? Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here. ## How was this patch tested? Existing tests. Closes #24241 from srowen/SPARK-27323. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-02 07:37:05 -07:00
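For readers unfamiliar with the Single-Abstract-Method support mentioned in the SPARK-27323 entry above, the sketch below contrasts the two spellings on `java.util.Comparator`. The object and value names are illustrative assumptions, and the example requires Scala 2.12 or later.

```scala
import java.util.Comparator

object SamSketch {
  // Explicit anonymous class, as needed before Scala 2.12.
  val verbose: Comparator[String] = new Comparator[String] {
    override def compare(a: String, b: String): Int = a.length - b.length
  }

  // Scala 2.12+: a function literal is converted to the SAM type.
  val concise: Comparator[String] = (a: String, b: String) => a.length - b.length
}
```

Both values behave identically; the second form only removes the boilerplate, which matches the commit's note that no logic should change.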
Dongjoon Hyun	d575a453db	Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" This reverts commit `c5e83ab92c`.	2019-04-02 01:05:54 -07:00
Maxim Gekk	c5e83ab92c	[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception. One of the reasons for deprecation is that the functions violate the semantics of `TimestampType`, which is the number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-02 10:20:06 +08:00
Liang-Chi Hsieh	eaf008ad0e	[SPARK-27329][SQL] Pruning nested field in map of map key and value from object serializers ## What changes were proposed in this pull request? If an object serializer has a map of map key/value, pruning nested fields should work. Previously, the object serializer pruner didn't recursively prune nested fields located deep in a map key or value. This patch addresses that by slightly refactoring the pruning logic. ## How was this patch tested? Added tests. Closes #24260 from viirya/SPARK-27329. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-01 13:53:55 -07:00
Marco Gaido	5888b15d9c	[SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not ## What changes were proposed in this pull request? When `GetMapValue` contains a foldable Map and a non-foldable key, `SimplifyExtractValueOps` fails to optimize it transforming it into case when statements. The PR adds a case for covering this situation too. ## How was this patch tested? added UT Closes #24223 from mgaido91/SPARK-27278. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-01 09:09:06 -07:00
Maxim Gekk	d332958109	[SPARK-27325][SQL] Add implicit encoders for LocalDate and Instant ## What changes were proposed in this pull request? Added implicit encoders for the `java.time.LocalDate` and `java.time.Instant` classes. This allows creation of datasets from instances of the types. ## How was this patch tested? Added new tests to `JavaDatasetSuite` and `DatasetSuite`. Closes #24249 from MaxGekk/instant-localdate-encoders. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-01 23:02:48 +08:00
Marco Gaido	8012f55a9b	[SPARK-26812][SQL] Report correct nullability for complex datatypes in Union ## What changes were proposed in this pull request? When there is a `Union`, the reported output datatypes are the ones of the first plan and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means that the nullability of the inner elements is the one of the first plan. If this is not compatible with the one of other plans, errors can happen (as reported in the JIRA). The PR proposes to update the nullability of the inner elements of complex datatypes according to most permissive value of all the plans. ## How was this patch tested? added UT Closes #23726 from mgaido91/SPARK-26812. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-01 22:22:10 +08:00
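The "most permissive value" rule from the SPARK-26812 entry above can be sketched with a toy type model. The `DataType`, `IntType`, and `ArrayType` definitions below are simplified stand-ins that merely borrow familiar names; they are not Spark's classes, and `merge` is a hypothetical helper.

```scala
// Toy type model, unrelated to org.apache.spark.sql.types.
object UnionNullabilitySketch {
  sealed trait DataType
  case object IntType extends DataType
  case class ArrayType(element: DataType, containsNull: Boolean) extends DataType

  // Merge two structurally equal types; an inner element is nullable in the
  // result if it is nullable on either side.
  def merge(a: DataType, b: DataType): DataType = (a, b) match {
    case (ArrayType(ea, na), ArrayType(eb, nb)) => ArrayType(merge(ea, eb), na || nb)
    case (x, y) if x == y                       => x
    case _ =>
      throw new IllegalArgumentException(s"Incompatible types: $a and $b")
  }
}
```

Folding `merge` over the output types of all children of a union yields the nullability that is safe for every child, including for nested arrays.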
Takuya UESHIN	f176dd3f28	[SPARK-27314][SQL] Deduplicate exprIds for Union. ## What changes were proposed in this pull request? We have had a potential problem with `Union` when the children have the same expression id in their outputs, which happens with a self-union. ## How was this patch tested? Modified some tests to adjust plan changes. Closes #24236 from ueshin/issues/SPARK-27314/dedup_union. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-29 14:05:38 -07:00
Maxim Gekk	61561c1c2d	[SPARK-27252][SQL][FOLLOWUP] Calculate min and max days independently from time zone in ComputeCurrentTimeSuite ## What changes were proposed in this pull request? This fixes the `analyzer should replace current_date with literals` test in `ComputeCurrentTimeSuite` by making calculation of `min` and `max` days independent from time zone. ## How was this patch tested? by `ComputeCurrentTimeSuite`. Closes #24240 from MaxGekk/current-date-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-29 14:28:36 -05:00
Maxim Gekk	06abd06112	[SPARK-27252][SQL] Make current_date() independent from time zones ## What changes were proposed in this pull request? This makes the `CurrentDate` expression and `current_date` function independent from time zone settings. The new result is the number of days since epoch in the `UTC` time zone. Previously, Spark shifted the current date (in `UTC` time zone) according to the session time zone, which violates the definition of `DateType` - number of days since epoch (which is an absolute point in time, midnight of Jan 1 1970 in UTC time). The changes make `CurrentDate` consistent with `CurrentTimestamp`, which is independent from time zone too. ## How was this patch tested? The changes were tested by existing test suites like `DateExpressionsSuite`. Closes #24185 from MaxGekk/current-date. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-28 18:44:08 -07:00


3607 commits