ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Sean Owen	2d871ad0e7	[SPARK-29392][CORE][SQL][STREAMING] Remove symbol literal syntax 'foo, deprecated in Scala 2.13, in favor of Symbol("foo") ### What changes were proposed in this pull request? Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")` ### Why are the changes needed? Avoids ~50 deprecation warnings when attempting to build with 2.13. ### Does this PR introduce any user-facing change? None, should be no functional change at all. ### How was this patch tested? Existing tests. Closes #26061 from srowen/SPARK-29392. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-08 20:15:37 -07:00
gwang3	b3eba29493	[SPARK-29189][FOLLOW-UP][SQL] Beautify config name ### What changes were proposed in this pull request? Beautify comment ### Why are the changes needed? The config name now is pretty weird. ### Does this PR introduce any user-facing change? No ### How was this patch tested? No test. Closes #26054 from wangshisan/SPARK-29189. Authored-by: gwang3 <gwang3@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-08 15:44:42 -07:00
Guilherme	de360e96d7	[SPARK-29336][SQL] Fix the implementation of QuantileSummaries.merge (guarantee that the relativeError will be respected) ### What changes were proposed in this pull request? Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test-case showing the previous bug. ### Why are the changes needed? The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the result of multiple parallel QuantileSummaries. The current implementation violates some invariants and therefore the effective error can be larger than specified. ### Does this PR introduce any user-facing change? Yes, for some cases, the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example: ```scala var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } ``` In the current build it returns: ``` 16 12 10 11 17 ``` I couldn't run the code with this patch applied to double check the implementation. Can someone please confirm it now outputs at most `10`, please? ### How was this patch tested? A new unit test was added to uncover the previous bug. Closes #26029 from sitegui/SPARK-29336. Authored-by: Guilherme <sitegui@sitegui.com.br> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-08 08:11:10 -05:00
Maxim Gekk	4e6d31f570	[SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default ### What changes were proposed in this pull request? Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes behavior of the `size()` function for `NULL`. The function will return `NULL` for `NULL` instead of `-1`. ### Why are the changes needed? There is the agreement in the PR https://github.com/apache/spark/pull/21598#issuecomment-399695523 to change behavior in Spark 3.0. ### Does this PR introduce any user-facing change? Yes. Before: ```sql spark-sql> select size(NULL); -1 ``` After: ```sql spark-sql> select size(NULL); NULL ``` ### How was this patch tested? By the `check outputs of expression examples` test in `SQLQuerySuite` which runs expression examples. Closes #26051 from MaxGekk/sizeof-null-returns-null. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-08 20:57:10 +09:00
Wenchen Fan	948a6e80fe	[SPARK-28892][SQL][FOLLOWUP] add resolved logical plan for UPDATE TABLE ### What changes were proposed in this pull request? Add back the resolved logical plan for UPDATE TABLE. It was in https://github.com/apache/spark/pull/25626 before but was removed later. ### Why are the changes needed? In https://github.com/apache/spark/pull/25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan. ### Does this PR introduce any user-facing change? no, UPDATE is still not supported yet. ### How was this patch tested? new tests. Closes #26025 from cloud-fan/update. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-07 23:36:26 -07:00
gwang3	64fe82b519	[SPARK-29189][SQL] Add an option to ignore block locations when listing file ### What changes were proposed in this pull request? In our PROD env, we have a pure Spark cluster where computation is separated from the storage layer, which I think is pretty common. In such a deploy mode, data locality is never reachable, and the Spark scheduler has some configurations to reduce the waiting time for data locality (e.g. "spark.locality.wait"). The problem is that, in the file listing phase, the location information of all the files, with all the blocks inside each file, is fetched from the distributed file system. In a PROD environment, a table can be so huge that fetching all this location information takes tens of seconds. To improve this scenario, Spark needs to provide an option where data locality can be totally ignored: all we need in the file listing phase are the file locations, without any block location information. ### Why are the changes needed? We made a benchmark in our PROD env; after ignoring the block locations, we got a pretty huge improvement. Table Size | Total File Number | Total Block Number | List File Duration(With Block Location) | List File Duration(Without Block Location) -- | -- | -- | -- | -- 22.6T | 30000 | 120000 | 16.841s | 1.730s 28.8 T | 42001 | 148964 | 10.099s | 2.858s 3.4 T | 20000 | 20000 | 5.833s | 4.881s ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Via ut. Closes #25869 from wangshisan/SPARK-29189. Authored-by: gwang3 <gwang3@ebay.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-10-07 14:52:55 -05:00
Maxim Gekk	b10344956d	[SPARK-29342][SQL] Make casting of string values to intervals case insensitive ### What changes were proposed in this pull request? In the PR, I propose to pass the `Pattern.CASE_INSENSITIVE` flag while compiling interval patterns in `CalendarInterval`. This makes casting string values to intervals case insensitive and tolerant to case of the `interval`, `year(s)`, `month(s)`, `week(s)`, `day(s)`, `hour(s)`, `minute(s)`, `second(s)`, `millisecond(s)` and `microsecond(s)`. ### Why are the changes needed? There are at least 2 reasons: - To maintain feature parity with PostgreSQL which is not sensitive to case: ```sql # select cast('10 Days' as INTERVAL); interval ---------- 10 days (1 row) ``` - Spark is tolerant to case of interval literals. Case insensitivity in casting should be convenient for Spark users. ```sql spark-sql> SELECT INTERVAL 1 YEAR 1 WEEK; interval 1 years 1 weeks ``` ### Does this PR introduce any user-facing change? Yes, current implementation produces `NULL` for `interval`, `year`, ... `microsecond` that are not in lower case. Before: ```sql spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL); NULL ``` After: ```sql spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL); interval 1 weeks 3 days ``` ### How was this patch tested? - by new tests in `CalendarIntervalSuite.java` - new test in `CastSuite` Closes #26010 from MaxGekk/interval-case-insensitive. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-07 09:33:01 -07:00
Maxim Gekk	18b7ad2fc5	[SPARK-29328][SQL] Fix calculation of mean seconds per month ### What changes were proposed in this pull request? I introduced new constants `SECONDS_PER_MONTH` and `MILLIS_PER_MONTH`, and reused them in calculations of seconds/milliseconds per month. `SECONDS_PER_MONTH` is 2629746 because the average year of the Gregorian calendar is 365.2425 days long or 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year. ### Why are the changes needed? Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar) but the existing implementation assumes 31 days per month or 12 * 31 = 372 days. That's far away from the truth. ### Does this PR introduce any user-facing change? Yes, the changes affect at least 3 methods in `GroupStateImpl`, `EventTimeWatermark` and `MonthsBetween`. For example, the `months_between()` function will return different results in some cases. Before: ```sql spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.4516129 ``` After: ```sql spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.45996838 ``` ### How was this patch tested? By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. Closes #25998 from MaxGekk/days-in-year. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-07 08:47:46 -05:00
Maxim Gekk	932e2619ce	[SPARK-29365][SQL] Support dates and timestamps subtraction ### What changes were proposed in this pull request? Added new rules to `TypeCoercion.DateTimeOperations` for the `Subtract` expression which is replaced by existing `TimestampDiff` expression if one of its parameters has the `DATE` type and another one is the `TIMESTAMP` type. The date argument is cast to timestamp. ### Why are the changes needed? - To maintain feature parity with PostgreSQL which supports subtraction of a date from a timestamp and a timestamp from a date: ```sql maxim=# select timestamp'now' - date'epoch'; ?column? ---------------------------- 18175 days 21:07:33.412875 (1 row) maxim=# select date'2020-01-01' - timestamp'now'; ?column? ------------------------- 86 days 02:52:00.945296 (1 row) ``` - To conform to the SQL standard which defines datetime subtraction as an interval. ### Does this PR introduce any user-facing change? Yes, currently the queries below fail with the error: ```sql spark-sql> select timestamp'now' - date'2019-10-01'; Error in query: cannot resolve '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' (timestamp and date).; line 1 pos 7; 'Project [unresolvedalias((1570385107234000 - 18170), None)] +- OneRowRelation ``` after the changes: ```sql spark-sql> select timestamp'now' - date'2019-10-01'; interval 5 days 21 hours 4 minutes 55 seconds 878 milliseconds ``` ### How was this patch tested? - Add new cases to the `rule for date/timestamp operations` test in `TypeCoercionSuite` - by 2 new tests in `datetime.sql` Closes #26036 from MaxGekk/date-timestamp-subtract. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-07 16:47:00 +09:00
Maxim Gekk	eecef75350	[SPARK-29355][SQL] Support timestamps subtraction ### What changes were proposed in this pull request? Added new expression `TimestampDiff` for timestamp subtractions. It accepts 2 timestamp expressions and returns a result of the `CalendarIntervalType`. While creating an instance of `CalendarInterval`, it initializes only the microsecond field with the difference of the given timestamps in microseconds, and sets the `months` field to zero. Also I added a rule for converting `Subtract` to `TimestampDiff`, and enabled already ported test queries in `postgreSQL/timestamp.sql`. ### Why are the changes needed? To maintain feature parity with PostgreSQL which allows getting the timestamp difference: ```sql # select timestamp'today' - timestamp'yesterday'; ?column? ---------- 1 day (1 row) ``` ### Does this PR introduce any user-facing change? Yes, previously users got the following error from timestamp subtraction: ```sql spark-sql> select timestamp'today' - timestamp'yesterday'; Error in query: cannot resolve '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' due to data type mismatch: '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' requires (numeric or interval) type, not timestamp; line 1 pos 7; 'Project [unresolvedalias((1570136400000000 - 1570050000000000), None)] +- OneRowRelation ``` after the changes they should get an interval: ```sql spark-sql> select timestamp'today' - timestamp'yesterday'; interval 1 days ``` ### How was this patch tested? - Added tests for `TimestampDiff` to `DateExpressionsSuite` - By new test in `TypeCoercionSuite`. - Enabled tests in `postgreSQL/timestamp.sql`. Closes #26022 from MaxGekk/timestamp-diff. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-04 09:39:19 -07:00
Wenchen Fan	275e044ba8	[SPARK-29039][SQL] centralize the catalog and table lookup logic ### What changes were proposed in this pull request? Currently we deal with different `ParsedStatement` in many places and write duplicated catalog/table lookup logic. In general the lookup logic is 1. try to look up the catalog by name. If no such catalog, and default catalog is not set, convert `ParsedStatement` to v1 command like `ShowDatabasesCommand`. Otherwise, convert `ParsedStatement` to v2 command like `ShowNamespaces`. 2. try to look up the table by name. If no such table, fail. If the table is a `V1Table`, convert `ParsedStatement` to v1 command like `CreateTable`. Otherwise, convert `ParsedStatement` to v2 command like `CreateV2Table`. However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places. This PR centralizes the catalog/table lookup logic by 3 rules. 1. `ResolveCatalogs` (in catalyst). This rule resolves v2 catalog from the multipart identifier in SQL statements, and converts the statement to v2 command if the resolved catalog is not session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), put an `UnresolvedV2Table` in the command. 2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`. 3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is session catalog. For commands that don't need to resolve the table, this rule converts the statement to v1 command directly. Otherwise, it converts the statement to v1 command if the resolved table is v1 table, and converts to v2 command if the resolved table is v2 table. Hopefully we can remove this rule eventually when v1 fallback is not needed anymore. ### Why are the changes needed? Reduce duplicated code and make the catalog/table lookup logic consistent. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #25747 from cloud-fan/lookup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-04 16:21:13 +08:00
Gengliang Wang	91747bd91b	[SPARK-29326][SQL] ANSI store assignment policy: throw exception on casting failure ### What changes were proposed in this pull request? 1. With ANSI store assignment policy, an exception is thrown on casting failure 2. Introduce a new expression `AnsiCast` for the ANSI store assignment policy, so that the store assignment policy configuration won't affect the general `Cast`. ### Why are the changes needed? As per ANSI SQL standard, ANSI store assignment policy should throw an exception on insertion failure, such as inserting out-of-range value to a numeric field. ### Does this PR introduce any user-facing change? With ANSI store assignment policy, an exception is thrown on casting failure ### How was this patch tested? Unit test Closes #25997 from gengliangwang/newCast. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-04 15:53:38 +08:00
Sean Owen	7aca0dd658	[SPARK-29296][BUILD][CORE] Remove use of .par to make 2.13 support easier; add scala-2.13 profile to enable pulling in par collections library separately, for the future ### What changes were proposed in this pull request? Scala 2.13 moves the parallel collections classes to a separate library, so first, this establishes a `scala-2.13` profile to bring it back, for future use. However the library enables use of `.par` implicit conversions via a new class that is not in 2.12, which makes cross-building hard. This implements a suggested workaround from https://github.com/scala/scala-parallel-collections/issues/22 to avoid `.par` entirely. ### Why are the changes needed? To compile for 2.13 and later to work with 2.13. ### Does this PR introduce any user-facing change? Should not, no. ### How was this patch tested? Existing tests. Closes #25980 from srowen/SPARK-29296. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-03 08:56:08 -05:00
Henry D	51d6ba7490	[SPARK-28962][SQL] Provide index argument to filter lambda functions ### What changes were proposed in this pull request? Lambda functions to array `filter` can now take as input the index as well as the element. This behavior matches array `transform`. ### Why are the changes needed? See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays. ### Does this PR introduce any user-facing change? Previously filter lambdas had to look like `filter(arr, el -> whatever)` Now, lambdas can take an index argument as well `filter(array, (el, idx) -> whatever)` ### How was this patch tested? I added unit tests to `HigherOrderFunctionsSuite`. Closes #25666 from henrydavidge/filter-idx. Authored-by: Henry D <henrydavidge@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-02 13:03:06 -07:00
Terry Kim	f2ead4d0b5	[SPARK-28970][SQL] Implement USE CATALOG/NAMESPACE for Data Source V2 ### What changes were proposed in this pull request? This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#) It also exposes `currentCatalog` in `CatalogManager`. Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog). ### Why are the changes needed? There is currently no mechanism to change current catalog/namespace thru SQL commands. ### Does this PR introduce any user-facing change? Yes, you can perform the following: ```scala // Sets the current catalog to 'testcat' spark.sql("USE CATALOG testcat") // Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'. spark.sql("USE ns1.ns2 IN testcat") // Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace. spark.sql("SHOW NAMESPACES") ``` ### How was this patch tested? Added new unit tests. Closes #25771 from imback82/use_namespace. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-02 21:55:21 +08:00
Maxim Gekk	e13880128d	[SPARK-29311][SQL] Return seconds with fraction from `date_part()` and `extract` ### What changes were proposed in this pull request? Added new expression `SecondWithFraction` which produces the `seconds` part of timestamps/dates with fractional part containing microseconds. This expression is used only in the `DatePart` expression. As the result, `date_part()` and `extract` return seconds and microseconds as the fractional part of the seconds part when `field` is `SECOND` (or synonyms). ### Why are the changes needed? The `date_part()` and `extract` were added to maintain feature parity with PostgreSQL which has different behavior for the `SECOND` value of the `field` parameter. The fix is needed to behave in the same way. Here is PostgreSQL's output: ```sql # SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001'); date_part ----------- 1.000001 (1 row) ``` ### Does this PR introduce any user-facing change? Yes, type of `date_part('SECOND', ...)` is changed from `INT` to `DECIMAL(8, 6)`. Before: ```sql spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001'); 1 ``` After: ```sql spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001'); 1.000001 ``` ### How was this patch tested? - Added new tests to `DateExpressionSuite` for the `SecondWithFraction` expression - Regenerated results of `date_part.sql`, `extract.sql` and `timestamp.sql` - Updated results of `ExtractBenchmark` Closes #25986 from MaxGekk/extract-seconds-from-timestamp. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-02 11:16:31 +09:00
Dongjoon Hyun	bd031c2173	[SPARK-29307][BUILD][TESTS] Remove scalatest deprecation warnings ### What changes were proposed in this pull request? This PR aims to remove `scalatest` deprecation warnings with the following changes. - `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar` - `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser` - `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers` - `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks` ### Why are the changes needed? According to the Jenkins logs, there are 118 warnings about this. ``` grep "is deprecated" ~/consoleText | grep scalatest | wc -l 118 ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? After Jenkins passes, we need to check the Jenkins log. Closes #25982 from dongjoon-hyun/SPARK-29307. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 21:00:11 -07:00
Dongjoon Hyun	a0b3d7a323	[SPARK-29300][TESTS] Compare `catalyst` and `avro` module benchmark in JDK8/11 ### What changes were proposed in this pull request? This PR regenerates the benchmark results in the `catalyst` and `avro` modules in order to compare JDK8/JDK11 results. ### Why are the changes needed? This PR aims to verify that there is no regression on JDK11. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only update. We need to run the benchmark manually. Closes #25972 from dongjoon-hyun/SPARK-29300. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 17:59:43 -07:00
Sean Owen	e1ea806b30	[SPARK-29291][CORE][SQL][STREAMING][MLLIB] Change procedure-like declaration to function + Unit for 2.13 ### What changes were proposed in this pull request? Scala 2.13 emits a deprecation warning for procedure-like declarations: ``` def foo() { ... ``` This is equivalent to the following, so should be changed to avoid a warning: ``` def foo(): Unit = { ... ``` ### Why are the changes needed? It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files. Unfortunately, that makes this quite a big change. ### Does this PR introduce any user-facing change? No behavior change at all. ### How was this patch tested? Existing tests. Closes #25968 from srowen/SPARK-29291. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 10:03:23 -07:00
Jungtaek Lim (HeartSaVioR)	39eb79ac4b	[SPARK-28074][SS] Log warn message on possible correctness issue for multiple stateful operations in single query ## What changes were proposed in this pull request? Please refer to [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) for the rationale of this patch. This patch adds the functionality to detect the possible correctness issue on multiple stateful operations in a single streaming query and logs a warning message to inform end users. This patch also documents some notes to inform users of caveats when using multiple stateful operations in a single query, and provides one known alternative. ## How was this patch tested? Added new UTs in UnsupportedOperationsSuite to test various combinations of stateful operators on streaming query. Closes #24890 from HeartSaVioR/SPARK-28074. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-30 08:18:23 -05:00
Liang-Chi Hsieh	dd92e15301	[SPARK-29186][SQL] AliasIdentifier should be converted to Json in prettyJson ### What changes were proposed in this pull request? This patch adds AliasIdentifier to the list of classes that should be converted to Json in TreeNode.shouldConvertToJson. ### Why are the changes needed? When asking for the prettyJson of an analyzed query plan that contains SubqueryAlias, the name field of SubqueryAlias is "null", like: ``` [ { "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias", "num-children" : 1, "name" : null, "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.Project", ... ``` It seems the alias name was in the Json before SPARK-19602. This patch fixes it: ``` [ { "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias", "num-children" : 1, "name" : { "product-class" : "org.apache.spark.sql.catalyst.AliasIdentifier", "identifier" : "t1" }, "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.Project", ... ``` ### Does this PR introduce any user-facing change? Yes. This patch changes the null value of the name field of SubqueryAlias in the Json string to the alias identifier. ### How was this patch tested? Added unit test. Closes #25959 from viirya/SPARK-29186. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-09-29 20:00:13 -07:00
Yuming Wang	31700116d2	[SPARK-28476][SQL] Support ALTER DATABASE SET LOCATION ### What changes were proposed in this pull request? Support the syntax of `ALTER (DATABASE|SCHEMA) database_name SET LOCATION` path. Please note that only Hive 3.x metastore supports this syntax. Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL https://issues.apache.org/jira/browse/HIVE-8472 ### Why are the changes needed? Support more syntax. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #25883 from wangyum/SPARK-28476. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-29 11:31:49 -07:00
Jungtaek Lim (HeartSaVioR)	94946e4836	[SPARK-29281][SQL] Correct example of Like/RLike to test the origin intention correctly ### What changes were proposed in this pull request? This patch fixes examples of Like/RLike to test their original intention correctly. The example doesn't consider the default value of spark.sql.parser.escapedStringLiterals: it's false by default. Please take a look at the current example of Like: `d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)` If spark.sql.parser.escapedStringLiterals=false, then it should fail as there's `\U` in pattern (spark.sql.parser.escapedStringLiterals=false by default) but it doesn't fail. ``` The escape character is '\'. If an escape character precedes a special symbol or another escape character, the following character is matched literally. It is invalid to escape any other character. ``` For the query ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%'; ``` SQL parser removes single `\` (not sure that is intended) so the expressions of Like are constructed as follows (I've printed out expression of left and right for Like/RLike): > LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%` which no longer carry the original intention (see left). The query below tests the original intention: ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%'; ``` > LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%` Note that `\\\\` is needed in pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent normal character of `\`. Same for RLIKE: ``` SET spark.sql.parser.escapedStringLiterals=true; SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.'; ``` > RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.` which is OK, but ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.'; ``` > RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.` which no longer carries the original intention. The query below tests the original intention: ``` SET spark.sql.parser.escapedStringLiterals=true; SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.'; ``` > RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.` ### Why are the changes needed? Because the example doesn't test the original intention. Spark is now running automated tests from these examples, so now it's not only a documentation issue but also a test issue. ### Does this PR introduce any user-facing change? No, as it only corrects documentation. ### How was this patch tested? Added debug log (like above) and ran queries from `spark-sql`. Closes #25957 from HeartSaVioR/SPARK-29281. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-29 03:05:49 +09:00
Maxim Gekk	4dd0066d40	[SPARK-21914][SQL][TESTS] Check results of expression examples ### What changes were proposed in this pull request? New test compares outputs of expression examples in comments with results of `hiveResultString()`. Also I fixed existing examples where actual and expected outputs are different. ### Why are the changes needed? This prevents mistakes in expression examples, and fixes existing mistakes in comments. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add new test to `SQLQuerySuite`. Closes #25942 from MaxGekk/run-expr-examples. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-27 21:30:37 +09:00
Yuanjian Li	ada3ad34c6	[SPARK-29175][SQL] Make additional remote maven repository in IsolatedClientLoader configurable ### What changes were proposed in this pull request? Added a new config "spark.sql.additionalRemoteRepositories", a comma-delimited string config of the optional additional remote maven mirror. ### Why are the changes needed? We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars, end-users can set this config if the default maven central repo is unreachable. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #25849 from xuanyuanking/SPARK-29175. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-26 20:57:44 -07:00
Gengliang Wang	a1213d5f96	[SPARK-28997][SQL] Add `spark.sql.dialect` ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL. In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect. After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will: 1. perform integral division with the / operator if both sides are integral types; 2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. ### Why are the changes needed? Unify the external database dialect with one configuration, instead of small flags. ### Does this PR introduce any user-facing change? A new configuration `spark.sql.dialect` for choosing a database dialect. ### How was this patch tested? Existing tests. Closes #25697 from gengliangwang/dialect. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 21:00:27 +08:00
Burak Yavuz	c8159c7941	[SPARK-29197][SQL] Remove saveModeForDSV2 from DataFrameWriter ### What changes were proposed in this pull request? It is very confusing that the default save mode is different between the internal implementation of a Data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table, and therefore create one. That will come in a future PR. ### Why are the changes needed? Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark. ### Does this PR introduce any user-facing change? It changes the default save mode for V2 Tables in the DataFrameWriter APIs ### How was this patch tested? Existing tests Closes #25876 from brkyvz/removeSM. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 15:20:04 +08:00
Liang-Chi Hsieh	b8b59d6fa3	[SPARK-29239][SPARK-29221][SQL] Subquery should not cause NPE when eliminating subexpression ### What changes were proposed in this pull request? This patch proposes to skip PlanExpression when doing subexpression elimination on executors. ### Why are the changes needed? Subexpression elimination can cause an NPE when applied to an execution subquery expression such as ScalarSubquery on executors. This is because PlanExpression wraps a query plan, and comparing query plans on an executor during subexpression elimination can cause unexpected errors, such as an NPE when accessing transient fields. The NPE looks like: ``` [info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds) [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID 3447, 10.0.0.196, executor driver): java.lang.NullPointerException [info] at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548) [info] at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36) [info] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425) [info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102) [info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added unit test. Closes #25925 from viirya/SPARK-29239. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 13:55:01 +08:00
Maxim Gekk	21db2f86f7	[SPARK-29237][SQL] Prevent real function names in expression example template ### What changes were proposed in this pull request? In the PR, I propose to replace function names in some expression examples with `_FUNC_`, and add a test to check that `_FUNC_` is always present in all examples. ### Why are the changes needed? Binding of a function name to an expression is performed in `FunctionRegistry`, which is the single source of truth. Expression examples should avoid using the function name directly because this can make the examples invalid in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a new test to `SQLQuerySuite` which analyses expression examples and checks for the presence of `_FUNC_`. Closes #25924 from MaxGekk/fix-func-examples. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-25 15:16:00 -07:00
Wenchen Fan	a36a7235db	[SPARK-29215][SQL] current namespace should be tracked in SessionCatalog if the current catalog is session catalog ### What changes were proposed in this pull request? when the current catalog is session catalog, get/set the current namespace from/to the `SessionCatalog`. ### Why are the changes needed? It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`. Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog. Thus, we can't use `CatalogManager` to track the current namespace of session catalog because it changes when the current catalog is changed. To keep single source of truth, we should only track the current namespace of session catalog in `SessionCatalog`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Newly added and updated test cases. Closes #25903 from cloud-fan/current. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-09-25 17:01:36 +08:00
Xiao Li	7c02c143aa	[SPARK-28292][SQL] Enable Injection of User-defined Hint ### What changes were proposed in this pull request? Move the rule `RemoveAllHints` after the batch `Resolution`. ### Why are the changes needed? User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test case Closes #25746 from gatorsmile/moveRemoveAllHints. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-24 18:04:17 +08:00
sheepstop	81de9d3c29	[SPARK-28678][DOC] Specify that array indices start at 1 for function slice in R Scala Python ### What changes were proposed in this pull request? Added "array indices start at 1" to the annotation to clarify the usage of the function slice in the R, Scala, and Python components. ### Why are the changes needed? The function throws an exception if the start value is 0, whereas array indices start at 0 in most other scenarios. ### Does this PR introduce any user-facing change? Yes, more info provided to user. ### How was this patch tested? No tests added, only doc change. Closes #25704 from sheepstop/master. Authored-by: sheepstop <yangting617@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-24 18:57:54 +09:00
windpiger	da7e5c4ffb	[SPARK-19917][SQL] qualified partition path stored in catalog ## What changes were proposed in this pull request? The partition path should be qualified before it is stored in the catalog. There are several scenarios: 1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' should be qualified: file:/path/x Hive 2.0.0 does not support a location without a scheme here. ``` FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. {0} is not absolute or has no scheme information. Please specify a complete absolute uri with scheme information. ``` 2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' should be qualified: file:/tablelocation/x Hive 2.0.0 does not support a relative location here. 3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x' should be qualified: file:/path/x the same as Hive 2.0.0 4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x' should be qualified: file:/tablelocation/x the same as Hive 2.0.0 Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for a hive serde table has the expected qualified path. We should make the other scenarios consistent with it. Another change is for alter table location. ## How was this patch tested? add / modify existing TestCases Closes #17254 from windpiger/qualifiedPartitionPath. Authored-by: windpiger <songjun@outlook.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-24 14:48:47 +08:00
xy_xin	655356e825	[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan ### What changes were proposed in this pull request? This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement: ``` UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate? ``` ### Why are the changes needed? With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New test cases added. Closes #25626 from xianyinxin/SPARK-28892. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-23 19:25:56 +08:00
angerszhu	fe4bee8fd8	[SPARK-29162][SQL] Simplify NOT(IsNull(x)) and NOT(IsNotNull(x)) ### What changes were proposed in this pull request? Rewrite ``` NOT isnull(x) -> isnotnull(x) NOT isnotnull(x) -> isnull(x) ``` ### Why are the changes needed? Makes the LogicalPlan more readable and more useful for query canonicalization: equivalent conditions now compare as equal when checking whether canonicalized queries are equal. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Newly added UTs. Closes #25878 from AngersZhuuuu/SPARK-29162. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-22 11:17:47 -07:00
Maxim Gekk	051e691029	[SPARK-28141][SQL] Support special date values ### What changes were proposed in this pull request? Supported special string values for `DATE` type. They are simply notational shorthands that will be converted to ordinary date values when read. The following string values are supported: - `epoch [zoneId]` - `1970-01-01` - `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`. - `yesterday [zoneId]` - the current date -1 - `tomorrow [zoneId]` - the current date + 1 - `now` - the date of running the current query. It has the same notion as `today`. For example: ```sql spark-sql> SELECT date 'tomorrow' - date 'yesterday'; 2 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html) ### Does this PR introduce any user-facing change? Previously, the parser fails on the special values with the error: ```sql spark-sql> select date 'today'; Error in query: Cannot parse the DATE value: today(line 1, pos 7) ``` After the changes, the special values are converted to appropriate dates: ```sql spark-sql> select date 'today'; 2019-09-06 ``` ### How was this patch tested? - Added tests to `DateFormatterSuite` to check parsing special values from regular strings. - Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String` - Uncommented tests in `date.sql` Closes #25708 from MaxGekk/datetime-special-values. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-22 17:31:33 +09:00
Maxim Gekk	89bad267d4	[SPARK-29200][SQL] Optimize `extract`/`date_part` for epoch ### What changes were proposed in this pull request? Refactoring of the `DateTimeUtils.getEpoch()` function by avoiding decimal operations that are pretty expensive, and converting the final result to the decimal type at the end. ### Why are the changes needed? The changes improve performance of the `getEpoch()` method by about 20 times. Before: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 256 277 33 39.0 25.6 1.0X EPOCH of timestamp 23455 23550 131 0.4 2345.5 0.0X ``` After: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 255 294 34 39.2 25.5 1.0X EPOCH of timestamp 1049 1054 9 9.5 104.9 0.2X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing tests from `DateExpressionsSuite`. Closes #25881 from MaxGekk/optimize-extract-epoch. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-22 16:59:59 +09:00
Maxim Gekk	3be5741029	[SPARK-29190][SQL] Optimize `extract`/`date_part` for the milliseconds `field` ### What changes were proposed in this pull request? Changed the `DateTimeUtils.getMilliseconds()` by avoiding the decimal division, and replacing it by setting scale and precision while converting microseconds to the decimal type. ### Why are the changes needed? This improves performance of `extract` and `date_part()` by more than 50 times: Before: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 397 428 45 25.2 39.7 1.0X MILLISECONDS of timestamp 36723 36761 63 0.3 3672.3 0.0X ``` After: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 278 284 6 36.0 27.8 1.0X MILLISECONDS of timestamp 592 606 13 16.9 59.2 0.5X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suite - `DateExpressionsSuite` Closes #25871 from MaxGekk/optimize-epoch-millis. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-21 21:11:31 -07:00
Jungtaek Lim (HeartSaVioR)	f7cc695808	[SPARK-29140][SQL] Handle parameters having "array" of javaType properly in splitAggregateExpressions ### What changes were proposed in this pull request? This patch fixes the issue brought by [SPARK-21870](http://issues.apache.org/jira/browse/SPARK-21870): when generating code for a parameter type, it doesn't consider array types in javaType. There is at least one such case: Spark should generate code for BinaryType as `byte[]`, but Spark creates the code for BinaryType as `[B`, and the generated code fails compilation. Below is the generated code which failed compilation (Line 380): ``` /* 380 */ private void agg_doAggregate_count_0([B agg_expr_1_1, boolean agg_exprIsNull_1_1, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_1) throws java.io.IOException { /* 381 */ // evaluate aggregate function for count /* 382 */ boolean agg_isNull_26 = false; /* 383 */ long agg_value_28 = -1L; /* 384 */ if (!false && agg_exprIsNull_1_1) { /* 385 */ long agg_value_31 = agg_unsafeRowAggBuffer_1.getLong(1); /* 386 */ agg_isNull_26 = false; /* 387 */ agg_value_28 = agg_value_31; /* 388 */ } else { /* 389 */ long agg_value_33 = agg_unsafeRowAggBuffer_1.getLong(1); /* 390 */ /* 391 */ long agg_value_32 = -1L; /* 392 */ /* 393 */ agg_value_32 = agg_value_33 + 1L; /* 394 */ agg_isNull_26 = false; /* 395 */ agg_value_28 = agg_value_32; /* 396 */ } /* 397 */ // update unsafe row buffer /* 398 */ agg_unsafeRowAggBuffer_1.setLong(1, agg_value_28); /* 399 */ } ``` There wasn't any test for HashAggregateExec specifically testing this, but the randomized test in ObjectHashAggregateSuite could encounter this, and that's why ObjectHashAggregateSuite is flaky. ### Why are the changes needed? Without the fix, generated code from HashAggregateExec may fail compilation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new UT. Without the fix, newly added UT fails. Closes #25830 from HeartSaVioR/SPARK-29140. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-21 16:29:23 +09:00
Maxim Gekk	252b6cf3c9	[SPARK-29187][SQL] Return null from `date_part()` for the null `field` ### What changes were proposed in this pull request? In the PR, I propose to change the behavior of the `date_part()` function in handling a `null` field, and make it the same as in PostgreSQL. If the `field` parameter is `null`, the function should return `null` of the `double` type as PostgreSQL does: ```sql # select date_part(null, date '2019-09-20'); date_part ----------- (1 row) # select pg_typeof(date_part(null, date '2019-09-20')); pg_typeof ------------------ double precision (1 row) ``` ### Why are the changes needed? The `date_part()` function was added to maintain feature parity with PostgreSQL, but the current behavior of the function differs in handling null as `field`. ### Does this PR introduce any user-facing change? Yes. Before: ```sql spark-sql> select date_part(null, date'2019-09-20'); Error in query: null; line 1 pos 7 ``` After: ```sql spark-sql> select date_part(null, date'2019-09-20'); NULL ``` ### How was this patch tested? Add new tests to `DateFunctionsSuite` for 2 cases: - `field` = `null`, `source` = a date literal - `field` = `null`, `source` = a date column Closes #25865 from MaxGekk/date_part-null. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-20 20:28:56 -07:00
Takeshi Yamamuro	ec8a1a8e88	[SPARK-29122][SQL] Propagate all the SQL conf to executors in SQLQueryTestSuite ### What changes were proposed in this pull request? This PR propagates all the SQL configurations to executors in `SQLQueryTestSuite`. When the propagation is enabled in the tests, the potential bug below becomes apparent: ``` CREATE TABLE num_data (id int, val decimal(38,10)) USING parquet; .... select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4): QueryOutput(select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4),struct<>,java.lang.IllegalArgumentException [info] requirement failed: MutableProjection cannot use UnsafeRow for output data types: decimal(38,0)) (SQLQueryTestSuite.scala:380) ``` The root cause is that `InterpretedMutableProjection` has incorrect validation in the interpreter mode: `validExprs.forall { case (e, _) => UnsafeRow.isFixedLength(e.dataType) }`. This validation should be the same as the condition (`isMutable`) in `HashAggregate.supportsAggregate`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L1126 ### Why are the changes needed? Bug fixes. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added tests in `AggregationQuerySuite` Closes #25831 from maropu/SPARK-29122. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-20 21:41:09 +09:00
Ryan Blue	2c775f418f	[SPARK-28612][SQL] Add DataFrameWriterV2 API ## What changes were proposed in this pull request? This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API: * Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans. * Only creates v2 logical plans so the behavior is always consistent. * Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`. Here are a few example uses of the new API: ```scala df.writeTo("catalog.db.table").append() df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01") df.writeTo("catalog.db.table").overwritePartitions() df.writeTo("catalog.db.table").asParquet.create() df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace() df.writeTo("catalog.db.table").using("abc").replace() ``` ## How was this patch tested? Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans. Closes #25681 from rdblue/SPARK-28612-add-data-frame-writer-v2. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-09-19 13:32:09 -07:00
Jungtaek Lim (HeartSaVioR)	eee2e026bb	[SPARK-29165][SQL][TEST] Set log level of log generated code as ERROR in case of compile error on generated code in UT ### What changes were proposed in this pull request? This patch proposes to change the log level used for logging generated code when a compile error occurs in UT. This makes compilation issues in generated code easier to investigate: currently the exception message gives a line number, but the generated code itself is not logged (in most UTs the log level threshold is at least WARN). ### Why are the changes needed? This helps investigate compilation errors in generated code in UT. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #25835 from HeartSaVioR/MINOR-always-log-generated-code-on-fail-to-compile-in-unit-testing. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 11:47:47 -07:00
Sean Owen	c5d8a51f3b	[MINOR][BUILD] Fix about 15 misc build warnings ### What changes were proposed in this pull request? This addresses about 15 miscellaneous warnings that appear in the current build. ### Why are the changes needed? No functional changes, it just slightly reduces the amount of extra warning output. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests, run manually. Closes #25852 from srowen/BuildWarnings. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 11:37:42 -07:00
Gengliang Wang	b917a6593d	[SPARK-28989][SQL] Add a SQLConf `spark.sql.ansi.enabled` ### What changes were proposed in this pull request? Currently, there are new configurations for compatibility with ANSI SQL: * `spark.sql.parser.ansi.enabled` * `spark.sql.decimalOperations.nullOnOverflow` * `spark.sql.failOnIntegralTypeOverflow` This PR is to add new configuration `spark.sql.ansi.enabled` and remove the 3 options above. When the configuration is true, Spark tries to conform to the ANSI SQL specification. It will be disabled by default. ### Why are the changes needed? Make it simple and straightforward. ### Does this PR introduce any user-facing change? The new features for ANSI compatibility will be set via one configuration `spark.sql.ansi.enabled`. ### How was this patch tested? Existing unit tests. Closes #25693 from gengliangwang/ansiEnabled. Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-18 22:30:28 -07:00
Yuming Wang	8c3f27ceb4	[SPARK-28683][BUILD] Upgrade Scala to 2.12.10 ## What changes were proposed in this pull request? This PR upgrades Scala to 2.12.10. Release notes: - Fix regression in large string interpolations with non-String typed splices - Revert "Generate shallower ASTs in pattern translation" - Fix regression in classpath when JARs have 'a.b' entries beside 'a/b' - Faster compiler: 5–10% faster since 2.12.8 - Improved compatibility with JDK 11, 12, and 13 - Experimental support for build pipelining and outline type checking More details: https://github.com/scala/scala/releases/tag/v2.12.10 https://github.com/scala/scala/releases/tag/v2.12.9 ## How was this patch tested? Existing tests Closes #25404 from wangyum/SPARK-28683. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 13:30:36 -07:00
John Zhuge	ee94b5d701	[SPARK-29030][SQL] Simplify lookupV2Relation ## What changes were proposed in this pull request? Simplify the return type for `lookupV2Relation` which makes the 3 callers more straightforward. ## How was this patch tested? Existing unit tests. Closes #25735 from jzhuge/lookupv2relation. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-09-18 09:27:11 -07:00
sandeep katta	376e17c082	[SPARK-29101][SQL] Fix count API for csv file when DROPMALFORMED mode is selected ### What changes were proposed in this pull request? Given the dataset `fruit.csv` with the rows `fruit,color,price,quantity`, `apple,red,1,3`, `banana,yellow,2,4`, `orange,orange,3,5`, and `xxx`, this PR aims to fix the count below: ``` scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false) scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count res1: Long = 4 ``` This is caused by the issue [SPARK-24645](https://issues.apache.org/jira/browse/SPARK-24645). The SPARK-24645 issue can also be solved by [SPARK-25387](https://issues.apache.org/jira/browse/SPARK-25387) ### Why are the changes needed? SPARK-24645 caused this regression, so its code is reverted, as that issue can also be solved by SPARK-25387 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT, and also tested the bug SPARK-24645 SPARK-24645 regression ![image](https://user-images.githubusercontent.com/35216143/65067957-4c08ff00-d9a5-11e9-8d43-a4a23a61e8b8.png) Closes #25820 from sandeep-katta/SPARK-29101. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 23:33:13 +09:00
Maxim Gekk	c2734ab1fc	[SPARK-29012][SQL] Support special timestamp values ### What changes were proposed in this pull request? Supported special string values for `TIMESTAMP` type. They are simply notational shorthands that will be converted to ordinary timestamp values when read. The following string values are supported: - `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)` - `today [zoneId]` - midnight today. - `yesterday [zoneId]` -midnight yesterday - `tomorrow [zoneId]` - midnight tomorrow - `now` - current query start time. For example: ```sql spark-sql> SELECT timestamp 'tomorrow'; 2019-09-07 00:00:00 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html) ### Does this PR introduce any user-facing change? Previously, the parser fails on the special values with the error: ```sql spark-sql> select timestamp 'today'; Error in query: Cannot parse the TIMESTAMP value: today(line 1, pos 7) ``` After the changes, the special values are converted to appropriate dates: ```sql spark-sql> select timestamp 'today'; 2019-09-06 00:00:00 ``` ### How was this patch tested? - Added tests to `TimestampFormatterSuite` to check parsing special values from regular strings. - Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String` - Uncommented tests in `timestamp.sql` Closes #25716 from MaxGekk/timestamp-special-values. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 23:30:59 +09:00
Gengliang Wang	3da2786dc6	[SPARK-29096][SQL] The exact math method should be called only when there is a corresponding function in Math ### What changes were proposed in this pull request? 1. After https://github.com/apache/spark/pull/21599, if the option "spark.sql.failOnIntegralTypeOverflow" is enabled, all the binary arithmetic operators will use the exact version of the function. However, only `Add`/`Subtract`/`Multiply` have a corresponding exact function in java.lang.Math. When the option "spark.sql.failOnIntegralTypeOverflow" is enabled, a runtime exception "BinaryArithmetics must override either exactMathMethod or genCode" is thrown if the other binary arithmetic operators are used, such as "Divide" or "Remainder". The exact math method should be called only when there is a corresponding function in `java.lang.Math` 2. Revise the log output of casting to `Int`/`Short` 3. Enable `spark.sql.failOnIntegralTypeOverflow` for pgSQL tests in `SQLQueryTestSuite`. ### Why are the changes needed? 1. Fix the bugs of https://github.com/apache/spark/pull/21599 2. The test case of pgSQL intends to check the overflow of integer/long type. We should enable `spark.sql.failOnIntegralTypeOverflow`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #25804 from gengliangwang/enableIntegerOverflowInSQLTest. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-18 16:59:17 +08:00
s71955	4559a82a1d	[SPARK-28930][SQL] Last Access Time value shall display 'UNKNOWN' in all clients What changes were proposed in this pull request? Issue 1: modifications are not required, as these are different formats for the same info. In the case of a Spark DataFrame, null is correct. Issue 2 mentioned in JIRA: Spark SQL "desc formatted tablename" is not showing the header # col_name,data_type,comment; the header seems to have been removed deliberately as part of SPARK-20954. Issue 3: Corrected the Last Access time; the value shall display 'UNKNOWN', as the system currently won't support last access time evaluation, since Hive was setting the last access time to '0' in the metastore even though the Spark CatalogTable last access time value is set to -1. This affects the validation logic for LastAccessTime, in which Spark displays 'UNKNOWN' if the last access time value is -1 (meaning not evaluated). Does this PR introduce any user-facing change? No How was this patch tested? Locally and corrected a ut. Attaching the test report below ![SPARK-28930](https://user-images.githubusercontent.com/12999161/64484908-83a1d980-d236-11e9-8062-9facf3003e5e.PNG) Closes #25720 from sujith71955/master_describe_info. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 12:54:44 +09:00
Chris Martin	05988b256e	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs ### What changes were proposed in this pull request? Adds a new cogroup Pandas UDF. This allows two grouped dataframes to be cogrouped together and applies a (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame UDF to each cogroup. Example usage ``` import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType df1 = spark.createDataFrame( [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)], ("time", "id", "v1")) df2 = spark.createDataFrame( [(20000101, 1, "x"), (20000101, 2, "y")], ("time", "id", "v2")) @pandas_udf("time int, id int, v1 double, v2 string", PandasUDFType.COGROUPED_MAP) def asof_join(l, r): return pd.merge_asof(l, r, on="time", by="id") df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show() ``` +--------+---+---+---+ \| time\| id\| v1\| v2\| +--------+---+---+---+ \|20000101\| 1\|1.0\| x\| \|20000102\| 1\|3.0\| x\| \|20000101\| 2\|2.0\| y\| \|20000102\| 2\|4.0\| y\| +--------+---+---+---+ ### How was this patch tested? Added unit test test_pandas_udf_cogrouped_map Closes #24981 from d80tb7/SPARK-27463-poc-arrow-stream. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-09-17 17:13:50 -07:00
xy_xin	3fc52b5557	[SPARK-28950][SQL] Refine the code of DELETE ### What changes were proposed in this pull request? This PR refines the code of DELETE: 1. makes `whereClause` optional, in which case DELETE deletes all of the data of a table; 2. adds more test cases; 3. makes some other refinements. This is a follow-up of SPARK-28351. ### Why are the changes needed? An optional where clause in DELETE respects the SQL standard. ### Does this PR introduce any user-facing change? Yes. But since this is a non-released feature, this change does not have any end-user effects. ### How was this patch tested? A new test case is added. Closes #25652 from xianyinxin/SPARK-28950. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-18 01:14:14 +08:00
Maxim Gekk	db996ccad9	[SPARK-29074][SQL] Optimize `date_format` for foldable `fmt` ### What changes were proposed in this pull request? In the PR, I propose to create an instance of `TimestampFormatter` only once at the initialization, and reuse it inside of `nullSafeEval()` and `doGenCode()` in the case when the `fmt` parameter is foldable. ### Why are the changes needed? The changes improve performance of the `date_format()` function. Before: ``` format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ format date wholestage off 7180 / 7181 1.4 718.0 1.0X format date wholestage on 7051 / 7194 1.4 705.1 1.0X ``` After: ``` format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ format date wholestage off 4787 / 4839 2.1 478.7 1.0X format date wholestage on 4736 / 4802 2.1 473.6 1.0X ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? By existing test suites `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #25782 from MaxGekk/date_format-foldable. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-17 16:00:10 +09:00
Liang-Chi Hsieh	dffd92e977	[SPARK-29100][SQL] Fix compilation error in codegen with switch from InSet expression ### What changes were proposed in this pull request? When InSet generates Java switch-based code and the input set is empty, we don't generate a switch condition, but a simple expression that is the default case of the original switch. ### Why are the changes needed? SPARK-26205 adds an optimization to InSet that generates a Java switch condition for certain cases. When the given set is empty, it is possible that codegen causes a compilation error: ``` [info] - SPARK-29100: InSet with empty input set * FAILED * (58 milliseconds) [info] Code generation of input[0, int, true] INSET () failed: [info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1 [info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1 [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308) [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386) [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383) ``` ### Does this PR introduce any user-facing change? Yes. Previously, when users had InSet against an empty set, the generated code caused a compilation error. This patch fixes it. ### How was this patch tested? Unit test added. Closes #25806 from viirya/SPARK-29100. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-17 11:06:10 +08:00
Takeshi Yamamuro	95073fb62b	[SPARK-29008][SQL] Define an individual method for each common subexpression in HashAggregateExec ### What changes were proposed in this pull request? This pr proposes to define an individual method for each common subexpression in HashAggregateExec. In the current master, the common subexpr elimination code in HashAggregateExec is expanded in a single method; `4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)` The method size can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, in a query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`, the current master generates; ``` /* 098 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException { /* 099 */ // do aggregate /* 100 */ // common sub-expressions /* 101 */ int agg_value_6 = -1; /* 102 */ /* 103 */ agg_value_6 = agg_expr_0_0 + agg_expr_1_0; /* 104 */ /* 105 */ int agg_value_5 = -1; /* 106 */ /* 107 */ agg_value_5 = agg_value_6 + agg_expr_2_0; /* 108 */ boolean agg_isNull_4 = false; /* 109 */ long agg_value_4 = -1L; /* 110 */ if (!false) { /* 111 */ agg_value_4 = (long) agg_value_5; /* 112 */ } /* 113 */ int agg_value_10 = -1; /* 114 */ /* 115 */ agg_value_10 = agg_expr_0_0 + agg_expr_1_0; /* 116 */ // evaluate aggregate functions and update aggregation buffers /* 117 */ agg_doAggregate_sum_0(agg_value_10); /* 118 */ agg_doAggregate_avg_0(agg_value_4, agg_isNull_4); /* 119 */ /* 120 */ } ``` On the other hand, this pr generates; ``` /* 121 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException { /* 122 */ // do aggregate /* 123 */ // common sub-expressions /* 124 */ long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0); /* 125 */ int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0); /* 126 */ // evaluate aggregate functions and update aggregation buffers /* 127 */ agg_doAggregate_sum_0(agg_subExprValue_1); /* 128 */ agg_doAggregate_avg_0(agg_subExprValue_0); /* 129 */ /* 130 */ } ``` I ran some micro benchmarks for this pr; ``` (base) maropu~:$system_profiler SPHardwareDataType Hardware: Hardware Overview: Processor Name: Intel Core i5 Processor Speed: 2 GHz Number of Processors: 1 Total Number of Cores: 2 L2 Cache (per Core): 256 KB L3 Cache: 4 MB Memory: 8 GB (base) maropu~:$java -version java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) (base) maropu~:$ /bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v val numCols = 40 val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" } spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t") val aggExprs = (2 until numCols).map { i => (0 until i).map(d => s"_c$d") .mkString("AVG(", " + ", ")") } // Drops the time of a first run then pick that of a second run timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() } // the master maxCodeGen: 12957 Elapsed time: 36.309858661s // this pr maxCodeGen=4184 Elapsed time: 2.399490285s ``` ### Why are the changes needed? To avoid the too-long-function issue in JVMs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `WholeStageCodegenSuite` Closes #25710 from maropu/SplitSubexpr. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-17 11:09:55 +09:00
Takeshi Yamamuro	6297287dfa	[SPARK-29061][SQL] Prints bytecode statistics in debugCodegen ### What changes were proposed in this pull request? This pr proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in debug prints, `debugCodegen`. Since these metrics are critical for codegen framework development, I think it's worth printing them there. This pr intends to enable `debugCodegen` to print these metrics as follows; ``` scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) == ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L]) +- (1) LocalTableScan [v#0] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } ... ``` ### Why are the changes needed? For efficient development ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested Closes #25766 from maropu/PrintBytecodeStats. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-16 21:48:07 +08:00
Wenchen Fan	1b99d0cca4	[SPARK-29069][SQL] ResolveInsertInto should not do table lookup ### What changes were proposed in this pull request? It's more clear to only do table lookup in `ResolveTables` rule (for v2 tables) and `ResolveRelations` rule (for v1 tables). `ResolveInsertInto` should only resolve the `InsertIntoStatement` with resolved relations. ### Why are the changes needed? to make the code simpler ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #25774 from cloud-fan/simplify. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-16 09:46:34 +09:00
changchun.wang	b91648cfd0	[SPARK-28856][FOLLOW-UP][SQL][TEST] Add the `namespaces` keyword to TableIdentifierParserSuite ### What changes were proposed in this pull request? This PR adds the `namespaces` keyword to `TableIdentifierParserSuite`. ### Why are the changes needed? Improve the test. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25758 from highmoutain/3.0bugfix. Authored-by: changchun.wang <251922566@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:11:38 -07:00
Jungtaek Lim (HeartSaVioR)	61e5aebce3	[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping ### What changes were proposed in this pull request? This patch fixes an NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null because the SparkContext is stopping. The logic doesn't seem to consider that the active SparkContext could be in the process of stopping. Note that this is not easy to hit, as SparkContext.stop() blocks the main thread, but there are many cases in which SQLConf.get is accessed concurrently while SparkContext.stop() is executing: users run other threads, or a listener accesses SQLConf.get after dagScheduler is set to null (this is the case I encountered). ### Why are the changes needed? The bug causes an NPE. ### Does this PR introduce any user-facing change? No ### How was this patch tested? The previous patch #25753 was tested with a new UT; because that test disrupted other tests in concurrent test runs, it is excluded from this patch. Closes #25790 from HeartSaVioR/SPARK-29046-v2. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:04:56 -07:00
Maxim Gekk	1b7afc0c98	[SPARK-28471][SQL][DOC][FOLLOWUP] Fix year patterns in the comments of date-time expressions ### What changes were proposed in this pull request? In the PR, I propose to fix comments of date-time expressions, and replace the `yyyy` pattern by `uuuu` when the implementation supposes the former one. ### Why are the changes needed? To make comments consistent to implementations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running Scala Style checker. Closes #25796 from MaxGekk/year-pattern-uuuu-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:02:15 -07:00
Dongjoon Hyun	13b77e52d2	Revert "[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping" This reverts commit `850833fa17`.	2019-09-14 00:09:45 -07:00
Wenchen Fan	053dd858d3	[SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes ### What changes were proposed in this pull request? reorganize the packages of DS v2 interfaces/classes: 1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes 2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes 3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes 4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes ### Why are the changes needed? Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25700 from cloud-fan/package. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-12 19:59:34 +08:00
Jungtaek Lim (HeartSaVioR)	850833fa17	[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping ### What changes were proposed in this pull request? This patch fixes an NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null because the SparkContext is stopping. The logic doesn't seem to consider that the active SparkContext could be in the process of stopping. Note that this is not easy to hit, as `SparkContext.stop()` blocks the main thread, but there are many cases in which SQLConf.get is accessed concurrently while SparkContext.stop() is executing: users run other threads, or a listener accesses SQLConf.get after dagScheduler is set to null (this is the case I encountered). ### Why are the changes needed? The bug causes an NPE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a new UT to verify that the NPE doesn't occur. Without the patch, the test fails by throwing an NPE. Closes #25753 from HeartSaVioR/SPARK-29046. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-12 11:16:33 +09:00
Wenchen Fan	eec728a0d4	[SPARK-29057][SQL] remove InsertIntoTable ### What changes were proposed in this pull request? Remove `InsertIntoTable` and replace its usage with `InsertIntoStatement` ### Why are the changes needed? `InsertIntoTable` and `InsertIntoStatement` are almost identical (except for some naming). It doesn't make sense to keep 2 identical plans. After the removal of `InsertIntoTable`, the analysis process becomes: 1. parser creates `InsertIntoStatement` 2. v2 rule `ResolveInsertInto` converts `InsertIntoStatement` to v2 commands. 3. v1 rules like `DataSourceAnalysis` and `HiveAnalysis` convert `InsertIntoStatement` to v1 commands. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25763 from cloud-fan/remove. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-12 09:24:36 +09:00
Mick Jermsurawong	fa75db2059	[SPARK-29026][SQL] Improve error message in `schemaFor` in trait without companion object constructor ### What changes were proposed in this pull request? - For trait without companion object constructor, currently the method to get constructor parameters `constructParams` in `ScalaReflection` will throw exception. ``` scala.ScalaReflectionException: <none> is not a term at scala.reflect.api.Symbols$SymbolApi.asTerm(Symbols.scala:211) at scala.reflect.api.Symbols$SymbolApi.asTerm$(Symbols.scala:211) at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:106) at org.apache.spark.sql.catalyst.ScalaReflection.getCompanionConstructor(ScalaReflection.scala:909) at org.apache.spark.sql.catalyst.ScalaReflection.constructParams(ScalaReflection.scala:914) at org.apache.spark.sql.catalyst.ScalaReflection.constructParams$(ScalaReflection.scala:912) at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:47) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:890) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:886) at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:47) ``` - Instead this PR would throw exception: ``` Unable to find constructor for type [XXX]. This could happen if [XXX] is an interface or a trait without companion object constructor UnsupportedOperationException: ``` In the normal usage of ExpressionEncoder, this can happen if the type is interface extending `scala.Product`. Also, since this is a protected method, this could have been other arbitrary types without constructor. ### Why are the changes needed? - The error message `<none> is not a term` isn't helpful for users to understand the problem. ### Does this PR introduce any user-facing change? 
- The exception would be thrown instead of runtime exception from the `scala.ScalaReflectionException`. ### How was this patch tested? - Added a unit test to illustrate the `type` where expression encoder will fail and trigger the proposed error message. Closes #25736 from mickjermsurawong-stripe/SPARK-29026. Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-11 08:43:40 +09:00
Terry Kim	bf43541c92	[SPARK-28856][SQL] Implement SHOW DATABASES for Data Source V2 Tables ### What changes were proposed in this pull request? Implement the SHOW DATABASES logical and physical plans for data source v2 tables. ### Why are the changes needed? To support `SHOW DATABASES` SQL commands for v2 tables. ### Does this PR introduce any user-facing change? `spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set: ``` +---------------+ \| namespace\| +---------------+ \| ns1\| \| ns1.ns1_1\| \|ns1.ns1_1.ns1_2\| +---------------+ ``` ### How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25601 from imback82/show_databases. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-10 21:23:57 +08:00
gengjiaan	aafce7ebff	[SPARK-28412][SQL] ANSI SQL: OVERLAY function support byte array ## What changes were proposed in this pull request? This is an ANSI SQL feature; its feature id is `T312` ``` <binary overlay function> ::= OVERLAY <left paren> <binary value expression> PLACING <binary value expression> FROM <start position> [ FOR <string length> ] <right paren> ``` This PR is related to https://github.com/apache/spark/pull/24918 and adds support for byte arrays. ref: https://www.postgresql.org/docs/11/functions-binarystring.html ## How was this patch tested? new UT. Below are some examples of this PR running in my production environment. ``` spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6); Spark_SQL Time taken: 0.285 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7); Spark CORE Time taken: 0.202 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0); Spark ANSI SQL Time taken: 0.165 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4); Structured SQL Time taken: 0.141 s ``` Closes #25172 from beliefer/ansi-overlay-byte-array. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-10 08:16:18 +09:00
Marco Gaido	3d6b33a49a	[SPARK-28939][SQL] Propagate SQLConf for plans executed by toRdd ### What changes were proposed in this pull request? The PR proposes to create a custom `RDD` which makes it possible to propagate `SQLConf` in cases not tracked by SQL execution as well, as happens when a `Dataset` is converted to an RDD using either `.rdd` or `.queryExecution.toRdd` and the returned RDD is then used to invoke actions. In this way, SQL configs are effective in these cases too, while earlier they were ignored. ### Why are the changes needed? Without this patch, whenever `.rdd` or `.queryExecution.toRdd` is used, all the SQL configs that have been set are ignored. An example of a reproducer can be: ``` withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") { val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*) df.createOrReplaceTempView("spark64kb") val data = spark.sql("select * from spark64kb limit 10") // Subexpression elimination is used here, even though it should have been disabled data.describe() } ``` ### Does this PR introduce any user-facing change? When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the execute. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy. ### How was this patch tested? added UT Closes #25643 from mgaido91/SPARK-28939. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 21:20:34 +08:00
Wenchen Fan	abec6d7763	[SPARK-28341][SQL] create a public API for V2SessionCatalog ## What changes were proposed in this pull request? The `V2SessionCatalog` has 2 functionalities: 1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`. 2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog). To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage. This PR does 2 things: 1. refine the document of the config `spark.sql.catalog.session`. 2. add a public abstract class `CatalogExtension` for users to write implementations. TODOs for followup PRs: 1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one. 2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names. ## How was this patch tested? existing tests Closes #25104 from cloud-fan/session-catalog. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 21:14:37 +08:00
turbofei	d4eca7c99d	[SPARK-29000][SQL] Decimal precision overflow when don't allow precision loss ### What changes were proposed in this pull request? When spark.sql.decimalOperations.allowPrecisionLoss is set to false, the result of the SQL below overflows and returns null. Case a: `select case when 1=2 then 1 else 1.000000000000000000000001 end * 1` The division operation behaves similarly; the SQL below loses precision. Case b: `select case when 1=2 then 1 else 1.000000000000000000000001 end / 1` Let us check the code of TypeCoercion.scala. `a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L864-L875)`. For a binaryOperator, if the two operands have different data types, the rule ImplicitTypeCasts finds a common type and casts both operands to it. So, for the cases mentioned, the left operand is Decimal(34, 24) and the right operand is a Literal. Their common type is Decimal(34,24), and Literal(1) is cast to Decimal(34,24). Both operands are then of decimal type and are processed by the decimalAndDecimal method of the DecimalPrecision class. Let's check the relevant code. `a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (L123-L153)` When we don't allow precision loss, the result type of the multiply operation in case a is Decimal(38, 38), and that of the division operation in case b is Decimal(38, 20). The multiply operation in case a then overflows and the division operation in case b loses precision. In this PR, we skip handling the binaryOperator if DecimalType operands are involved, and the rule `DecimalPrecision` handles it instead. ### Why are the changes needed? Data will be corrupted without this change. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #25701 from turboFei/SPARK-29000. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 13:50:17 +08:00
Marco Gaido	c411579355	[SPARK-28916][SQL] Split subexpression elimination functions code for Generate[Mutable\|Unsafe]Projection ### What changes were proposed in this pull request? The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable\|Unsafe]Projection`. ### Why are the changes needed? Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? added UT Closes #25642 from mgaido91/SPARK-28916. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 13:30:56 +08:00
maryannxue	b2f06608b7	[SPARK-29002][SQL] Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions ### What changes were proposed in this pull request? This PR aims to avoid AQE regressions by avoiding changing a sort merge join to a broadcast hash join when the expected build plan has a high ratio of empty partitions, in which case sort merge join can actually perform faster. This PR achieves this by adding an internal join hint in order to let the planner know which side has this high ratio of empty partitions and it should avoid planning it as a build plan of a BHJ. Still, it won't affect the other side if the other side qualifies for a build plan of a BHJ. ### Why are the changes needed? It is a performance improvement for AQE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #25703 from maryannxue/aqe-demote-bhj. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-06 12:46:54 -07:00
Maxim Gekk	67b4329fb0	[SPARK-28690][SQL] Add `date_part` function for timestamps/dates ## What changes were proposed in this pull request? In the PR, I propose a new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`: ``` date_part('field', source) ``` and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT). The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are: - `millennium` - the current millennium for given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_. - `century` - the current century for given date (or timestamp). The first century starts at 0001-01-01 AD. - `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10. - `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January. - `year`, `month`, `day`, `hour`, `minute`, `second` - `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year. - `quarter` - the quarter of the year (1 - 4) - `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday) - `dow` - the day of the week as Sunday (0) to Saturday (6) - `isodow` - the day of the week as Monday (1) to Sunday (7) - `doy` - the day of the year (1 - 365/366) - `milliseconds` - the seconds field including fractional parts multiplied by 1,000. - `microseconds` - the seconds field including fractional parts multiplied by 1,000,000. - `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision. Here are examples: ```sql spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456'); 2019 spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456'); 33 spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456'); 224 ``` I changed the implementation of `extract` to re-use `date_part()` internally. ## How was this patch tested? Added `date_part.sql` and regenerated results of `extract.sql`. Closes #25410 from MaxGekk/date_part. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-06 23:36:00 +09:00
Takeshi Yamamuro	cb0cddffe9	[SPARK-21870][SQL] Split aggregation code into small functions ## What changes were proposed in this pull request? This pr proposed to split aggregation code into small functions in `HashAggregateExec`. In #18810, we got performance regression if JVMs didn't compile too long functions. I checked and I found the codegen of `HashAggregateExec` frequently goes over the limit when a query has too many aggregate functions (e.g., q66 in TPCDS). The current master places all the generated aggregation code in a single function. In this pr, I modified the code to assign an individual function for each aggregate function (e.g., `SUM` and `AVG`). For example, in a query `SELECT SUM(a), AVG(a) FROM VALUES(1) t(a)`, the proposed code defines two functions for `SUM(a)` and `AVG(a)` as follows; - generated code with this pr (https://gist.github.com/maropu/812990012bc967a78364be0fa793f559): ``` /* 173 / private void agg_doConsume_0(InternalRow inputadapter_row_0, long agg_expr_0_0, boolean agg_exprIsNull_0_0, double agg_expr_1_0, boolean agg_exprIsNull_1_0, long agg_expr_2_0, boolean agg_exprIsNull_2_0) throws java.io.IOException { / 174 / // do aggregate / 175 / // common sub-expressions / 176 / / 177 / // evaluate aggregate functions and update aggregation buffers / 178 / agg_doAggregate_sum_0(agg_exprIsNull_0_0, agg_expr_0_0); / 179 / agg_doAggregate_avg_0(agg_expr_1_0, agg_exprIsNull_1_0, agg_exprIsNull_2_0, agg_expr_2_0); / 180 / / 181 / } ... / 071 / private void agg_doAggregate_avg_0(double agg_expr_1_0, boolean agg_exprIsNull_1_0, boolean agg_exprIsNull_2_0, long agg_expr_2_0) throws java.io.IOException { / 072 / // do aggregate for avg / 073 / // evaluate aggregate function / 074 / boolean agg_isNull_19 = true; / 075 / double agg_value_19 = -1.0; ... 
/ 114 / private void agg_doAggregate_sum_0(boolean agg_exprIsNull_0_0, long agg_expr_0_0) throws java.io.IOException { / 115 / // do aggregate for sum / 116 / // evaluate aggregate function / 117 / agg_agg_isNull_11_0 = true; / 118 / long agg_value_11 = -1L; ``` - generated code in the current master (https://gist.github.com/maropu/e9d772af2c98d8991a6a5f0af7841760) ``` / 059 / private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0) throws java.io.IOException { / 060 / // do aggregate / 061 / // common sub-expressions / 062 / boolean agg_isNull_4 = false; / 063 / long agg_value_4 = -1L; / 064 / if (!false) { / 065 / agg_value_4 = (long) agg_expr_0_0; / 066 / } / 067 / // evaluate aggregate function / 068 / agg_agg_isNull_7_0 = true; / 069 / long agg_value_7 = -1L; / 070 / do { / 071 / if (!agg_bufIsNull_0) { / 072 / agg_agg_isNull_7_0 = false; / 073 / agg_value_7 = agg_bufValue_0; / 074 / continue; / 075 / } / 076 / / 077 / boolean agg_isNull_9 = false; / 078 / long agg_value_9 = -1L; / 079 / if (!false) { / 080 / agg_value_9 = (long) 0; / 081 / } / 082 / if (!agg_isNull_9) { / 083 / agg_agg_isNull_7_0 = false; / 084 / agg_value_7 = agg_value_9; / 085 / continue; / 086 / } / 087 / / 088 / } while (false); / 089 / / 090 / long agg_value_6 = -1L; / 091 / / 092 / agg_value_6 = agg_value_7 + agg_value_4; / 093 / boolean agg_isNull_11 = true; / 094 / double agg_value_11 = -1.0; / 095 / / 096 / if (!agg_bufIsNull_1) { / 097 / agg_agg_isNull_13_0 = true; / 098 / double agg_value_13 = -1.0; / 099 / do { / 100 / boolean agg_isNull_14 = agg_isNull_4; / 101 / double agg_value_14 = -1.0; / 102 / if (!agg_isNull_4) { / 103 / agg_value_14 = (double) agg_value_4; / 104 / } / 105 / if (!agg_isNull_14) { / 106 / agg_agg_isNull_13_0 = false; / 107 / agg_value_13 = agg_value_14; / 108 / continue; / 109 / } / 110 / / 111 / boolean agg_isNull_15 = false; / 112 / double agg_value_15 = -1.0; / 113 / if (!false) { / 114 / agg_value_15 = (double) 0; / 115 / } / 116 
/ if (!agg_isNull_15) { / 117 / agg_agg_isNull_13_0 = false; / 118 / agg_value_13 = agg_value_15; / 119 / continue; / 120 / } / 121 / / 122 / } while (false); / 123 / / 124 / agg_isNull_11 = false; // resultCode could change nullability. / 125 / / 126 / agg_value_11 = agg_bufValue_1 + agg_value_13; / 127 / / 128 / } / 129 / boolean agg_isNull_17 = false; / 130 / long agg_value_17 = -1L; / 131 / if (!false && agg_isNull_4) { / 132 / agg_isNull_17 = agg_bufIsNull_2; / 133 / agg_value_17 = agg_bufValue_2; / 134 / } else { / 135 / boolean agg_isNull_20 = true; / 136 / long agg_value_20 = -1L; / 137 / / 138 / if (!agg_bufIsNull_2) { / 139 / agg_isNull_20 = false; // resultCode could change nullability. / 140 / / 141 / agg_value_20 = agg_bufValue_2 + 1L; / 142 / / 143 / } / 144 / agg_isNull_17 = agg_isNull_20; / 145 / agg_value_17 = agg_value_20; / 146 / } / 147 / // update aggregation buffer / 148 / agg_bufIsNull_0 = false; / 149 / agg_bufValue_0 = agg_value_6; / 150 / / 151 / agg_bufIsNull_1 = agg_isNull_11; / 152 / agg_bufValue_1 = agg_value_11; / 153 / / 154 / agg_bufIsNull_2 = agg_isNull_17; / 155 / agg_bufValue_2 = agg_value_17; / 156 / / 157 */ } ``` You can check the previous discussion in https://github.com/apache/spark/pull/19082 ## How was this patch tested? Existing tests Closes #20965 from maropu/SPARK-21870-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-06 11:45:14 +08:00
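The generated code above leans on a `do { ... continue; } while (false);` block to evaluate a coalesce-style expression: the first non-null candidate is stored and `continue` skips the remaining candidates. A minimal hand-written sketch of that control-flow idiom follows; the class and method names are hypothetical, and this is an illustration, not output of Spark's code generator.

```java
// Hypothetical names throughout; a hand-written illustration of the
// do/while(false) early-exit idiom, not code emitted by Spark.
final class CoalesceSketch {

    /** Picks the buffer value if it is non-null, else the literal 0, then adds the input. */
    static long sumUpdate(boolean bufIsNull, long bufValue, long input) {
        boolean isNull = true;
        long value = -1L;
        do {
            if (!bufIsNull) {
                isNull = false;
                value = bufValue;
                // `continue` jumps to the `while (false)` test, which ends the block,
                // so the fallback candidate below is skipped.
                continue;
            }
            // Fallback candidate: a literal, which is never null.
            isNull = false;
            value = 0L;
        } while (false);
        // At this point isNull is always false, because the last candidate is a literal.
        return value + input;
    }

    public static void main(String[] args) {
        System.out.println(sumUpdate(true, -1L, 5L));   // empty buffer: falls back to 0
        System.out.println(sumUpdate(false, 10L, 5L));  // existing buffer value is used
    }
}
```

The idiom gives each candidate branch a cheap "break out" without needing labels or nested `if`/`else` chains, which is convenient when the branches are produced mechanically.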
WeichenXu	f8bc91f749	[SPARK-28782][SQL] Generator support in aggregate expressions ### What changes were proposed in this pull request? Support generator in aggregate expressions. In this PR, I check the aggregate logical plan, if its aggregateExpressions include generator, then convert this agg plan into "normal agg plan + generator plan + projection plan". I.e: ``` aggregate(with generator) \|--child_plan ``` ===> ``` project \|--generator(resolved) \|--aggregate \|--child_plan ``` ### Why are the changes needed? We should support sql like: ``` select explode(array(min(a), max(a))) from t ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test added. Closes #25512 from WeichenXu123/explode_bug. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 16:17:49 +08:00
Ryan Blue	5adaa2e103	[SPARK-28979][SQL] Rename UnresolvedTable to V1Table ### What changes were proposed in this pull request? Rename `UnresolvedTable` to `V1Table` because it is not unresolved. ### Why are the changes needed? The class name is inaccurate. This should be fixed before it is in a release. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25683 from rdblue/SPARK-28979-rename-unresolved-table. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 11:41:21 +08:00
maryannxue	a7a3935c97	[SPARK-11150][SQL] Dynamic Partition Pruning ### What changes were proposed in this pull request? This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: 1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or 2. As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise 3. As a bypassed condition (`true`). ### Why are the changes needed? This is an important performance feature. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT - Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature. - Testing DPP with reused broadcast results. - Testing the key iterators on different HashedRelation types. - Testing the packing and unpacking of the broadcast keys in a LongType. Closes #25600 from maryannxue/dpp. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-04 13:13:23 -07:00
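The three-way planning heuristic described in the commit above can be sketched as a small decision function. This is only an illustration under stated assumptions: the class, enum, and parameter names are hypothetical, and the two cost estimates are passed in as plain numbers rather than computed the way Spark computes them.

```java
// Hypothetical sketch of the planning choice listed in the commit message.
// None of these names come from Spark's code base.
final class PruningPlanSketch {

    enum Strategy {
        REUSE_BROADCAST,     // option 1: broadcast relation, later reused by the exchange-reuse rule
        DUPLICATE_SUBQUERY,  // option 2: run the filtering side again as a subquery
        BYPASS_TRUE          // option 3: replace the pruning filter with the literal `true`
    }

    static Strategy choose(boolean isBroadcastHashJoin,
                           double estimatedScanSavings,
                           double estimatedSubqueryCost) {
        if (isBroadcastHashJoin) {
            return Strategy.REUSE_BROADCAST;
        }
        // Duplicate the subquery only when the saved partition scan outweighs the extra scan.
        if (estimatedScanSavings > estimatedSubqueryCost) {
            return Strategy.DUPLICATE_SUBQUERY;
        }
        return Strategy.BYPASS_TRUE;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, 0.0, 100.0));
        System.out.println(choose(false, 500.0, 100.0));
        System.out.println(choose(false, 50.0, 100.0));
    }
}
```

Ordering the checks this way mirrors the commit text: the broadcast case wins outright, and the bypass is the fallback when neither of the other options applies.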
Xianjin YE	d5688dc732	[SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table ## What changes were proposed in this pull request? Datasource tables have supported partitioned tables for a long time. This commit adds the ability to translate InsertIntoTable(HiveTableRelation) into a datasource table insertion. ## How was this patch tested? Existing tests with some modification Closes #25306 from advancedxy/SPARK-28573. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-03 13:40:06 +08:00
sandeep katta	e1946a598b	[SPARK-28705][SQL][TEST] Drop tables after being used in AnalysisExternalCatalogSuite ## What changes were proposed in this pull request? Drop the table after the test `query builtin functions don't call the external catalog` has executed. This is required for [SPARK-25464](https://github.com/apache/spark/pull/22466) ## How was this patch tested? existing UT Closes #25427 from sandeep-katta/cleanuptable. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-02 20:32:32 +09:00
HyukjinKwon	bd3915e356	Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API" This reverts commit `3821d75b83`.	2019-09-02 12:47:14 +09:00
Sean Owen	eb037a8180	[SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations ### What changes were proposed in this pull request? The Experimental and Evolving annotations are both (like Unstable) used to express that an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R. The changes below can be summarized as: - Generally, anything introduced at or before Spark 2.3.0 is no longer marked as Evolving or Experimental - Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched - I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering) It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those. ### Why are the changes needed? Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25558 from srowen/SPARK-28855. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-01 10:15:00 -05:00
Ryan Blue	3821d75b83	[SPARK-28612][SQL] Add DataFrameWriterV2 API ## What changes were proposed in this pull request? This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API: * Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans. * Only creates v2 logical plans so the behavior is always consistent. * Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`. Here are a few example uses of the new API: ```scala df.writeTo("catalog.db.table").append() df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01") df.writeTo("catalog.db.table").overwritePartitions() df.writeTo("catalog.db.table").asParquet.create() df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace() df.writeTo("catalog.db.table").using("abc").replace() ``` ## How was this patch tested? Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans. Closes #25354 from rdblue/SPARK-28612-add-data-frame-writer-v2. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-08-31 21:28:20 -07:00
younggyu chun	3b07a4eb28	[SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type ## What changes were proposed in this pull request? This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. See the following for the string representations that other databases accept for the boolean type: https://www.postgresql.org/docs/devel/datatype-boolean.html https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html ## How was this patch tested? Added new tests to CastSuite. Closes #25458 from younggyuchun/SPARK-27931. Authored-by: younggyu chun <younggyuchun@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-30 14:18:13 -07:00
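The input rule in the commit above (six accepted words, any unique prefix of them, surrounding whitespace ignored) can be sketched in a few lines. This is a rough illustration rather than the code added to Spark's cast logic: the class and method names are hypothetical, and case-insensitive matching is an assumption that the commit message does not state.

```java
// Hypothetical sketch of the string-to-boolean rule described in the commit message.
final class BooleanInputSketch {
    private static final String[] TRUE_WORDS = {"true", "yes", "1"};
    private static final String[] FALSE_WORDS = {"false", "no", "0"};

    /** Returns TRUE or FALSE for recognized input, or null when the input is not recognized. */
    static Boolean parse(String raw) {
        if (raw == null) {
            return null;
        }
        // Trim whitespace; lower-casing is an assumption made for this sketch.
        String s = raw.trim().toLowerCase(java.util.Locale.ROOT);
        if (s.isEmpty()) {
            return null;
        }
        // The six words start with six different characters, so every
        // non-empty prefix identifies at most one of them.
        for (String word : TRUE_WORDS) {
            if (word.startsWith(s)) {
                return Boolean.TRUE;
            }
        }
        for (String word : FALSE_WORDS) {
            if (word.startsWith(s)) {
                return Boolean.FALSE;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("  yes "));
        System.out.println(parse("f"));
        System.out.println(parse("maybe"));
    }
}
```

Returning `null` for unrecognized input is a choice made for the sketch; how the real cast reports invalid input is not described in the commit message.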
Burak Yavuz	827969399b	[SPARK-28668][SQL] Support V2SessionCatalog for ALTER TABLE ### What changes were proposed in this pull request? Adds support for the V2SessionCatalog for ALTER TABLE statements. Implementation changes are ~50 loc. The rest is just test refactoring. ### Why are the changes needed? To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements. ### How was this patch tested? By re-using existing tests in DataSourceV2SQLSuite. Closes #25502 from brkyvz/alterV3. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-30 14:16:47 +08:00
Wenchen Fan	f8f7c52f12	[SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core ### What changes were proposed in this pull request? There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them. After merging, there are 3 classes: 1. `InMemoryTable` 2. `InMemoryTableCatalog` 3. `StagingInMemoryTableCatalog` For better maintainability, these 3 classes are put in 3 different files. ### Why are the changes needed? reduce duplicated code ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #25610 from cloud-fan/dsv2-test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Ryan Blue <blue@apache.org>	2019-08-29 12:56:19 -07:00
Gengliang Wang	24655583f1	[SPARK-28495][SQL][FOLLOW-UP] Disallow conversions between timestamp and long in ANSI mode ### What changes were proposed in this pull request? Disallow conversions between `timestamp` type and `long` type in table insertion with ANSI store assignment policy. ### Why are the changes needed? In the PR https://github.com/apache/spark/pull/25581, timestamp type is allowed to be converted to long type, since timestamp type is represented by long type internally, and both legacy mode and strict mode allow the conversion. After reconsideration, I think we should disallow it. As per ANSI SQL section "4.4.2 Characteristics of numbers": > A number is assignable only to sites of numeric type. In PostgreSQL, the conversion between timestamp and long is also disallowed. ### Does this PR introduce any user-facing change? Conversion between timestamp and long is disallowed in table insertion with ANSI store assignment policy. ### How was this patch tested? Unit test Closes #25615 from gengliangwang/disallowTimeStampToLong. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-29 19:59:24 +08:00
Gengliang Wang	9d6bec183c	[SPARK-28730][SPARK-28495][SQL][FOLLOW-UP] Revise the doc of option spark.sql.storeAssignmentPolicy ### What changes were proposed in this pull request? Revise the documentation of SQL option `spark.sql.storeAssignmentPolicy`. ### Why are the changes needed? 1. Need to point out that the ANSI mode is mostly the same as PostgreSQL 2. Need to point out that Legacy mode allows type coercion as long as it is a valid cast 3. Better examples. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test Closes #25605 from gengliangwang/reviseDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-28 19:59:53 +08:00
Yuming Wang	e3b32da027	[SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs ## What changes were proposed in this pull request? This PR update `spark.sql.statistics.fallBackToHdfs`'s doc: 1. This flag is effective only if it is Hive table. 2. For non-partitioned data source table, it will be automatically recalculated if table statistics are not available 3. For partitioned data source table, It is 'spark.sql.defaultSizeInBytes' if table statistics are not available. Related code: - Non-partitioned data source table: [SizeInBytesOnlyStatsPlanVisitor.default()](`98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)`) -> [LogicalRelation.computeStats()](`a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)`) -> [HadoopFsRelation.sizeInBytes()](`c0632cec04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala (L72-L75)`) -> [PartitioningAwareFileIndex.sizeInBytes()](`b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103)`) `PartitioningAwareFileIndex.sizeInBytes()` is calculated by [`allFiles().map(_.getLen).sum`](`b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103)`) if table statistics are not available. 
- Partitioned data source table: [SizeInBytesOnlyStatsPlanVisitor.default()](`98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)`) -> [LogicalRelation.computeStats()](`a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)`) -> [CatalogFileIndex.sizeInBytes](`5d672b7f3e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala (L41)`) `CatalogFileIndex.sizeInBytes` is [spark.sql.defaultSizeInBytes](`c30b5297bc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L387)`) if table statistics are not available. ## How was this patch tested? N/A Closes #24715 from wangyum/SPARK-25474. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-28 19:15:26 +08:00
Gengliang Wang	2b24a71fec	[SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion ### What changes were proposed in this pull request? Introduce ANSI store assignment policy for table insertion. With ANSI policy, Spark performs the type coercion of table insertion as per ANSI SQL. ### Why are the changes needed? In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column. In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast were originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. This is a breaking change. It is possible that existing queries are broken after 3.0 releases. Following ANSI SQL standard makes Spark consistent with the table insertion behaviors of popular DBMS like PostgreSQL/Oracle/MySQL. ### Does this PR introduce any user-facing change? A new optional mode for table insertion. ### How was this patch tested? Unit test Closes #25581 from gengliangwang/ANSImode. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 22:13:23 +08:00
WeichenXu	7f605f5559	[SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true ### What changes were proposed in this pull request? Make `spark.sql.crossJoin.enabled` default value true ### Why are the changes needed? For an implicit cross join, we can set up a watchdog to cancel it if it runs for a long time. When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in the logical plan stage, it may generate mismatching errors which may confuse end users: * it's done in the logical phase, so we may fail queries that can be executed via broadcast join, which is very fast. * if we move the check to the physical phase, then a query may succeed at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing. * the CROSS JOIN syntax doesn't work well if join reorder happens. * some non-equi-joins will generate a plan using cartesian product, but `CheckCartesianProducts` does not detect it or raise an error. So, in order to address this in a simpler way, we can turn off showing this cross-join error by default. For reference, I list some cases raising mismatching errors here: Providing: ``` spark.range(2).createOrReplaceTempView("sm1") // can be broadcast spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast ``` 1) Some joins could be converted to a broadcast nested loop join, but CheckCartesianProducts raises an error. e.g. ``` select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id ``` 2) Some joins will run as CartesianJoin but CheckCartesianProducts DOES NOT raise an error. e.g. ``` select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id ``` ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #25520 from WeichenXu123/SPARK-28621. 
Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 21:53:37 +08:00
Wenchen Fan	cb06209fc9	[SPARK-28747][SQL] merge the two data source v2 fallback configs ## What changes were proposed in this pull request? Currently we have 2 configs to specify which v2 sources should fall back to the v1 code path. One config for the read path, and one config for the write path. However, I found it's awkward to work with these 2 configs: 1. for `CREATE TABLE USING format`, should this be read path or write path? 2. for `V2SessionCatalog.loadTable`, we need to return `UnresolvedTable` if it's a DS v1 or we need to fall back to the v1 code path. However, at that time, we don't know if the returned table will be used for read or write. We don't have any new features or perf improvement in file source v2. The fallback API is just a safeguard if we have bugs in v2 implementations. There is little benefit in supporting fallback to v1 for the read and write paths separately. This PR proposes to merge these 2 configs into one. ## How was this patch tested? existing tests Closes #25465 from cloud-fan/merge-conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 20:47:24 +08:00
Burak Yavuz	e31aec9be4	[SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog ### What changes were proposed in this pull request? This PR adds support for INSERT INTO through both the SQL and DataFrameWriter APIs through the V2SessionCatalog. ### Why are the changes needed? This will allow V2 tables to be plugged in through the V2SessionCatalog, and be used seamlessly with existing APIs. ### Does this PR introduce any user-facing change? No behavior changes. ### How was this patch tested? Pulled out a lot of tests so that they can be shared across the DataFrameWriter and SQL code paths. Closes #25507 from brkyvz/insertSesh. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 12:59:53 +08:00
Dilip Biswal	c61270fd74	[SPARK-27395][SQL] Improve EXPLAIN command ## What changes were proposed in this pull request? This PR aims at improving the way physical plans are explained in Spark. Currently, the explain output for a physical plan may look very cluttered, and each operator's string representation can be very wide and wrap around in the display, making it a little hard to follow. This happens especially when explaining a query that 1) operates on wide tables or 2) has complex expressions. This PR attempts to split the output into two sections. In the header section, we display the basic operator tree with a number associated with each operator. In this section, we strictly control what we output for each operator. In the footer section, each operator is verbosely displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be correlated by the originating expression id from its parent plan. To illustrate, here is a simple plan displayed in the old way and the new way. 
Example query1 : ``` EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0 ``` Old : ``` (2) Project [key#2, max(val)#15] +- (2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0)) +- (2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18]) +- Exchange hashpartitioning(key#2, 200) +- (1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21]) +- (1) Project [key#2, val#3] +- (1) Filter (isnotnull(key#2) AND (key#2 > 0)) +- (1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int> ``` New : ``` Project (8) +- Filter (7) +- HashAggregate (6) +- Exchange (5) +- HashAggregate (4) +- Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] Condition : (isnotnull(key#2) AND (key#2 > 0)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] (4) HashAggregate [codegen id : 1] Input: [key#2, val#3] (5) Exchange Input: [key#2, max#11] (6) HashAggregate [codegen id : 2] Input: [key#2, max#11] (7) Filter [codegen id : 2] Input : [key#2, max(val)#5, max(val#3)#8] Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0)) (8) Project [codegen id : 2] Output : [key#2, max(val)#5] Input : [key#2, max(val)#5, max(val#3)#8] ``` Example Query2 (subquery): ``` SELECT FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3 ``` Old: ``` (1) Project [key#2, val#3] +- (1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) 
AND (val#3 > 3)) : +- Subquery scalar-subquery#39 : +- (2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45]) : +- Exchange SinglePartition : +- (1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47]) : +- (1) Project [key#26] : +- (1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2)) : : +- Subquery scalar-subquery#38 : : +- (2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49]) : : +- (1) Project [key#28] : : +- (1) Filter (isnotnull(val#29) AND (val#29 > 0)) : : +- (1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int> : +- (1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int> +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int> ``` New: ``` Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] Condition : (((isnotnull(KEY#2) AND 
isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] ===== Subqueries ===== Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23 HashAggregate (9) +- Exchange (8) +- HashAggregate (7) +- Project (6) +- Filter (5) +- Scan parquet default.explain_temp2 (4) (4) Scan parquet default.explain_temp2 [codegen id : 1] Output: [key#26, val#27] (5) Filter [codegen id : 1] Input : [key#26, val#27] Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2)) (6) Project [codegen id : 1] Output : [key#26] Input : [key#26, val#27] (7) HashAggregate [codegen id : 1] Input: [key#26] (8) Exchange Input: [max#35] (9) HashAggregate [codegen id : 2] Input: [max#35] Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22 HashAggregate (15) +- Exchange (14) +- HashAggregate (13) +- Project (12) +- Filter (11) +- Scan parquet default.explain_temp3 (10) (10) Scan parquet default.explain_temp3 [codegen id : 1] Output: [key#28, val#29] (11) Filter [codegen id : 1] Input : [key#28, val#29] Condition : (isnotnull(val#29) AND (val#29 > 0)) (12) Project [codegen id : 1] Output : [key#28] Input : [key#28, val#29] (13) HashAggregate [codegen id : 1] Input: [key#28] (14) Exchange Input: [max#37] (15) HashAggregate [codegen id : 2] Input: [max#37] ``` Note: I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow would not be able to immediately incorporate the feedback. I will start to work on them as soon as i can. Also, currently this PR provides a basic infrastructure for explain enhancement. The details about individual operators will be implemented in follow-up prs ## How was this patch tested? Added a new test `explain.sql` that tests basic scenarios. Need to add more tests. Closes #24759 from dilipbiswal/explain_feature. 
Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-26 20:37:13 +08:00
Terry Kim	a3328cdc0a	[SPARK-28238][SQL][FOLLOW-UP] Clean up attributes for Datasource v2 DESCRIBE TABLE ### What changes were proposed in this pull request? 1. Fix the physical plan (`DescribeTableExec`) to have the same output attributes as the corresponding logical plan. 2. Remove `output` in statements since they are unresolved plans. ### Why are the changes needed? Correctness of how output attributes should work. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Existing tests Closes #25568 from imback82/describe_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-26 13:39:36 +08:00
Gengliang Wang	8258660f67	[SPARK-28741][SQL] Optional mode: throw exceptions when casting to integers causes overflow ## What changes were proposed in this pull request? To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow. The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetical operation overflow. To unify it, the configuration is renamed from "spark.sql.arithmeticOperations.failOnOverFlow" to "spark.sql.failOnIntegerOverFlow" ## How was this patch tested? Unit test Closes #25461 from gengliangwang/AnsiCastIntegral. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-23 21:49:45 +08:00
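The difference between the two cast behaviors in the commit above can be shown with plain Java arithmetic. This is a minimal sketch, not Spark's cast implementation: the class name and the `failOnOverflow` parameter are hypothetical stand-ins for the configurable mode, and `Math.toIntExact` is used here only because it is a standard-library call that throws on overflow.

```java
// Hypothetical sketch contrasting a wrapping cast with a checked cast.
final class IntegerCastSketch {

    static int castLongToInt(long value, boolean failOnOverflow) {
        if (failOnOverflow) {
            // Throws ArithmeticException when the value does not fit in an int.
            return Math.toIntExact(value);
        }
        // A plain narrowing cast keeps only the low 32 bits, silently wrapping.
        return (int) value;
    }

    public static void main(String[] args) {
        long tooBig = Integer.MAX_VALUE + 1L;
        System.out.println(castLongToInt(tooBig, false)); // wraps to a negative value
        try {
            castLongToInt(tooBig, true);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

The wrapping branch is what makes silent overflow hard to notice downstream, which is the motivation the commit gives for an optional failing mode.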
Ali Afroozeh	aef7ca1f0b	[SPARK-28836][SQL] Remove the canonicalize(attributes) method from PlanExpression ### What changes were proposed in this pull request? This PR removes the `canonicalize(attrs: AttributeSeq)` method from `PlanExpression` and takes care of normalizing expressions in `QueryPlan`. ### Why are the changes needed? `Expression` already has a `canonicalized` method, and having the `canonicalize` method in `PlanExpression` is confusing. ### Does this PR introduce any user-facing change? Removes the `canonicalize` method from `PlanExpression`. Also renames `normalizeExprId` to `normalizeExpressions` in query plan. ### How was this patch tested? This PR is a refactoring and passes the existing tests Closes #25534 from dbaliafroozeh/ImproveCanonicalizeAPI. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-08-23 13:26:58 +02:00
terryk	98e1a4cea4	[SPARK-28319][SQL] Implement SHOW TABLES for Data Source V2 Tables ## What changes were proposed in this pull request? Implements the SHOW TABLES logical and physical plans for data source v2 tables. ## How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25247 from imback82/dsv2_show_tables. Lead-authored-by: terryk <yuminkim@gmail.com> Co-authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-23 14:20:25 +08:00
Gengliang Wang	895c90b582	[SPARK-28730][SQL] Configurable type coercion policy for table insertion ## What changes were proposed in this pull request? After all the discussions in the dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562. Here I propose that we can make the store assignment rules in the analyzer configurable, and the behavior of V1 and V2 should be consistent. When inserting a value into a column with a different data type, Spark will perform type coercion. After this PR, we support 2 policies for the type coercion rules: legacy and strict. 1. With legacy policy, Spark allows casting any value to any data type. The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive. 2. With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not allowed. Eventually, the "legacy" mode will be removed, so it is disallowed in data source V2. To ensure backward compatibility with existing queries, the default store assignment policy for data source V1 is "legacy". ## How was this patch tested? Unit test Closes #25453 from gengliangwang/tableInsertRule. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-23 13:50:26 +08:00
triplesheep	48578a41b5	[SPARK-28844][SQL] Fix typo in SQLConf FILE_COMRESSION_FACTOR ### What changes were proposed in this pull request? Fix minor typo in SQLConf. `FILE_COMRESSION_FACTOR` -> `FILE_COMPRESSION_FACTOR` ### Why are the changes needed? Make conf more understandable. ### Does this PR introduce any user-facing change? No. (`spark.sql.sources.fileCompressionFactor` is unchanged.) ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #25538 from triplesheep/TYPO-FIX. Authored-by: triplesheep <triplesheep0419@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-22 00:07:40 -07:00


3890 commits