ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Burak Yavuz	4855bfe16b	[SPARK-28554][SQL] Adds a v1 fallback writer implementation for v2 data source codepaths ## What changes were proposed in this pull request? This PR adds a V1 fallback interface for writing to V2 Tables using V1 Writer interfaces. The only supported SaveMode that will be called on the target table will be an Append. The target table must use V2 interfaces such as `SupportsOverwrite` or `SupportsTruncate` to support Overwrite operations. It is up to the target DataSource implementation if this operation can be atomic or not. We do not support dynamicPartitionOverwrite, as we cannot call a `commit` method that actually cleans up the data in the partitions that were touched through this fallback. ## How was this patch tested? Will add tests and example implementation after comments + feedback. This is a proposal at this point. Closes #25348 from brkyvz/v1WriteFallback. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-21 17:25:25 +08:00
Marco Gaido	0bfcf9c210	[SPARK-28322][SQL] Add support to Decimal type for integral divide ## What changes were proposed in this pull request? The expression `IntegralDivide`, which corresponds to the `div` operator, supports only integral types. Postgres, though, also allows it to work with decimals. The PR adds support for decimal operands to this operation in order to have feature parity with Postgres. ## How was this patch tested? added UTs Closes #25136 from mgaido91/SPARK-28322. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-21 08:43:00 +09:00
Wenchen Fan	d04522187a	[SPARK-28635][SQL] create CatalogManager to track registered v2 catalogs ## What changes were proposed in this pull request? This is a pure refactor PR, which creates a new class `CatalogManager` to track the registered v2 catalogs and provide the catalog lookup functionality. `CatalogManager` also tracks the current catalog/namespace. We will implement corresponding commands in other PRs, like `USE CATALOG my_catalog` ## How was this patch tested? existing tests Closes #25368 from cloud-fan/refactor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-20 19:40:21 +08:00
Sean Owen	3b4e345fa1	[SPARK-28775][CORE][TESTS] Skip date 8633 in Kwajalein due to changes in tzdata2018i that only some JDK 8s use ### What changes were proposed in this pull request? Some newer JDKs use the tzdata2018i database, which changes how certain (obscure) historical dates and timezones are handled. As previously, we can pretty much safely ignore these in tests, as the value may vary by JDK. ### Why are the changes needed? Test otherwise fails using, for example, JDK 1.8.0_222. https://bugs.openjdk.java.net/browse/JDK-8215982 has a full list of JDKs which has this. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests Closes #25504 from srowen/SPARK-28775. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-19 17:54:25 -07:00
Mick Jermsurawong	b79cf0d143	[SPARK-28224][SQL] Check overflow in decimal Sum aggregate ## What changes were proposed in this pull request? - Currently `sum` in aggregates for decimal type can overflow and return null. - `Sum` expression codegens arithmetic on `sql.Decimal` and the output which preserves scale and precision goes into `UnsafeRowWriter`. Here overflowing will be converted to null when writing out. - It also does not go through this branch in `DecimalAggregates` because it's expecting precision of the sum (not the elements to be summed) to be less than 5. `4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L1400-L1403)` - This PR adds the check at the final result of the sum operator itself. `4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (L372-L376)` https://issues.apache.org/jira/browse/SPARK-28224 ## How was this patch tested? - Added an integration test on dataframe suite cc mgaido91 JoshRosen Closes #25033 from mickjermsurawong-stripe/SPARK-28224. Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-20 09:47:04 +09:00
Takuya UESHIN	26f344354b	[SPARK-27905][SQL][FOLLOW-UP] Add prettyNames ### What changes were proposed in this pull request? This is a follow-up of #24761 which added a higher-order function `ArrayForAll`. The PR mistakenly removed the `prettyName` from `ArrayExists` and forgot to add it to `ArrayForAll`. ### Why are the changes needed? This reverts the `prettyName` back to `ArrayExists` not to affect explained plans, and adds it to `ArrayForAll` to clarify the `prettyName` as the same as the expressions around. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25501 from ueshin/issues/SPARK-27905/pretty_names. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-19 15:15:50 -07:00
Yuming Wang	c308ab5a29	[MINOR][SQL] Make analysis error msg more meaningful on DISTINCT queries ## What changes were proposed in this pull request? This PR makes analysis error messages more meaningful when the function does not support the modifier DISTINCT: ```sql postgres=# select upper(distinct a) from (values('a'), ('b')) v(a); ERROR: DISTINCT specified, but upper is not an aggregate function LINE 1: select upper(distinct a) from (values('a'), ('b')) v(a); spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a); Error in query: upper does not support the modifier DISTINCT; line 1 pos 7 spark-sql> ``` After this pr: ```sql spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a); Error in query: DISTINCT specified, but upper is not an aggregate function; line 1 pos 7 spark-sql> ``` ## How was this patch tested? Unit test Closes #25486 from wangyum/DISTINCT. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-18 08:36:01 -07:00
pavithra	c48e381214	[SPARK-28671][SQL] Throw NoSuchPermanentFunctionException for a non-existent permanent function in dropFunction/alterFunction ## What changes were proposed in this pull request? Before the fix: when a non-existent permanent function was dropped, a generic NoSuchFunctionException was thrown, which printed "This function is neither a registered temporary function nor a permanent function registered in the database". This creates ambiguity when a temporary function with the same name exists. After the fix: NoSuchPermanentFunctionException is thrown, which prints "NoSuchPermanentFunctionException:Function not found in database " ## How was this patch tested? The unit tests were run and corrected. Closes #25394 from PavithraRamachandran/funcIssue. Lead-authored-by: pavithra <pavi.rams@gmail.com> Co-authored-by: pavithraramachandran <pavi.rams@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-16 22:46:04 +09:00
Burak Yavuz	0526529b31	[SPARK-28666] Support saveAsTable for V2 tables through Session Catalog ## What changes were proposed in this pull request? We add support for the V2SessionCatalog for saveAsTable, such that V2 tables can plug in and leverage existing DataFrameWriter.saveAsTable APIs to write and create tables through the session catalog. ## How was this patch tested? Unit tests. A lot of tests broke under hive when things were not working properly under `ResolveTables`, therefore I believe the current set of tests should be sufficient in testing the table resolution and read code paths. Closes #25402 from brkyvz/saveAsV2. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-15 12:29:34 +08:00
Maxim Gekk	3a4afce96c	[SPARK-28687][SQL] Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()` ## What changes were proposed in this pull request? In the PR, I propose new expressions `Epoch`, `IsoYear`, `Milliseconds` and `Microseconds`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT): 1. `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision. 2. `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January. 3. `milliseconds` - the seconds field including fractional parts multiplied by 1,000. 4. `microseconds` - the seconds field including fractional parts multiplied by 1,000,000. Here are examples: ```sql spark-sql> SELECT EXTRACT(EPOCH FROM TIMESTAMP '2019-08-11 19:07:30.123456'); 1565550450.123456 spark-sql> SELECT EXTRACT(ISOYEAR FROM DATE '2006-01-01'); 2005 spark-sql> SELECT EXTRACT(MILLISECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456'); 30123.456 spark-sql> SELECT EXTRACT(MICROSECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456'); 30123456 ``` ## How was this patch tested? Added new tests to `DateExpressionsSuite`, and uncommented existing tests in `extract.sql` and `pgSQL/date.sql`. Closes #25408 from MaxGekk/extract-ext3. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-14 08:44:44 -07:00
xy_xin	2eeb25e52d	[SPARK-28351][SQL] Support DELETE in DataSource V2 ## What changes were proposed in this pull request? This PR adds DELETE support for V2 datasources. As a first step, it only supports delete by source filters: ```scala void delete(Filter[] filters); ``` which cannot handle complicated cases like subqueries. Since it is awkward to embed the implementation of DELETE in the current V2 APIs, a new datasource mix-in is added, called `SupportsMaintenance`, similar to `SupportsRead` and `SupportsWrite`. A datasource which can be maintained means we can perform DELETE/UPDATE/MERGE/OPTIMIZE on the datasource, as long as the datasource implements the necessary mix-ins. ## How was this patch tested? new test case. Closes #25115 from xianyinxin/SPARK-28351. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-14 23:38:45 +08:00
Edgar Rodriguez	598fcbe5ed	[SPARK-28265][SQL] Add renameTable to TableCatalog API ## What changes were proposed in this pull request? This PR adds the `renameTable` call to the `TableCatalog` API, as described in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d). This PR is related to: https://github.com/apache/spark/pull/24246 ## How was this patch tested? Added unit tests and contract tests. Closes #25206 from edgarRd/SPARK-28265-add-rename-table-catalog-api. Authored-by: Edgar Rodriguez <edgar.rd@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-14 14:24:13 +08:00
Dilip Biswal	331f2657d9	[SPARK-27768][SQL] Support Infinity/NaN-related float/double literals case-insensitively ## What changes were proposed in this pull request? Here is the problem description from the JIRA. ``` When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results. SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('infinity'), ('1')) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('infinity'), ('infinity')) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('-infinity'), ('infinity')) v(x); The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way. Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html ``` In this PR, the casting code is enhanced to handle these `special` string literals in case insensitive manner. ## How was this patch tested? Added tests in CastSuite and modified existing test suites. Closes #25331 from dilipbiswal/double_infinity. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-13 16:48:30 -07:00
Maxim Gekk	3d85c54895	[SPARK-28700][SQL] Use DECIMAL type for `sec` in `make_timestamp()` ## What changes were proposed in this pull request? Changed type of `sec` argument in the `make_timestamp()` function from `DOUBLE` to `DECIMAL(8, 6)`. The scale is set to 6 to cover microsecond fractions, and the precision is 2 digits for seconds + 6 digits for microsecond fraction. New type prevents losing precision in some cases, for example: Before: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001); 2019-08-12 00:00:58 ``` After: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001); 2019-08-12 00:00:58.000001 ``` Also switching to `DECIMAL` fixes rounding `sec` towards "nearest neighbor" unless both neighbors are equidistant, in which case round up. For example: Before: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567); 2019-08-12 00:00:00.123456 ``` After: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567); 2019-08-12 00:00:00.123457 ``` ## How was this patch tested? This was tested by `DateExpressionsSuite` and `pgSQL/timestamp.sql`. Closes #25421 from MaxGekk/make_timestamp-decimal. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-13 15:51:50 -07:00
Maxim Gekk	f04a766946	[SPARK-28718][SQL] Support `field` synonyms at `extract` ## What changes were proposed in this pull request? In the PR, I propose additional synonyms for the `field` argument of `extract` supported by PostgreSQL. The `extract.sql` is updated to check all supported values of the `field` argument. The list of synonyms was taken from https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/datetime.c . ## How was this patch tested? By running `extract.sql` via: ``` $ build/sbt "sql/test-only *SQLQueryTestSuite -- -z extract.sql" ``` Closes #25438 from MaxGekk/extract-field-synonyms. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-13 15:36:28 -07:00
Liang-Chi Hsieh	e6a0385289	[SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause ## What changes were proposed in this pull request? A GROUPED_AGG pandas python udf does not work without a group by clause, as in `select udf(id) from table`. This is inconsistent with aggregate functions like sum and count, and also with the dataset API, e.g. `df.agg(udf(df['id']))`. When we parse a udf (or an aggregate function) like that from SQL syntax, it is treated as a function in a project. The `GlobalAggregates` rule in analysis turns such a project into an aggregate by looking for aggregate expressions. It should now also look for GROUPED_AGG pandas python udfs. ## How was this patch tested? Added tests. Closes #25352 from viirya/SPARK-28422. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-14 00:32:33 +09:00
Xingbo Jiang	3249c7ab49	[SPARK-28706][SQL] Allow cast null type to any types ## What changes were proposed in this pull request? #25242 proposed to disallow upcasting complex data types to string type, however, upcasting from null type to any types should still be safe. ## How was this patch tested? Add corresponding case in `CastSuite`. Closes #25425 from jiangxb1987/nullToString. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-13 19:02:04 +08:00
Yuming Wang	47af8925b6	[SPARK-28675][SQL] Remove maskCredentials and use redactOptions ## What changes were proposed in this pull request? This PR replaces `CatalogUtils.maskCredentials` with `SQLConf.get.redactOptions` to match other redacts. ## How was this patch tested? unit test and manual tests: Before this PR: ```sql spark-sql> DESC EXTENDED test_spark_28675; id int NULL # Detailed Table Information Database default Table test_spark_28675 Owner root Created Time Fri Aug 09 08:23:17 GMT-07:00 2019 Last Access Wed Dec 31 17:00:00 GMT-07:00 1969 Created By Spark 3.0.0-SNAPSHOT Type MANAGED Provider org.apache.spark.sql.jdbc Location file:/user/hive/warehouse/test_spark_28675 Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675'; default test_spark_28675 false Database: default Table: test_spark_28675 Owner: root Created Time: Fri Aug 09 08:23:17 GMT-07:00 2019 Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969 Created By: Spark 3.0.0-SNAPSHOT Type: MANAGED Provider: org.apache.spark.sql.jdbc Location: file:/user/hive/warehouse/test_spark_28675 Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties: [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] Schema: root \|-- id: integer (nullable = true) ``` After this PR: ```sql spark-sql> DESC EXTENDED test_spark_28675; id int NULL # Detailed Table Information Database default Table test_spark_28675 Owner root Created Time Fri Aug 09 08:19:49 GMT-07:00 2019 Last Access Wed Dec 31 17:00:00 GMT-07:00 1969 Created By Spark 3.0.0-SNAPSHOT Type MANAGED Provider 
org.apache.spark.sql.jdbc Location file:/user/hive/warehouse/test_spark_28675 Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [url=*******(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675'; default test_spark_28675 false Database: default Table: test_spark_28675 Owner: root Created Time: Fri Aug 09 08:19:49 GMT-07:00 2019 Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969 Created By: Spark 3.0.0-SNAPSHOT Type: MANAGED Provider: org.apache.spark.sql.jdbc Location: file:/user/hive/warehouse/test_spark_28675 Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties: [url=*******(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] Schema: root \|-- id: integer (nullable = true) ``` Closes #25395 from wangyum/SPARK-28675. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-10 16:45:59 -07:00
Maxim Gekk	924d794a6f	[SPARK-28656][SQL] Support `millennium`, `century` and `decade` at `extract()` ## What changes were proposed in this pull request? In the PR, I propose new expressions `Millennium`, `Century` and `Decade`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT): 1. `millennium` - the current millennium for given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_. 2. `century` - the current century for given date (or timestamp). The first century starts at 0001-01-01 AD. 3. `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10. Here are examples: ```sql spark-sql> SELECT EXTRACT(MILLENNIUM FROM DATE '1981-01-19'); 2 spark-sql> SELECT EXTRACT(CENTURY FROM DATE '1981-01-19'); 20 spark-sql> SELECT EXTRACT(DECADE FROM DATE '1981-01-19'); 198 ``` ## How was this patch tested? Added new tests to `DateExpressionsSuite` and uncommented existing tests in `pgSQL/date.sql`. Closes #25388 from MaxGekk/extract-ext2. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-09 11:18:50 -07:00
Shixiong Zhu	5bb69945e4	[SPARK-28651][SS] Force the schema of Streaming file source to be nullable ## What changes were proposed in this pull request? Right now, batch DataFrame always changes the schema to nullable automatically (See this line: `325bc8e9c6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L399)`). But the streaming file source is missing this. This PR updates the streaming file source schema to force it to be nullable. I also added a flag `spark.sql.streaming.fileSource.schema.forceNullable` to disable this change since some users may rely on the old behavior. ## How was this patch tested? The new unit test. Closes #25382 from zsxwing/SPARK-28651. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-09 18:54:55 +09:00
Maxim Gekk	997d153e54	[SPARK-28017][SQL] Support additional levels of truncations by DATE_TRUNC/TRUNC ## What changes were proposed in this pull request? I propose new levels of truncations for the `date_trunc()` and `trunc()` functions: 1. `MICROSECOND` and `MILLISECOND` truncate values of the `TIMESTAMP` type to microsecond and millisecond precision. 2. `DECADE`, `CENTURY` and `MILLENNIUM` truncate dates/timestamps to lowest date of current decade/century/millennium. Also the `WEEK` and `QUARTER` levels have been supported by the `trunc()` function. The function is implemented similarly to `date_trunc` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC to maintain feature parity with it. Here are examples of `TRUNC`: ```sql spark-sql> SELECT TRUNC('2015-10-27', 'DECADE'); 2010-01-01 spark-sql> set spark.sql.datetime.java8API.enabled=true; spark.sql.datetime.java8API.enabled true spark-sql> SELECT TRUNC('1999-10-27', 'millennium'); 1001-01-01 ``` Examples of `DATE_TRUNC`: ```sql spark-sql> SELECT DATE_TRUNC('CENTURY', '2015-03-05T09:32:05.123456'); 2001-01-01T00:00:00Z ``` ## How was this patch tested? Added new tests to `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`, and uncommented existing tests in `pgSQL/date.sql`. Closes #25336 from MaxGekk/date_truct-ext. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-09 12:29:44 +08:00
Burak Yavuz	c80430f5c9	[SPARK-28572][SQL] Simple analyzer checks for v2 table creation code paths ## What changes were proposed in this pull request? Adds checks around: - The existence of transforms in the table schema (even in nested fields) - Duplications of transforms - Case sensitivity checks around column names in the V2 table creation code paths. ## How was this patch tested? Unit tests. Closes #25305 from brkyvz/v2CreateTable. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-09 12:04:28 +08:00
Yuming Wang	3586cdd24d	[SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal ## What changes were proposed in this pull request? This PR makes `spark.sql.function.preferIntegralDivision` an internal configuration because it is only used for PostgreSQL test cases. More details: https://github.com/apache/spark/pull/25158#discussion_r309764541 ## How was this patch tested? N/A Closes #25376 from wangyum/SPARK-28395-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-08 10:42:24 +09:00
Gengliang Wang	c88df2ccf6	[SPARK-28331][SQL] Catalogs.load() should be able to load built-in catalogs ## What changes were proposed in this pull request? In `Catalogs.load`, the `pluginClassName` in the following code ``` String pluginClassName = conf.getConfString("spark.sql.catalog." + name, null); ``` is always null for built-in catalogs, e.g there is a SQLConf entry `spark.sql.catalog.session`. This is because of https://github.com/apache/spark/pull/18852: SQLConf.conf.getConfString(key, null) always returns null. ## How was this patch tested? Apply code changes of https://github.com/apache/spark/pull/24768 and tried loading session catalog. Closes #25094 from gengliangwang/fixCatalogLoad. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-08-07 16:14:34 -07:00
Marco Gaido	8617bf6ff8	[SPARK-28470][SQL] Cast to decimal throws ArithmeticException on overflow ## What changes were proposed in this pull request? The flag `spark.sql.decimalOperations.nullOnOverflow` is not honored by the `Cast` operator. This means that a cast which causes an overflow currently always returns `null`. The PR makes `Cast` respect that flag, i.e. when it is set to false and a decimal overflow occurs, an exception is thrown. ## How was this patch tested? Added UT Closes #25253 from mgaido91/SPARK-28470. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-08 08:10:21 +09:00
Wenchen Fan	469423f338	[SPARK-28595][SQL] explain should not trigger partition listing ## What changes were proposed in this pull request? Sometimes when you explain a query, you will get stuck for a while. What's worse, you will get stuck again if you explain again. This is caused by `FileSourceScanExec`: 1. In its `toString`, it needs to report the number of partitions it reads. This needs to query the hive metastore. 2. In its `outputOrdering`, it needs to get all the files. This needs to query the hive metastore. This PR fixes this as follows: 1. `toString` does not need to report the number of partitions it reads. We should report it via SQL metrics. 2. The `outputOrdering` is not very useful. We can only apply it if a) all the bucket columns are read. b) there is only one file in each bucket. This condition is really hard to meet, and even when it is met, sorting an already sorted file is pretty fast and avoiding the sort is not that useful. I think it's worth giving up this optimization so that explain does not get stuck. ## How was this patch tested? existing tests Closes #25328 from cloud-fan/ui. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-07 19:14:25 +08:00
mcheah	44e607e921	[SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables ## What changes were proposed in this pull request? Implements the `DESCRIBE TABLE` logical and physical plans for data source v2 tables. ## How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25040 from mccheah/describe-table-v2. Authored-by: mcheah <mcheah@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-07 14:26:45 +08:00
Nik Vanderhoof	9e931e787d	[SPARK-27905][SQL] Add higher order function 'forall' ## What changes were proposed in this pull request? Adds the higher order function `forall`, which tests an array to see if a predicate holds for every element. The function is implemented in `org.apache.spark.sql.catalyst.expressions.ArrayForAll`. The function is added to the function registry under the pretty name `forall`. ## How was this patch tested? I've added appropriate unit tests for the new ArrayForAll expression in `sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala`. Also added tests for the function in `sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala`. Not sure who is best to ask about this PR so: HyukjinKwon rxin gatorsmile ueshin srowen hvanhovell gatorsmile Closes #24761 from nvander1/feature/for_all. Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Co-authored-by: Nik <nikolasrvanderhoof@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-08-06 14:25:53 -07:00
Maxim Gekk	9e3aab8b95	[SPARK-28623][SQL] Support `dow`, `isodow` and `doy` by `extract()` ## What changes were proposed in this pull request? In the PR, I propose to use existing expressions `DayOfYear`, `WeekDay` and `DayOfWeek`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT): 1. `dow` - the day of the week as Sunday (0) to Saturday (6) 2. `isodow` - the day of the week as Monday (1) to Sunday (7) 3. `doy` - the day of the year (1 - 365/366) Here are examples: ```sql spark-sql> SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40'); 5 spark-sql> SELECT EXTRACT(ISODOW FROM TIMESTAMP '2001-02-18 20:38:40'); 7 spark-sql> SELECT EXTRACT(DOY FROM TIMESTAMP '2001-02-16 20:38:40'); 47 ``` ## How was this patch tested? Updated `extract.sql`. Closes #25367 from MaxGekk/extract-ext. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-06 13:39:49 -07:00
HyukjinKwon	bab88c48b1	[SPARK-28622][SQL][PYTHON] Rename PullOutPythonUDFInJoinCondition to ExtractPythonUDFFromJoinCondition and move to 'Extract Python UDFs' ## What changes were proposed in this pull request? This PR renames `PullOutPythonUDFInJoinCondition` to `ExtractPythonUDFFromJoinCondition` and moves it to 'Extract Python UDFs' together with the other Python UDF related rules. Currently the `PullOutPythonUDFInJoinCondition` rule sits alone, outside the other 'Extract Python UDFs' rules, and the name `ExtractPythonUDFFromJoinCondition` matches the existing Python UDF extraction rules. ## How was this patch tested? Existing tests should cover. Closes #25358 from HyukjinKwon/move-python-join-rule. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-08-05 23:36:35 -07:00
Jungtaek Lim (HeartSaVioR)	128ea37bda	[SPARK-28601][CORE][SQL] Use StandardCharsets.UTF_8 instead of "UTF-8" string representation, and get rid of UnsupportedEncodingException ## What changes were proposed in this pull request? This patch tries to keep consistency whenever UTF-8 charset is needed, as using `StandardCharsets.UTF_8` instead of using "UTF-8". If the String type is needed, `StandardCharsets.UTF_8.name()` is used. This change also brings the benefit of getting rid of `UnsupportedEncodingException`, as we're providing `Charset` instead of `String` whenever possible. This also changes some private Catalyst helper methods to operate on encodings as `Charset` objects rather than strings. ## How was this patch tested? Existing unit tests. Closes #25335 from HeartSaVioR/SPARK-28601. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-05 20:45:54 -07:00
Wenchen Fan	6fb79af48c	[SPARK-28344][SQL] detect ambiguous self-join and fail the query ## What changes were proposed in this pull request? This is an alternative solution of https://github.com/apache/spark/pull/24442 . It fails the query if ambiguous self join is detected, instead of trying to disambiguate it. The problem is that, it's hard to come up with a reasonable rule to disambiguate, the rule proposed by #24442 is mostly a heuristic. ### background of the self-join problem: This is a long-standing bug and I've seen many people complaining about it in JIRA/dev list. A typical example: ``` val df1 = … val df2 = df1.filter(...) df1.join(df2, df1("a") > df2("a")) // returns empty result ``` The root cause is, `Dataset.apply` is so powerful that users think it returns a column reference which can point to the column of the Dataset at anywhere. This is not true in many cases. `Dataset.apply` returns an `AttributeReference` . Different Datasets may share the same `AttributeReference`. In the example above, `df2` adds a Filter operator above the logical plan of `df1`, and the Filter operator reserves the output `AttributeReference` of its child. This means, `df1("a")` is exactly the same as `df2("a")`, and `df1("a") > df2("a")` always evaluates to false. ### The rule to detect ambiguous column reference caused by self join: We can reuse the infra in #24442 : 1. each Dataset has a globally unique id. 2. the `AttributeReference` returned by `Dataset.apply` carries the ID and column position(e.g. 3rd column of the Dataset) via metadata. 3. the logical plan of a `Dataset` carries the ID via `TreeNodeTag` When self-join happens, the analyzer asks the right side plan of join to re-generate output attributes with new exprIds. Based on it, a simple rule to detect ambiguous self join is: 1. find all column references (i.e. `AttributeReference`s with Dataset ID and col position) in the root node of a query plan. 2. 
for each column reference, traverse the query plan tree, find a sub-plan that carries Dataset ID and the ID is the same as the one in the column reference. 3. get the corresponding output attribute of the sub-plan by the col position in the column reference. 4. if the corresponding output attribute has a different exprID than the column reference, then it means this sub-plan is on the right side of a self-join and has regenerated its output attributes. This is an ambiguous self join because the column reference points to a table being self-joined. ## How was this patch tested? existing tests and new test cases Closes #25107 from cloud-fan/new-self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-06 10:06:36 +08:00
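The four detection steps in SPARK-28344 can be sketched on a toy plan tree. This is a minimal illustration only, not Catalyst code: `ColumnRef`, `Plan`, `find_tagged` and `ambiguous_refs` are hypothetical names, and the integer `dataset_id` field stands in for the `TreeNodeTag` the commit mentions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ColumnRef:
    """Step 1: a column reference carrying its exprId, Dataset ID and column position."""
    expr_id: int
    dataset_id: int
    col_pos: int


@dataclass
class Plan:
    """A toy logical plan node; dataset_id is a stand-in for the Dataset ID tag."""
    output_expr_ids: List[int]
    dataset_id: Optional[int] = None
    children: List["Plan"] = field(default_factory=list)


def find_tagged(plan: Plan, dataset_id: int) -> Optional[Plan]:
    """Step 2: traverse the plan and find a sub-plan carrying the given Dataset ID."""
    if plan.dataset_id == dataset_id:
        return plan
    for child in plan.children:
        found = find_tagged(child, dataset_id)
        if found is not None:
            return found
    return None


def ambiguous_refs(root: Plan, refs: List[ColumnRef]) -> List[ColumnRef]:
    """Steps 3-4: flag references whose sub-plan attribute now has a different exprId."""
    flagged = []
    for ref in refs:
        sub = find_tagged(root, ref.dataset_id)
        if sub is not None and sub.output_expr_ids[ref.col_pos] != ref.expr_id:
            flagged.append(ref)
    return flagged


# df1 (Dataset 1) outputs attribute 10; df2 = df1.filter(...) (Dataset 2) shares it.
# In the join, the right side (df2) has regenerated its output as attribute 11.
join = Plan([10, 11], children=[Plan([10], dataset_id=1), Plan([11], dataset_id=2)])
refs = [ColumnRef(10, 1, 0), ColumnRef(10, 2, 0)]  # df1("a"), df2("a")
print(ambiguous_refs(join, refs))  # only the df2("a") reference is flagged
```

In this toy model only `df2("a")` is flagged, because its sub-plan sits on the right side of the join and no longer exposes attribute 10.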
Ryan Blue	0345f1174d	[SPARK-27661][SQL] Add SupportsNamespaces API ## What changes were proposed in this pull request? This adds an interface for catalog plugins that exposes namespace operations: * `listNamespaces` * `namespaceExists` * `loadNamespaceMetadata` * `createNamespace` * `alterNamespace` * `dropNamespace` ## How was this patch tested? API only. Existing tests for regressions. Closes #24560 from rdblue/SPARK-27661-add-catalog-namespace-api. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-08-04 21:29:40 -07:00
Xiao Li	10d4ffd577	[SPARK-28532][SPARK-28530][SQL][FOLLOWUP] Inline doc for FixedPoint(1) batches "Subquery" and "Join Reorder" ## What changes were proposed in this pull request? Explained why "Subquery" and "Join Reorder" optimization batches should be `FixedPoint(1)`, which was introduced in SPARK-28532 and SPARK-28530. ## How was this patch tested? Existing UTs. Closes #25320 from yeshengm/SPARK-28530-followup. Lead-authored-by: Xiao Li <gatorsmile@gmail.com> Co-authored-by: Yesheng Ma <kimi.ysma@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-08-02 14:23:41 -07:00
Sean Owen	b148bd5ccb	[SPARK-28519][SQL] Use StrictMath log, pow functions for platform independence ## What changes were proposed in this pull request? See discussion on the JIRA (and dev). At heart, we find that math.log and math.pow can actually return slightly different results across platforms because of hardware optimizations. For the actual SQL log and pow functions, I propose that we should use StrictMath instead to ensure the answers are always the same. (This should have the benefit of helping tests pass on aarch64.) Further, the atanh function (which is not part of java.lang.Math) can be implemented in a slightly different and more accurate way. ## How was this patch tested? Existing tests (which will need to be changed). Some manual testing locally to understand the numeric issues. Closes #25279 from srowen/SPARK-28519. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-02 10:55:44 -05:00
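The commit message does not say which atanh formulation was adopted. As an assumption, the sketch below contrasts the textbook logarithmic identity with a `log1p`-based variant, which is one common way to gain accuracy for inputs near zero. Both function names are hypothetical.

```python
import math


def atanh_naive(x: float) -> float:
    """Textbook identity; loses all precision when 1 + x rounds to 1.0."""
    return 0.5 * math.log((1.0 + x) / (1.0 - x))


def atanh_log1p(x: float) -> float:
    """Assumed alternative: log1p avoids forming 1 + x explicitly for tiny x."""
    return 0.5 * (math.log1p(x) - math.log1p(-x))


tiny = 1e-20
print(atanh_naive(tiny))   # collapses to 0.0
print(atanh_log1p(tiny))   # stays close to 1e-20
```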
Liang-Chi Hsieh	77c7e91e02	[SPARK-28445][SQL][PYTHON] Fix error when PythonUDF is used in both group by and aggregate expression ## What changes were proposed in this pull request? When a PythonUDF is used in group by and also appears in the aggregate expression, like ``` SELECT pyUDF(a + 1), COUNT(b) FROM testData GROUP BY pyUDF(a + 1) ``` it causes an analysis exception in `CheckAnalysis`, like ``` org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function. ``` First, `CheckAnalysis` can't check semantic equality between PythonUDFs. Second, even if we make that possible, a runtime exception will be thrown ``` org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF1#8615 ... Cause: java.lang.RuntimeException: Couldn't find pythonUDF1#8615 in [cast(pythonUDF0#8614 as int)#8617,count(b#8599)#8607L] ``` The cause is that `ExtractPythonUDFs` extracts the PythonUDFs in both the group by and the aggregate expression. The PythonUDFs are now two different aliases in the logical aggregate. At runtime, we can't bind the resulting expression in the aggregate to its grouping and aggregate attributes. This patch proposes a rule `ExtractGroupingPythonUDFFromAggregate` to extract PythonUDFs in group by and evaluate them before the aggregate. We replace the group by PythonUDF in the aggregate expression with the aliased result. 
The query plan of query `SELECT pyUDF(a + 1), pyUDF(COUNT(b)) FROM testData GROUP BY pyUDF(a + 1)`, like ``` == Optimized Logical Plan == Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L] +- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616] +- Aggregate [cast(groupingPythonUDF#8614 as int)], [cast(groupingPythonUDF#8614 as int) AS CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, count(b#8599) AS agg#8613L] +- Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599] +- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615] +- LocalRelation [a#8598, b#8599] == Physical Plan == (3) Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L] +- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616] +- (2) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int)#8617], functions=[count(b#8599)], output=[CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, agg#8613L]) +- Exchange hashpartitioning(cast(groupingPythonUDF#8614 as int)#8617, 5), true +- (1) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int) AS cast(groupingPythonUDF#8614 as int)#8617], functions=[partial_count(b#8599)], output=[cast(groupingPythonUDF#8614 as int)#8617, count#8619L]) +- (1) Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599] +- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615] +- LocalTableScan [a#8598, b#8599] ``` ## How was this patch tested? Added tests. Closes #25215 from viirya/SPARK-28445. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-02 19:47:29 +09:00
Yuming Wang	4e7a4cd20e	[SPARK-28521][SQL] Fix error message for built-in functions ## What changes were proposed in this pull request? ```sql spark-sql> select cast(1); 19/07/26 00:54:17 ERROR SparkSQLDriver: Failed in [select cast(1)] java.lang.UnsupportedOperationException: empty.init at scala.collection.TraversableLike$class.init(TraversableLike.scala:451) at scala.collection.mutable.ArrayOps$ofInt.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:234) at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:135) at scala.collection.mutable.ArrayOps$ofInt.init(ArrayOps.scala:234) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:565) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:558) at scala.Option.getOrElse(Option.scala:121) ``` The reason is that we did not handle the case [`validParametersCount.length == 0`](`2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L588)`) because the [parameter types](`2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L589)`) can be `Expression`, `DataType` and `Option`. This PR makes it handle the case `validParametersCount.length == 0`. ## How was this patch tested? unit tests Closes #25261 from wangyum/SPARK-28521. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-01 18:02:50 -05:00
Marco Gaido	ee41001949	[SPARK-26218][SQL] Overflow on arithmetic operations returns incorrect result ## What changes were proposed in this pull request? When an overflow occurs performing an arithmetic operation, we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard. ## How was this patch tested? added UT + existing UTs (improved) Closes #21599 from mgaido91/SPARK-24598. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-01 14:51:38 +08:00
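The behaviour change in SPARK-26218 can be illustrated with 32-bit integer addition. This is a sketch under stated assumptions: Python integers never overflow, so the bounds are modelled explicitly, and both helper names and the choice of `ArithmeticError` are hypothetical rather than taken from the patch.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def wrapping_add(a: int, b: int) -> int:
    """Two's-complement wrap-around: silently returns an incorrect value on overflow."""
    return (a + b - INT_MIN) % 2**32 + INT_MIN


def checked_add(a: int, b: int) -> int:
    """Raises instead of returning a wrapped result when the sum leaves the 32-bit range."""
    result = a + b
    if not INT_MIN <= result <= INT_MAX:
        raise ArithmeticError("integer overflow")
    return result


print(wrapping_add(INT_MAX, 1))  # wraps around to INT_MIN
print(checked_add(1, 2))         # in range, so the plain sum is returned
```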
Yuming Wang	3002a3bf3c	[SPARK-28581][SQL] Replace _FUNC_ in UDF ExpressionInfo ## What changes were proposed in this pull request? This PR moves `replaceFunctionName(usage: String, functionName: String)` from `DescribeFunctionCommand` to `ExpressionInfo` so that `ExpressionInfo` returns the actual function name instead of the `_FUNC_` placeholder. We can get `ExpressionInfo`s directly through the `SessionCatalog.lookupFunctionInfo` API and get the real names. ## How was this patch tested? unit tests Closes #25314 from wangyum/SPARK-28581. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-31 13:08:49 -07:00
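The placeholder substitution described in SPARK-28581 amounts to a string replacement. The sketch below is illustrative only: the Python function mirrors the name `replaceFunctionName` from the commit, but its body is an assumption, not the Spark implementation.

```python
def replace_function_name(usage: str, function_name: str) -> str:
    """Substitute the _FUNC_ placeholder in a usage string with the real function name."""
    return usage.replace("_FUNC_", function_name)


print(replace_function_name("_FUNC_([seed]) - Returns a random value.", "random"))
```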
gengjiaan	d03ec65f01	[SPARK-27924][SQL] Support ANSI SQL Boolean-Predicate syntax ## What changes were proposed in this pull request? This PR aims to support ANSI SQL `Boolean-Predicate` syntax. ```sql expression IS [NOT] TRUE expression IS [NOT] FALSE expression IS [NOT] UNKNOWN ``` Several mainstream databases support this syntax: - PostgreSQL: https://www.postgresql.org/docs/9.1/functions-comparison.html - Hive: https://issues.apache.org/jira/browse/HIVE-13583 - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html - Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm For example: ```sql spark-sql> select null is true, null is not true; false true spark-sql> select false is true, false is not true; false true spark-sql> select true is true, true is not true; true false spark-sql> select null is false, null is not false; false true spark-sql> select false is false, false is not false; true false spark-sql> select true is false, true is not false; false true spark-sql> select null is unknown, null is not unknown; true false spark-sql> select false is unknown, false is not unknown; false true spark-sql> select true is unknown, true is not unknown; false true ``` Note: A null input is treated as the logical value "unknown". ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #25074 from beliefer/ansi-sql-boolean-test. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-30 23:59:50 -07:00
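The truth table in SPARK-27924 can be reproduced with a small model in which `None` plays the role of SQL `NULL` (the logical value "unknown"). The helper names are hypothetical. The point of the sketch is that these predicates always return a definite boolean, never unknown.

```python
from typing import Optional


def is_true(value: Optional[bool]) -> bool:
    """expression IS TRUE"""
    return value is True


def is_false(value: Optional[bool]) -> bool:
    """expression IS FALSE"""
    return value is False


def is_unknown(value: Optional[bool]) -> bool:
    """expression IS UNKNOWN (a null input is treated as unknown)"""
    return value is None


# Print each predicate next to its IS NOT counterpart, mirroring the spark-sql session.
for v in (None, False, True):
    for pred in (is_true, is_false, is_unknown):
        print(v, pred.__name__, pred(v), not pred(v))
```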
Dilip Biswal	ee3c1c777d	[SPARK-28375][SQL] Make pullupCorrelatedPredicate idempotent ## What changes were proposed in this pull request? This PR makes the optimizer rule PullupCorrelatedPredicates idempotent. ## How was this patch tested? A new test PullupCorrelatedPredicatesSuite Closes #25268 from dilipbiswal/pr-25164. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-07-30 16:29:24 -07:00
Maxim Gekk	caa23e3efd	[SPARK-28459][SQL] Add `make_timestamp` function ## What changes were proposed in this pull request? New function `make_timestamp()` takes 6 columns `year`, `month`, `day`, `hour`, `min`, `sec` + optionally `timezone`, and makes a new column of the `TIMESTAMP` type. If values in the input columns are `null` or out of valid ranges, the function returns `null`. Valid ranges are: - `year` - `[1, 9999]` - `month` - `[1, 12]` - `day` - `[1, 31]` - `hour` - `[0, 23]` - `min` - `[0, 59]` - `sec` - `[0, 60]`. If the `sec` argument equals 60, the seconds field is set to 0 and 1 minute is added to the final timestamp. - `timezone` - an identifier of timezone. The actual database of timezones can be found here: https://www.iana.org/time-zones. The constructed timestamp must also be valid; otherwise `make_timestamp` returns `null`. The function is implemented similarly to `make_timestamp` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html to maintain feature parity with it. Here is an example: ```sql select make_timestamp(2014, 12, 28, 6, 30, 45.887); 2014-12-28 06:30:45.887 select make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET'); 2014-12-28 10:30:45.887 select make_timestamp(2019, 6, 30, 23, 59, 60) 2019-07-01 00:00:00 ``` The returned value has the Spark Catalyst type `TIMESTAMP`, which is similar to Oracle's `TIMESTAMP WITH LOCAL TIME ZONE` (see https://docs.oracle.com/cd/B28359_01/server.111/b28298/ch4datetime.htm#i1006169) where data is stored in the session time zone, and the time zone offset is not stored as part of the column data. When users retrieve the data, Spark returns it in the session time zone specified by the SQL config `spark.sql.session.timeZone`. ## How was this patch tested? Added new tests to `DateExpressionsSuite`, and uncommented a test for `make_timestamp` in `pgSQL/timestamp.sql`. Closes #25220 from MaxGekk/make_timestamp. 
Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-29 11:00:08 -07:00
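The validation and `sec = 60` rollover rules described for `make_timestamp` can be sketched with the standard `datetime` module. This is a simplified model, not the Spark expression: it omits the optional `timezone` argument and session time zone handling, and it relies on `datetime` itself to reject out-of-range fields and invalid dates.

```python
from datetime import datetime, timedelta
from typing import Optional


def make_timestamp(year, month, day, hour, minute, sec) -> Optional[datetime]:
    """Return a timestamp, or None for null inputs, out-of-range fields or invalid dates."""
    if any(arg is None for arg in (year, month, day, hour, minute, sec)):
        return None
    if not 0 <= sec <= 60:
        return None
    carry = timedelta(0)
    if sec == 60:
        # Seconds field becomes 0 and one minute is added to the final timestamp.
        sec, carry = 0, timedelta(minutes=1)
    whole = int(sec)
    micros = round((sec - whole) * 1_000_000)
    try:
        base = datetime(year, month, day, hour, minute, whole)
        return base + timedelta(microseconds=micros) + carry
    except (ValueError, OverflowError):
        # datetime rejects e.g. month 13 or February 30; treat these as null.
        return None


print(make_timestamp(2014, 12, 28, 6, 30, 45.887))
print(make_timestamp(2019, 6, 30, 23, 59, 60))
print(make_timestamp(2019, 2, 30, 0, 0, 0))
```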
Lee Dongjin	d98aa2a184	[MINOR] Trivial cleanups These are what I found during working on #22282. - Remove unused value: `UnsafeArraySuite#defaultTz` - Remove redundant new modifier to the case class, `KafkaSourceRDDPartition` - Remove unused variables from `RDD.scala` - Remove trailing space from `structured-streaming-kafka-integration.md` - Remove redundant parameter from `ArrowConvertersSuite`: `nullable` is `true` by default. - Remove leading empty line: `UnsafeRow` - Remove trailing empty line: `KafkaTestUtils` - Remove unthrown exception type: `UnsafeMapData` - Replace unused declarations: `expressions` - Remove duplicated default parameter: `AnalysisErrorSuite` - `ObjectExpressionsSuite`: remove duplicated parameters, conversions and unused variable Closes #25251 from dongjinleekr/cleanup/201907. Authored-by: Lee Dongjin <dongjin@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-29 23:38:02 +09:00
Dongjoon Hyun	18156d5503	[SPARK-28086][SQL] Add a function alias `random` for Rand ## What changes were proposed in this pull request? This PR aims to add a SQL function alias `random` to the existing `rand` function. Please note that this adds the alias to SQL layer only because this is for PostgreSQL feature parity. - [PostgreSQL Random function](https://www.postgresql.org/docs/11/functions-math.html) - [SPARK-23160 Port window.sql](https://github.com/apache/spark/pull/24881/files#diff-14489bae6b27814d4cde0456a7ae75c8R702) - [SPARK-28406 Port union.sql](https://github.com/apache/spark/pull/25163/files#diff-23a3430e0e1ff88830cbb43701da1f2cR402) ## How was this patch tested? Manual. ```sql spark-sql> DESCRIBE FUNCTION random; Function: random Class: org.apache.spark.sql.catalyst.expressions.Rand Usage: random([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1). ``` Closes #25282 from dongjoon-hyun/SPARK-28086. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-29 20:17:30 +09:00
Maxim Gekk	a5a5da78cf	[SPARK-28471][SQL] Replace `yyyy` by `uuuu` in date-timestamp patterns without era ## What changes were proposed in this pull request? In the PR, I propose to use `uuuu` for years instead of `yyyy` in date/timestamp patterns without the era pattern `G` (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). Parsing/formatting of positive years (current era) will be the same. The difference is in formatting negative years, which belong to the previous era, BC (Before Christ). I replaced the `yyyy` pattern by `uuuu` everywhere except: 1. Test, Suite & Benchmark. Existing tests must work as is. 2. `SimpleDateFormat` because it doesn't support the `uuuu` pattern. 3. Comments and examples (except comments related to already replaced patterns). Before the changes, the common-era year `100` and the BC-era year `-99` were both shown as `100`. After the changes, negative years are formatted with the `-` sign. Before: ```Scala scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show +----------+ | value| +----------+ |0100-01-01| +----------+ ``` After: ```Scala scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show +-----------+ | value| +-----------+ |-0099-01-01| +-----------+ ``` ## How was this patch tested? By existing test suites, and added tests for negative years to `DateFormatterSuite` and `TimestampFormatterSuite`. Closes #25230 from MaxGekk/year-pattern-uuuu. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-28 20:36:36 -07:00
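Why `-99` and `100` collide under `yyyy` can be shown by modelling only the year field. The sketch assumes the usual proleptic numbering in which year 0 is 1 BC, so the year-of-era for a non-positive year `y` is `1 - y`. Both function names are hypothetical and nothing here calls `DateTimeFormatter`.

```python
def format_year_of_era(proleptic_year: int) -> str:
    """Model of `yyyy` without an era marker: prints the year-of-era, dropping BC/AD."""
    year_of_era = proleptic_year if proleptic_year >= 1 else 1 - proleptic_year
    return f"{year_of_era:04d}"


def format_proleptic_year(proleptic_year: int) -> str:
    """Model of `uuuu`: prints the signed proleptic year, so BC years stay distinct."""
    sign = "-" if proleptic_year < 0 else ""
    return f"{sign}{abs(proleptic_year):04d}"


for year in (100, -99):
    print(year, format_year_of_era(year), format_proleptic_year(year))
```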
Dongjoon Hyun	a428f40669	[SPARK-28549][BUILD][CORE][SQL] Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils` ## What changes were proposed in this pull request? `org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago at [LANG-1316](https://issues.apache.org/jira/browse/LANG-1316). There have been no bug fixes since then. ```java /** * <p>Escapes and unescapes {@code String}s for * Java, Java Script, HTML and XML.</p> * * <p>#ThreadSafe#</p> * @since 2.0 * @deprecated as of 3.6, use commons-text * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html"> * StringEscapeUtils</a> instead */ @Deprecated public class StringEscapeUtils { ``` This PR aims to use the latest one from the `commons-text` module, which has more bug fixes like [TEXT-100](https://issues.apache.org/jira/browse/TEXT-100), [TEXT-118](https://issues.apache.org/jira/browse/TEXT-118) and [TEXT-120](https://issues.apache.org/jira/browse/TEXT-120), by the following replacement. ```scala -import org.apache.commons.lang3.StringEscapeUtils +import org.apache.commons.text.StringEscapeUtils ``` This will add a new dependency to the `hadoop-2.7` profile distribution. In the `hadoop-3.2` profile, we already have it. ``` +commons-text-1.6.jar ``` ## How was this patch tested? Pass the Jenkins with the existing tests. - [Hadoop 2.7](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108281) - [Hadoop 3.2](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108282) Closes #25281 from dongjoon-hyun/SPARK-28549. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-29 11:45:29 +09:00
Yuming Wang	9eb541be22	[SPARK-28424][SQL] Support typed interval expression ## What changes were proposed in this pull request? This PR adds support for typed `interval` expressions: ```sql spark-sql> select interval 'interval 3 year 1 hour'; interval 3 years 1 hours spark-sql> ``` Please note that this PR did not add a cast alias for the `interval` type like [other types](`2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L529-L541)`) because neither PostgreSQL nor Hive supports this syntax. ## How was this patch tested? unit tests Closes #25241 from wangyum/SPARK-28424. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-27 14:25:35 -07:00
Yesheng Ma	d4e246658a	[SPARK-28530][SQL] Cost-based join reorder optimizer batch should be FixedPoint(1) ## What changes were proposed in this pull request? Since for AQP the cost for joins can change between multiple runs, there is no reason that we have an idempotence enforcement on this optimizer batch. We thus make it `FixedPoint(1)` instead of `Once`. ## How was this patch tested? Existing UTs. Closes #25266 from yeshengm/SPARK-28530. Lead-authored-by: Yesheng Ma <kimi.ysma@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-07-26 22:57:39 -07:00
Yesheng Ma	e037a11494	[SPARK-28532][SQL] Make optimizer batch "subquery" FixedPoint(1) ## What changes were proposed in this pull request? In the Catalyst optimizer, the batch subquery actually calls the optimizer recursively. Therefore it makes no sense to enforce idempotence on it and we change this batch to `FixedPoint(1)`. ## How was this patch tested? Existing UTs. Closes #25267 from yeshengm/SPARK-28532. Authored-by: Yesheng Ma <kimi.ysma@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-07-26 22:48:42 -07:00
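The difference between `Once` and `FixedPoint(1)` in SPARK-28530 and SPARK-28532 can be sketched with a toy batch runner. The commits do not describe how Catalyst enforces idempotence, so this sketch assumes a check that re-runs the batch and compares results. `run_batch` and the integer "plan" are hypothetical.

```python
from typing import Callable, List

Rule = Callable[[int], int]


def run_batch(plan: int, rules: List[Rule], strategy: str) -> int:
    """Apply every rule once; under "Once", additionally require a second run to be a no-op."""
    result = plan
    for rule in rules:
        result = rule(result)
    if strategy == "Once":
        again = result
        for rule in rules:
            again = rule(again)
        if again != result:
            raise AssertionError("Once batch is not idempotent")
    return result


def non_idempotent(plan: int) -> int:
    """Stands in for a rule whose output can differ between runs (e.g. cost-based reordering)."""
    return plan + 1


print(run_batch(0, [non_idempotent], "FixedPoint(1)"))  # single pass, no idempotence check
```

Under this assumption, running the same rule with the `"Once"` strategy raises, which is the kind of enforcement the two commits avoid by switching strategies.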
Liang-Chi Hsieh	558dd23601	[SPARK-28441][SQL][PYTHON] Fix error when non-foldable expression is used in correlated scalar subquery ## What changes were proposed in this pull request? In SPARK-15370, we checked the expression at the root of the correlated subquery in order to fix the count bug. If a `PythonUDF` is in the checking path, evaluating it causes a failure, as we can't statically evaluate `PythonUDF`. The Python UDF test added at SPARK-28277 shows this issue. If we can statically evaluate the expression, we intercept NULL values coming from the outer join and replace them with the value that the subquery's expression evaluates to, as before; if we cannot, we replace them with the `PythonUDF` expression, with statically evaluated parameters. After this, the last query in `udf-except.sql`, which throws `java.lang.UnsupportedOperationException`, can be run: ``` SELECT t1.k FROM t1 WHERE t1.v <= (SELECT udf(max(udf(t2.v))) FROM t2 WHERE udf(t2.k) = udf(t1.k)) MINUS SELECT t1.k FROM t1 WHERE udf(t1.v) >= (SELECT min(udf(t2.v)) FROM t2 WHERE t2.k = t1.k) -- !query 2 schema struct<k:string> -- !query 2 output two ``` Note that this issue also applies to other non-foldable expressions, like rand. As with PythonUDF, we can't call `eval` on this kind of expression during optimization. The evaluation needs to be deferred to query runtime. ## How was this patch tested? Added tests. Closes #25204 from viirya/SPARK-28441. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-07-27 10:38:34 +08:00

3737 commits