Commit graph

6291 commits

Author SHA1 Message Date
Yuming Wang 56dcd79992 Preparing development version 3.0.1-SNAPSHOT 2019-12-17 01:57:27 +00:00
Yuming Wang c216ef1d03 Preparing Spark release v3.0.0-preview2-rc2 2019-12-17 01:57:21 +00:00
Maxim Gekk b03ce63c05 [SPARK-30258][TESTS] Eliminate warnings of deprecated Spark APIs in tests
### What changes were proposed in this pull request?
In the PR, I propose to move all tests that use deprecated Spark APIs to separate test classes, and add the annotation:
```scala
deprecated("This test suite will be removed.", "3.0.0")
```
The annotation suppresses warnings from already deprecated methods and classes.
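
As a minimal illustration of the mechanism (not the actual Spark suites; the names here are made up), annotating the enclosing definition with `@deprecated` is what silences deprecation warnings for calls inside its body:
```scala
// Hypothetical example: the enclosing object is marked @deprecated, so the
// compiler does not warn about the deprecated call inside callOldApi().
@deprecated("Exercises deprecated APIs on purpose.", "3.0.0")
object LegacyApiProbe {
  @deprecated("Use newApi instead.", "3.0.0")
  def oldApi(x: Int): Int = x + 1

  def callOldApi(): Int = oldApi(41) // no deprecation warning emitted here
}
```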

### Why are the changes needed?
The warnings about deprecated Spark APIs in tests do not indicate any issues because the tests use such APIs intentionally. Eliminating these warnings makes it easier to spot other warnings that could point to real problems.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites and by
- DeprecatedAvroFunctionsSuite
- DeprecatedDateFunctionsSuite
- DeprecatedDatasetAggregatorSuite
- DeprecatedStreamingAggregationSuite
- DeprecatedWholeStageCodegenSuite

Closes #26885 from MaxGekk/eliminate-deprecate-warnings.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-16 18:24:32 -06:00
Niranjan Artal dddfeca175 [SPARK-30209][SQL][WEB-UI] Display stageId, attemptId and taskId for max metrics in Spark UI
### What changes were proposed in this pull request?

SPARK-30209 discusses adding additional metrics such as stageId, attemptId and taskId for max metrics. We already have the data required to display in LiveStageMetrics; we need to capture these metrics and pass them on to the UI. To minimize memory used for variables, we save the maximum of each metric id per stage, so the additional memory usage per stage is (#metrics * 4 * sizeof(Long)).
The max for each metric id is then calculated across all stages and passed to the stringValue method. Memory used is minimal. We ran the benchmark for runtime: Stage.Proc time has increased to around 1.5-2.5x, but the Aggregate time has decreased.
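
A simplified sketch of the per-stage bookkeeping described above (types and names are illustrative, not the actual `LiveStageMetrics` code):
```scala
import scala.collection.mutable

// For each metric id, remember the value and the (stageId, attemptId, taskId)
// that produced the maximum seen so far within one stage.
case class MaxInfo(value: Long, stageId: Int, attemptId: Int, taskId: Long)

class StageMaxTracker(stageId: Int, attemptId: Int) {
  private val maxPerMetric = mutable.Map.empty[Long, MaxInfo]

  def update(metricId: Long, value: Long, taskId: Long): Unit = {
    if (maxPerMetric.get(metricId).forall(_.value < value)) {
      maxPerMetric(metricId) = MaxInfo(value, stageId, attemptId, taskId)
    }
  }

  def maxFor(metricId: Long): Option[MaxInfo] = maxPerMetric.get(metricId)
}
```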

### Why are the changes needed?

The additional metrics stageId, attemptId and taskId could help debug jobs more quickly. For a given operator, it becomes easy to identify, from the SQL tab itself, the task that takes the maximum time to complete.

### Does this PR introduce any user-facing change?

Yes. stageId, attemptId and taskId are shown only for executor-side metrics. For driver metrics, "(driver)" is displayed on the UI.
![image (3)](https://user-images.githubusercontent.com/50492963/70763041-929d9980-1d07-11ea-940f-88ac6bdce9b5.png)

"Driver"
![image (4)](https://user-images.githubusercontent.com/50492963/70763043-94675d00-1d07-11ea-95ab-3478728cb435.png)

### How was this patch tested?

Manually tested, ran benchmark script for runtime.

Closes #26843 from nartal1/SPARK-30209.

Authored-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2019-12-16 15:27:34 -06:00
HyukjinKwon 23b1312324 [SPARK-30200][DOCS][FOLLOW-UP] Add documentation for explain(mode: String)
### What changes were proposed in this pull request?

This PR adds the documentation of the new `mode` added to `Dataset.explain`.
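
For context, a quick usage sketch of the string-mode overload being documented (assuming a running `SparkSession` named `spark`; mode names as documented for Spark 3.0):
```scala
val df = spark.range(10).groupBy("id").count()

df.explain()             // default, same as explain("simple")
df.explain("extended")   // parsed, analyzed, optimized and physical plans
df.explain("cost")       // optimized plan with statistics, if available
df.explain("formatted")  // plan outline plus a details section per node
```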

### Why are the changes needed?

To let users know the new modes.

### Does this PR introduce any user-facing change?

No (doc-only change).

### How was this patch tested?

Manually built the doc:
![Screen Shot 2019-12-16 at 3 34 28 PM](https://user-images.githubusercontent.com/6477701/70884617-d64f1680-2019-11ea-9336-247ade7f8768.png)

Closes #26903 from HyukjinKwon/SPARK-30200-doc.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-16 21:35:37 +09:00
Wenchen Fan fdcd0e71b9 [SPARK-30192][SQL] support column position in DS v2
### What changes were proposed in this pull request?

Update the DS v2 API to support add/alter column with a column position.

### Why are the changes needed?

We have a parser rule for column position, but we fail the query if it's specified, because the builtin catalog can't support add/alter column with column position.

Since we have the catalog plugin API now, we should let the catalog implementation decide whether it supports column position or not.
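
For illustration, the kind of DDL this enables against a hypothetical v2 catalog `testcat` (exact syntax follows the existing parser rule mentioned above; whether the position is honored is up to the catalog implementation):
```scala
// "testcat" and the table/column names are made up for illustration.
spark.sql("CREATE TABLE testcat.db.t (a INT, c STRING) USING foo")
spark.sql("ALTER TABLE testcat.db.t ADD COLUMNS (b INT AFTER a)")
spark.sql("ALTER TABLE testcat.db.t ADD COLUMNS (id BIGINT FIRST)")
```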

### Does this PR introduce any user-facing change?

not yet

### How was this patch tested?

new tests

Closes #26817 from cloud-fan/parser.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-16 18:55:17 +08:00
Boris Boutkov 3bf5498b4a [MINOR][DOCS] Fix documentation for slide function
### What changes were proposed in this pull request?

This PR proposes to fix the documentation for the slide function: it fixes a spacing issue and adds some parameter-related info.

### Why are the changes needed?

Documentation improvement

### Does this PR introduce any user-facing change?

No (doc-only change).

### How was this patch tested?

Manually tested by documentation build.

Closes #26896 from bboutkov/pyspark_doc_fix.

Authored-by: Boris Boutkov <boris.boutkov@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-16 16:29:09 +09:00
HyukjinKwon 0a2afcec7d [SPARK-30200][SQL][FOLLOW-UP] Expose only explain(mode: String) in Scala side, and clean up related codes
### What changes were proposed in this pull request?

This PR mainly targets:

1. Expose only explain(mode: String) in Scala side
2. Clean up related codes
    - Hide `ExplainMode` under the private `execution` package. No particular reason other than that `ExplainUtils` already lives there.
    - Use the `case object` + `trait` pattern in `ExplainMode`, mirroring `ParseMode`.
    - Move `Dataset.toExplainString` to `QueryExecution.explainString`, mirroring `QueryExecution.simpleString`, and deduplicate the code with `ExplainCommand`.
    - Use `ExplainMode` in `ExplainCommand` too.
    - Add `explainString` to `PythonSQLUtils` to avoid unexpected PySpark test failures while refactoring the Scala side.

### Why are the changes needed?

To minimise exposed APIs, deduplicate, and clean up.

### Does this PR introduce any user-facing change?

`Dataset.explain(mode: ExplainMode)` will be removed (which only exists in master).

### How was this patch tested?

Manually tested; existing tests should cover this.

Closes #26898 from HyukjinKwon/SPARK-30200-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-16 14:42:35 +09:00
Maxim Gekk 67b644c3d7 [SPARK-30166][SQL] Eliminate compilation warnings in JSONOptions
### What changes were proposed in this pull request?
In the PR, I propose to replace `setJacksonOptions()` in `JSONOptions` with `buildJsonFactory()`, which builds a `JsonFactory` using `JsonFactoryBuilder`. This avoids using **deprecated** feature configurations from `JsonParser.Feature`.
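
A hedged sketch of the builder-based construction (Jackson 2.10 API as best I recall; the method parameters and option subset here are illustrative and may not match the actual `JSONOptions` code):
```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonFactoryBuilder}
import com.fasterxml.jackson.core.json.JsonReadFeature

// Build the factory with read features configured up front, instead of
// configuring deprecated JsonParser.Feature flags on an existing factory.
def buildJsonFactory(
    allowNumericLeadingZeros: Boolean,
    allowNonNumericNumbers: Boolean): JsonFactory = {
  new JsonFactoryBuilder()
    .configure(JsonReadFeature.ALLOW_LEADING_ZEROS_FOR_NUMBERS, allowNumericLeadingZeros)
    .configure(JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS, allowNonNumericNumbers)
    .build()
}
```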

### Why are the changes needed?
- The changes eliminate the following compilation warnings in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala`:
```
    Warning:Warning:line (137)Java enum ALLOW_NUMERIC_LEADING_ZEROS in Java enum Feature is deprecated: see corresponding Javadoc for more information.
    factory.configure(JsonParser.Feature.ALLOW_NUMERIC_LEADING_ZEROS, allowNumericLeadingZeros)
    Warning:Warning:line (138)Java enum ALLOW_NON_NUMERIC_NUMBERS in Java enum Feature is deprecated: see corresponding Javadoc for more information.
    factory.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, allowNonNumericNumbers)
    Warning:Warning:line (139)Java enum ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER in Java enum Feature is deprecated: see corresponding Javadoc for more information.
    factory.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER,
    Warning:Warning:line (141)Java enum ALLOW_UNQUOTED_CONTROL_CHARS in Java enum Feature is deprecated: see corresponding Javadoc for more information.
    factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
```
- This combines building the `JsonFactory` with setting options from `JSONOptions`, so we can no longer forget to call `setJacksonOptions` in the future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `JsonSuite`, `JsonFunctionsSuite`, `JsonExpressionsSuite`.

Closes #26797 from MaxGekk/eliminate-warning.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-15 08:45:57 -06:00
fuwhu 4cbef8988e [SPARK-30259][SQL] Fix CREATE TABLE behavior when session catalog is specified explicitly
### What changes were proposed in this pull request?
Fix a bug: CREATE TABLE throws an error when the session catalog is specified explicitly.

### Why are the changes needed?
Currently, Spark throws an error when the session catalog is specified explicitly in the "CREATE TABLE" and "CREATE TABLE AS SELECT" commands, e.g.
> CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i;

the error message is like below:
> 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl
> 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr      cmd=get_table : db=spark_catalog tbl=tbl
> 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog
> 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr      cmd=get_database: spark_catalog
> 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException
> Error in query: Database 'spark_catalog' not found;

### Does this PR introduce any user-facing change?
Yes, after this PR, "CREATE TABLE" and "CREATE TABLE AS SELECT" complete successfully when the session catalog "spark_catalog" is specified explicitly.

### How was this patch tested?
New unit tests added.

Closes #26887 from fuwhu/SPARK-30259.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-14 15:36:14 -08:00
Takeshi Yamamuro f483a13d4a [SPARK-30231][SQL][PYTHON][FOLLOWUP] Make error messages clear in PySpark df.explain
### What changes were proposed in this pull request?

This PR is a follow-up of #26861 to address minor comments from viirya.

### Why are the changes needed?

For better error messages.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #26886 from maropu/SPARK-30231-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-14 14:26:50 -08:00
Kent Yao d3ec8b1735 [SPARK-30066][SQL] Support columnar execution on interval types
### What changes were proposed in this pull request?

Columnar execution support for interval types

### Why are the changes needed?

Support caching tables with interval columns; this improves performance too.

### Does this PR introduce any user-facing change?

Yes, cache table will accept interval columns.

### How was this patch tested?

add ut

Closes #26699 from yaooqinn/SPARK-30066.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-14 13:10:46 -08:00
Burak Yavuz 4c37a8a3f4 [SPARK-30143][SS] Add a timeout on stopping a streaming query
### What changes were proposed in this pull request?

Add a timeout configuration for StreamingQuery.stop()

### Why are the changes needed?

The stop() method on a Streaming Query awaits the termination of the stream execution thread. However, the stream execution thread may block forever depending on the streaming source implementation (like in Kafka, which runs UninterruptibleThreads).

This causes control flow applications to hang indefinitely as well. We'd like to introduce a timeout for stopping the execution thread, so that the control flow thread can decide to take an action if the timeout is hit.

### Does this PR introduce any user-facing change?

By default, no. If the timeout configuration is set, then a TimeoutException will be thrown if a stream cannot be stopped within the given timeout.
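
A hedged usage sketch (assuming the new timeout is exposed via a SQL conf, taken here to be `spark.sql.streaming.stopTimeout` in milliseconds; the key name is an assumption):
```scala
import java.util.concurrent.TimeoutException

// Assumed configuration key and unit for the new stop timeout.
spark.conf.set("spark.sql.streaming.stopTimeout", "5000")

val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()

try {
  query.stop() // waits at most the configured timeout for the execution thread
} catch {
  case _: TimeoutException =>
    // the stream did not stop in time; the caller decides what to do next
}
```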

### How was this patch tested?

Unit tests

Closes #26771 from brkyvz/stopTimeout.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-12-13 15:16:00 -08:00
Terry Kim ac9b1881a2 [SPARK-30248][SQL] Fix DROP TABLE behavior when session catalog name is provided in the identifier
### What changes were proposed in this pull request?

If a table name is qualified with session catalog name `spark_catalog`, the `DROP TABLE` command fails.

For example, the following

```
sql("CREATE TABLE tbl USING json AS SELECT 1 AS i")
sql("DROP TABLE spark_catalog.tbl")
```
fails with:
```
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog' not found;
   at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
   at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
   at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:45)
   at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:336)
```

This PR correctly resolves `spark_catalog` as a catalog.

### Why are the changes needed?

It's fixing a bug.

### Does this PR introduce any user-facing change?

Yes, now, the `spark_catalog.tbl` in the above example is dropped as expected.

### How was this patch tested?

Added a test.

Closes #26878 from imback82/fix_drop_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-13 21:45:35 +08:00
Takeshi Yamamuro 64c7b94d64 [SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain
### What changes were proposed in this pull request?

This pr intends to support explain modes implemented in #26829 for PySpark.

### Why are the changes needed?

For better debugging info in PySpark DataFrames.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UTs.

Closes #26861 from maropu/ExplainModeInPython.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-13 17:44:23 +09:00
Jungtaek Lim (HeartSaVioR) 94eb66593a [SPARK-30227][SQL] Add close() on DataWriter interface
### What changes were proposed in this pull request?

This patch adds a close() method to the DataWriter interface, which becomes the place to clean up resources.

### Why are the changes needed?

The lifecycle of a DataWriter instance ends at either commit() or abort(). That makes data source implementors feel they can place resource cleanup in both, but abort() can be called when commit() fails, so they have to ensure they don't do double cleanup if the cleanup is not idempotent.

### Does this PR introduce any user-facing change?

Depends on the definition of user: developers of a custom DSv2 source have to add close() to their DataWriter implementations. It's OK to add close() with an empty body, as they should already handle resource cleanup in commit/abort, but they will likely want to migrate the cleanup logic to close() since it avoids double cleanup. For end users of a provided DSv2 source (built-in or third-party), there is no change.
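
A hedged Scala sketch of what a custom writer might look like after this change (the class, field and commit-message type are illustrative):
```scala
import java.io.BufferedWriter
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// commit()/abort() keep handling the transactional part, while close() becomes
// the single place that releases the underlying resource.
class TextDataWriter(out: BufferedWriter) extends DataWriter[InternalRow] {
  override def write(record: InternalRow): Unit = out.write(record.getString(0) + "\n")
  override def commit(): WriterCommitMessage = { out.flush(); new WriterCommitMessage {} }
  override def abort(): Unit = ()           // discard partial output if needed
  override def close(): Unit = out.close()  // idempotent cleanup lives here
}
```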

### How was this patch tested?

Existing tests.

Closes #26855 from HeartSaVioR/SPARK-30227.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-13 16:12:41 +08:00
Pablo Langa cb6d2b3f83 [SPARK-30040][SQL] DROP FUNCTION should do multi-catalog resolution
### What changes were proposed in this pull request?

Add DropFunctionStatement and make DROP FUNCTION go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusion when running
`DROP FUNCTION namespace.function`.

### Does this PR introduce any user-facing change?

Yes. When running DROP FUNCTION namespace.function Spark fails the command if the current catalog is set to a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26854 from planga82/feature/SPARK-30040_DropFunctionV2Catalog.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-12 15:15:54 -08:00
Anton Okolnychyi 5114389aef [SPARK-30107][SQL] Expose nested schema pruning to all V2 sources
### What changes were proposed in this pull request?

This PR exposes the existing logic for nested schema pruning to all sources, which is in line with the description of `SupportsPushDownRequiredColumns`.

Right now, `SchemaPruning` (rule, not helper utility) is applied in the optimizer directly on certain instances of `Table` ignoring `SupportsPushDownRequiredColumns` that is part of `ScanBuilder`. I think it would be cleaner to perform schema pruning and filter push-down in one place. Therefore, this PR moves all the logic into `V2ScanRelationPushDown`.

### Why are the changes needed?

This change allows all V2 data sources to benefit from nested column pruning (if they support it).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This PR mostly relies on existing tests. On top, it adds one test to verify that top-level schema pruning works as well as one test for predicates with subqueries.

Closes #26751 from aokolnychyi/nested-schema-pruning-ds-v2.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-12-12 13:40:46 -08:00
HyukjinKwon cc087a3ac5 [SPARK-30162][SQL] Add PushedFilters to metadata in Parquet DSv2 implementation
### What changes were proposed in this pull request?

This PR proposes to add `PushedFilters` into metadata to show the pushed filters in Parquet DSv2 implementation. In case of ORC, it is already added at https://github.com/apache/spark/pull/24719/files#diff-0fc82694b20da3cd2cbb07206920eef7R62-R64

### Why are the changes needed?

In order for users to be able to debug, and to match with ORC.

### Does this PR introduce any user-facing change?

```scala
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").filter("5 > id").explain()
```

**Before:**

```
== Physical Plan ==
*(1) Project [id#20L]
+- *(1) Filter (isnotnull(id#20L) AND (5 > id#20L))
   +- *(1) ColumnarToRow
      +- BatchScan[id#20L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint>
```

**After:**

```
== Physical Plan ==
*(1) Project [id#13L]
+- *(1) Filter (isnotnull(id#13L) AND (5 > id#13L))
   +- *(1) ColumnarToRow
      +- BatchScan[id#13L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint>, PushedFilters: [IsNotNull(id), LessThan(id,5)]
```

### How was this patch tested?
Unit tests were added and manually tested.

Closes #26857 from HyukjinKwon/SPARK-30162.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-12 08:33:33 -08:00
Aaron Lau fd39b6db34 [SQL] Typo in HashedRelation error
### What changes were proposed in this pull request?

Fixed typo in exception message of HashedRelations

### Why are the changes needed?

Better exception messages

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

No tests needed

Closes #26822 from aaron-lau/master.

Authored-by: Aaron Lau <aaron.lau@datadoghq.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-12 08:42:18 -06:00
root1 2936507f94 [SPARK-30150][SQL] ADD FILE, ADD JAR, LIST FILE & LIST JAR Command do not accept quoted path
### What changes were proposed in this pull request?
`add file "abc.txt"` and `add file 'abc.txt'` are not supported.
For these two, Spark SQL gives a `FileNotFoundException`.
Only `add file abc.txt` is supported currently.

After these changes, the path can be given as quoted text for the ADD FILE, ADD JAR, LIST FILE, and LIST JAR commands in spark-sql.

### Why are the changes needed?

In many spark-sql commands (like CREATE TABLE, etc.) we write the path in quoted form only. To maintain this consistency, we should support the quoted form with these commands as well.

### Does this PR introduce any user-facing change?
Yes. Now users can write paths with quotes.

### How was this patch tested?
Manually tested.

Closes #26779 from iRakson/SPARK-30150.

Authored-by: root1 <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-12 17:11:21 +08:00
Terry Kim 3741a36ebf [SPARK-30104][SQL][FOLLOWUP] V2 catalog named 'global_temp' should always be masked
### What changes were proposed in this pull request?

This is a follow up to #26741 to address the following:
1. V2 catalog named `global_temp` should always be masked.
2. #26741 introduces `CatalogAndIdentifier`, which supersedes `CatalogObjectIdentifier`. This PR removes `CatalogObjectIdentifier` and its usages and replaces them with `CatalogAndIdentifier`.
3. `CatalogObjectIdentifier(catalog, ident) if !isSessionCatalog(catalog)` and `CatalogObjectIdentifier(catalog, ident) if isSessionCatalog(catalog)` are replaced with `NonSessionCatalogAndIdentifier` and `SessionCatalogAndIdentifier` respectively.

### Why are the changes needed?

To fix an existing issue with handling a v2 catalog named `global_temp` and to simplify the code base.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added new tests.

Closes #26853 from imback82/lookup_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-12 14:47:20 +08:00
jiake 1ced6c1544 [SPARK-30213][SQL] Remove the mutable status in ShuffleQueryStageExec
### What changes were proposed in this pull request?
Currently `ShuffleQueryStageExec` contains mutable state, e.g. the `mapOutputStatisticsFuture` variable, so it is not easy to pass along when we copy a `ShuffleQueryStageExec`. This PR moves the `mapOutputStatisticsFuture` variable from `ShuffleQueryStageExec` to `ShuffleExchangeExec`, and then we can pass the value of `mapOutputStatisticsFuture` when copying.

### Why are the changes needed?
In order to remove the mutable state in `ShuffleQueryStageExec`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #26846 from JkSelf/removeMutableVariable.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-11 19:39:31 -08:00
Pablo Langa 9cf9304e17 [SPARK-30038][SQL] DESCRIBE FUNCTION should do multi-catalog resolution
### What changes were proposed in this pull request?

Add DescribeFunctionsStatement and make DESCRIBE FUNCTIONS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusion when running
`DESCRIBE FUNCTIONS namespace.function`.

### Does this PR introduce any user-facing change?

Yes. When running DESCRIBE FUNCTIONS namespace.function Spark fails the command if the current catalog is set to a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26840 from planga82/feature/SPARK-30038_DescribeFunction_V2Catalog.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-11 14:02:58 -08:00
Sean Owen 33f53cb2d5 [SPARK-30195][SQL][CORE][ML] Change some function, import definitions to work with stricter compiler in Scala 2.13
### What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-30195 for the background; I won't repeat it here. This is sort of a grab-bag of related issues.

### Why are the changes needed?

To cross-compile with Scala 2.13 later.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests for 2.12. I've been manually checking that this actually resolves the compile problems in 2.13 separately.

Closes #26826 from srowen/SPARK-30195.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-11 12:33:58 -08:00
Maxim Gekk e933539cdd [SPARK-29864][SPARK-29920][SQL] Strict parsing of day-time strings to intervals
### What changes were proposed in this pull request?
In the PR, I propose a new implementation of `fromDayTimeString` which strictly parses strings in day-time formats to intervals. The new implementation accepts only strings that match a pattern defined by the `from` and `to` bounds. Here is the mapping of user bounds to patterns:
- `[+|-]D+ H[H]:m[m]:s[s][.SSSSSSSSS]` for **DAY TO SECOND**
- `[+|-]D+ H[H]:m[m]` for **DAY TO MINUTE**
- `[+|-]D+ H[H]` for **DAY TO HOUR**
- `[+|-]H[H]:m[m]:s[s][.SSSSSSSSS]` for **HOUR TO SECOND**
- `[+|-]H[H]:m[m]` for **HOUR TO MINUTE**
- `[+|-]m[m]:s[s][.SSSSSSSSS]` for **MINUTE TO SECOND**

Closes #26327
Closes #26358

### Why are the changes needed?
- Improve user experience with Spark SQL, and respect the bounds specified by users.
- Behave the same as other broadly used DBMSs, such as Oracle and MySQL.

### Does this PR introduce any user-facing change?
Yes, before:
```sql
spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE;
interval 1 weeks 3 days 11 hours 12 minutes
```
After:
```sql
spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE;
Error in query:
requirement failed: Interval string must match day-time format of '^(?<sign>[+|-])?(?<hour>\d{1,2}):(?<minute>\d{1,2})$': 10 11:12:13.123(line 1, pos 16)

== SQL ==
SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE
----------------^^^
```

### How was this patch tested?
- Added tests to `IntervalUtilsSuite`
- By `ExpressionParserSuite`
- Updated `literals.sql`

Closes #26473 from MaxGekk/strict-from-daytime-string.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-12 01:08:53 +08:00
Takeshi Yamamuro a59cb13cda [SPARK-30200][SQL][FOLLOWUP] Fix typo in ExplainMode
### What changes were proposed in this pull request?

This pr is a follow-up of #26829 to fix typos in ExplainMode.

### Why are the changes needed?

For better docs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #26851 from maropu/SPARK-30200-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-11 08:17:53 -08:00
Terry Kim beae14d5ed [SPARK-30104][SQL] Fix catalog resolution for 'global_temp'
### What changes were proposed in this pull request?

`global_temp` is used as a database name to access global temp views. The current catalog lookup logic considers only the first element of a multi-part name when it resolves a catalog. This results in using the session catalog even when `global_temp` is used as a table name under a v2 catalog. This PR addresses this by making sure the multi-part name has two elements before using the session catalog.

### Why are the changes needed?

Currently, 'global_temp' can be used as a table name in certain commands (CREATE) but not in others (DESCRIBE):
```
// Assume "spark.sql.globalTempDatabase" is set to "global_temp".
sql(s"CREATE TABLE testcat.t (id bigint, data string) USING foo")
sql(s"CREATE TABLE testcat.global_temp (id bigint, data string) USING foo")
sql("USE testcat")

sql(s"DESCRIBE TABLE t").show
+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|           data|   string|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+

sql(s"DESCRIBE TABLE global_temp").show
org.apache.spark.sql.AnalysisException: Table not found: global_temp;;
  'DescribeTable 'UnresolvedV2Relation [global_temp], org.apache.spark.sql.connector.InMemoryTableSessionCatalog2f1af64f, `global_temp`, false
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122)
```

### Does this PR introduce any user-facing change?

Yes, `sql(s"DESCRIBE TABLE global_temp").show` in the above example now displays:
```
+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|           data|   string|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+
```
instead of throwing an exception.

### How was this patch tested?

Added new tests.

Closes #26741 from imback82/global_temp.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-11 16:56:42 +08:00
Sean Owen 3cc55f6a0a [SPARK-29392][CORE][SQL][FOLLOWUP] More removal of 'foo Symbol syntax for Scala 2.13
### What changes were proposed in this pull request?

Another continuation of https://github.com/apache/spark/pull/26748

### Why are the changes needed?

To cleanly cross compile with Scala 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests

Closes #26842 from srowen/SPARK-29392.4.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-10 19:41:24 -08:00
Kent Yao 8f0eb7dc86 [SPARK-29587][SQL] Support SQL Standard type real as float(4) numeric as decimal
### What changes were proposed in this pull request?
The types decimal and numeric are equivalent. Both types are part of the SQL standard.

The real type is 4 bytes, variable-precision, inexact, with 6 decimal digits of precision, the same as our float, and is part of the SQL standard.
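
A hypothetical snippet of the kind of DDL and casts this change is meant to accept, per the description above (the table and column names are made up):
```scala
// REAL behaves like FLOAT (4 bytes) and NUMERIC like DECIMAL, per the PR description.
spark.sql("CREATE TABLE prices (amount NUMERIC(10, 2), ratio REAL) USING parquet")
spark.sql("SELECT CAST('1.5' AS REAL) AS r, CAST('12.34' AS NUMERIC(10, 2)) AS n").show()
```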

### Why are the changes needed?

Improve SQL standard support. For reference, other DBs:
https://www.postgresql.org/docs/9.3/datatype-numeric.html
https://prestodb.io/docs/current/language/types.html#floating-point
http://www.sqlservertutorial.net/sql-server-basics/sql-server-data-types/
MySQL treats REAL as a synonym for DOUBLE PRECISION (a nonstandard variation), unless the REAL_AS_FLOAT SQL mode is enabled.
In MySQL, NUMERIC is implemented as DECIMAL, so the following remarks about DECIMAL apply equally to NUMERIC.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #26537 from yaooqinn/SPARK-29587.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-11 02:22:08 +08:00
Takeshi Yamamuro 6103cf1960 [SPARK-30200][SQL] Add ExplainMode for Dataset.explain
### What changes were proposed in this pull request?

This PR intends to add `ExplainMode` for explaining a `Dataset`/`DataFrame` with a given format mode (`ExplainMode`). `ExplainMode` has five modes, matching the SQL EXPLAIN command: `Simple`, `Extended`, `Codegen`, `Cost`, and `Formatted`.

For example, this PR enables users to explain a DataFrame/Dataset with the `FORMATTED` format implemented in #24759:
```
scala> spark.range(10).groupBy("id").count().explain(ExplainMode.Formatted)
== Physical Plan ==
* HashAggregate (3)
+- * HashAggregate (2)
   +- * Range (1)

(1) Range [codegen id : 1]
Output: [id#0L]

(2) HashAggregate [codegen id : 1]
Input: [id#0L]

(3) HashAggregate [codegen id : 1]
Input: [id#0L, count#8L]
```

This comes from [the cloud-fan suggestion.](https://github.com/apache/spark/pull/24759#issuecomment-560211270)

### Why are the changes needed?

To follow the SQL EXPLAIN command.

### Does this PR introduce any user-facing change?

No, this is just for a new API in Dataset.

### How was this patch tested?

Add tests in `ExplainSuite`.

Closes #26829 from maropu/DatasetExplain.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-10 09:51:29 -08:00
Yuanjian Li d9b3069412 [SPARK-30125][SQL] Remove PostgreSQL dialect
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?
As per the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-11 01:22:34 +08:00
Anton Okolnychyi a9f1809a2a [SPARK-30206][SQL] Rename normalizeFilters in DataSourceStrategy to be generic
### What changes were proposed in this pull request?

This PR renames `normalizeFilters` in `DataSourceStrategy` to be more generic as the logic is not specific to filters.

### Why are the changes needed?

These changes are needed to support PR #26751.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26830 from aokolnychyi/rename-normalize-exprs.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-10 07:49:22 -08:00
yi.wu aa9da9365f [SPARK-30151][SQL] Issue better error message when user-specified schema mismatched
### What changes were proposed in this pull request?

Issue a better error message when the user-specified schema does not match the relation schema.

### Why are the changes needed?

Inspired by https://github.com/apache/spark/pull/25248#issuecomment-559594305, a user could get a weird error message when the type mapping behavior changes between the Spark schema and the data source schema (e.g. JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell the user what is really happening so the error is clearer.

### Does this PR introduce any user-facing change?

Yes, users will see error message changes.

### How was this patch tested?

Updated existing tests.

Closes #26781 from Ngone51/dev-mismatch-schema.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-10 20:56:21 +08:00
Sean Owen 36fa1980c2 [SPARK-30158][SQL][CORE] Seq -> Array for sc.parallelize for 2.13 compatibility; remove WrappedArray
### What changes were proposed in this pull request?

Use Seq instead of Array in sc.parallelize, with reference types.
Remove usage of WrappedArray.
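
For illustration, a generic sketch of the `Seq`-based form that behaves the same on Scala 2.12 and 2.13 (assuming an existing `SparkContext` named `sc`; the case class is made up):
```scala
// Passing a Seq avoids relying on the Array-to-Seq implicit conversion, whose
// result type (WrappedArray vs. ArraySeq) differs between Scala 2.12 and 2.13.
case class Record(id: Int, name: String)

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
```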

### Why are the changes needed?

These both enable building on Scala 2.13.

### Does this PR introduce any user-facing change?

None

### How was this patch tested?

Existing tests

Closes #26787 from srowen/SPARK-30158.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-09 14:41:48 -06:00
Jungtaek Lim (HeartSaVioR) 538b8d101c [SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary imports
### What changes were proposed in this pull request?

This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (the GitHub Action caught it and I can reproduce it locally). It looks like the Jenkins build may have a different, or less strict, policy for the Java style check.

### Why are the changes needed?

Java linter starts complaining.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

lint-java passed locally

This closes #26819

Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-09 08:57:20 -08:00
Gengliang Wang a717d219a6 [SPARK-30159][SQL][TESTS] Fix the method calls of QueryTest.checkAnswer
### What changes were proposed in this pull request?

Before this PR, the method `checkAnswer` in the object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen.
The actual exceptions are thrown in the trait `QueryTest`.

However, there are some test suites (`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers.

After this PR:
1. The method `checkAnswer` in the object `QueryTest` will fail tests on errors or unexpected results.
2. Add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. Some test suites use this one to customize the test failure message.
3. For the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from the object `QueryTest`.

### Why are the changes needed?

We should fix these method calls to perform actual validations in test suites.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #26788 from gengliangwang/fixCheckAnswer.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-09 22:19:08 +09:00
Aman Omer dcea7a4c9a [SPARK-29883][SQL] Implement a helper method for aliasing bool_and() and bool_or()
### What changes were proposed in this pull request?
This PR introduces a method `expressionWithAlias` in the class `FunctionRegistry` which is used to register a function's constructor. Currently, `expressionWithAlias` is used to register `BoolAnd` & `BoolOr`.

### Why are the changes needed?
The error message is wrong when an alias name is used for `BoolAnd` & `BoolOr`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tested manually.

For query,
`select every('true');`

Output before this PR,

> Error in query: cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 'bool_and' should have been boolean, but it's [string].; line 1 pos 7;

After this PR,

> Error in query: cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7;

Closes #26712 from amanomer/29883.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-09 13:23:16 +08:00
Pablo Langa bca9de6684 [SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowFunctionsStatement and make SHOW FUNCTIONS go through the same catalog/table resolution framework of v2 commands.

We don't have the following method in the catalog that is needed to implement this as a v2 command:
* catalog.listFunctions

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusion when running
`SHOW FUNCTIONS LIKE namespace.function`.

### Does this PR introduce any user-facing change?

Yes. When running SHOW FUNCTIONS LIKE namespace.function Spark fails the command if the current catalog is set to a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-12-08 20:15:09 -08:00
Kent Yao e88d74052b [SPARK-30147][SQL] Trim the string when cast string type to booleans
### What changes were proposed in this pull request?

Currently, we trim the string when casting a string value to the `canCast` types, e.g. int, double, decimal, interval, date, timestamp, except for boolean.
This makes type casting and coercion inconsistent in Spark.
It does not fit the ANSI SQL standard either.
```
If TD is boolean, then
Case:
a) If SD is character string, then SV is replaced by
    TRIM ( BOTH ' ' FROM VE )
    Case:
    i) If the rules for literal in Subclause 5.3, “literal”, can be applied to SV to determine a valid
value of the data type TD, then let TV be that value.
   ii) Otherwise, an exception condition is raised: data exception — invalid character value for cast.
b) If SD is boolean, then TV is SV
```
In this pull request, we trim all whitespace from both ends of the string before converting it to a boolean value. This behavior is the same as for the other types, but slightly different from the SQL standard, which trims only spaces.

### Why are the changes needed?

Type cast/coercion consistency

### Does this PR introduce any user-facing change?

Yes, strings with whitespace at both ends will be trimmed before being converted to booleans.

e.g. `select cast('\t true' as boolean)` now results in `true`; before this PR it was `null`.

### How was this patch tested?

add unit tests

Closes #26776 from yaooqinn/SPARK-30147.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-12-07 15:03:51 +09:00
Aman Omer 51aa7a920e [SPARK-30148][SQL] Optimize writing plans if there is an analysis exception
### What changes were proposed in this pull request?
Optimized QueryExecution.scala#writePlans().

### Why are the changes needed?
If a query fails in the analysis phase with an AnalysisException, there is no need to execute further phases since those will return the same result, i.e. the AnalysisException.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually

Closes #26778 from amanomer/optExplain.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-07 10:58:02 +09:00
Sean Owen a30ec19a73 [SPARK-30155][SQL] Rename parse() to parseString() to avoid conflict in Scala 2.13
### What changes were proposed in this pull request?

Rename internal method LegacyTypeStringParser.parse() to parseString().

### Why are the changes needed?

In Scala 2.13, the parse() definition clashes with supertype declarations.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #26784 from srowen/SPARK-30155.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 16:16:28 -08:00
wuyi 58be82ad4b [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for the `CREATE TABLE` syntax instead of `hive` in Spark 3.0.

And to help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.

### Why are the changes needed?

1. Currently, the `CREATE TABLE` syntax uses the hive provider to create a table, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent.

2. Users may get confused in some cases. For example:

```
CREATE TABLE t1 (c1 INT) USING PARQUET;
CREATE TABLE t2 (c1 INT);
```

In these two DDLs, users may think that `t2` should also use parquet as the default provider since Spark always advertises parquet as the default format. However, it's hive in this case.

On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`:

```
CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE;
CREATE TABLE t4 AS SELECT 1 AS VALUE;
```
And these two cases together can be really confusing.

3. Now, Spark SQL is very independent and popular. We do not need to be fully consistent with Hive's behavior.

### Does this PR introduce any user-facing change?

Yes. Before this PR, the `CREATE TABLE` syntax used the hive provider; now it uses the value of `spark.sql.sources.default` as its provider.

### How was this patch tested?

Added tests in `DDLParserSuite` and `HiveDDlSuite`.

Closes #26736 from Ngone51/dev-create-table-using-parquet-by-default.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-07 02:15:25 +08:00
Liang-Chi Hsieh c1a5f94973 [SPARK-30112][SQL] Allow insert overwrite same table if using dynamic partition overwrite
### What changes were proposed in this pull request?

This patch proposes to allow insert overwrite same table if using dynamic partition overwrite.

### Why are the changes needed?

Currently, INSERT OVERWRITE cannot overwrite the same table even when it is a dynamic partition overwrite. But for dynamic partition overwrite, we do not delete partition directories ahead of time; we write to staging directories and then move the data to the final partition directories. So we should be able to insert overwrite into the same table under dynamic partition overwrite.

This enables users to read data from a table and insert overwrite into the same table using dynamic partition overwrite. Because this is not allowed for now, users need to write to another temporary location and move it back to the table.
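
A hedged sketch of the pattern this unlocks (assuming dynamic partition overwrite is enabled via `spark.sql.sources.partitionOverwriteMode`; the table and columns are illustrative):
```scala
// Enable dynamic partition overwrite, then read from and overwrite the same table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql("""
  INSERT OVERWRITE TABLE events PARTITION (dt)
  SELECT upper(value) AS value, dt FROM events WHERE dt = '2019-12-06'
""")
```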

### Does this PR introduce any user-facing change?

Yes. Users can insert overwrite the same table when using dynamic partition overwrite.

### How was this patch tested?

Unit test.

Closes #26752 from viirya/dynamic-overwrite-same-table.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 09:22:16 -08:00
gengjiaan 187f3c1773 [SPARK-28083][SQL] Support LIKE ... ESCAPE syntax
## What changes were proposed in this pull request?

The syntax 'LIKE predicate: ESCAPE clause' is ANSI SQL.
For example:

```
select 'abcSpark_13sd' LIKE '%Spark\\_%';             //true
select 'abcSpark_13sd' LIKE '%Spark/_%';              //false
select 'abcSpark_13sd' LIKE '%Spark"_%';              //false
select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';   //true
select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';   //true
select 'abcSpark%13sd' LIKE '%Spark\\%%';             //true
select 'abcSpark%13sd' LIKE '%Spark/%%';              //false
select 'abcSpark%13sd' LIKE '%Spark"%%';              //false
select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';   //true
select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';   //true
select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';          //true
select 'abcSpark/13sd' LIKE '%Spark//_%';             //false
select 'abcSpark"13sd' LIKE '%Spark""_%';             //false
select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';  //true
select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';  //true
```
But Spark SQL only supports 'LIKE predicate'.

Note: If the input string or pattern string is null, then the result is null too.

Several mainstream databases support this syntax.

**PostgreSQL:**
https://www.postgresql.org/docs/11/functions-matching.html

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm?zoom_highlight=like%20escape

**MySQL:**
https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html

**Oracle:**
https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdbc/JDBC-reference-information.html#GUID-5D371A5B-D7F6-42EB-8C0D-D317F3C53708
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-0779657B-06A8-441F-90C5-044B47862A0A

## How was this patch tested?

Existing UTs and new UTs.

I merged this PR into my production environment and ran the SQL above:
```
spark-sql> select 'abcSpark_13sd' LIKE '%Spark\\_%';
true
Time taken: 0.119 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%';
false
Time taken: 0.103 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%';
false
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';
true
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';
true
Time taken: 0.092 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark\\%%';
true
Time taken: 0.109 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%';
false
Time taken: 0.1 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%';
false
Time taken: 0.081 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';
true
Time taken: 0.113 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';
true
Time taken: 0.078 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%';
false
Time taken: 0.067 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%';
false
Time taken: 0.084 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
```
I created a table whose schema is:
```
spark-sql> desc formatted gja_test;
key     string  NULL
value   string  NULL
other   string  NULL

# Detailed Table Information
Database        test
Table   gja_test
Owner   test
Created Time    Wed Apr 10 11:06:15 CST 2019
Last Access     Thu Jan 01 08:00:00 CST 1970
Created By      Spark 2.4.1-SNAPSHOT
Type    MANAGED
Provider        hive
Table Properties        [transient_lastDdlTime=1563443838]
Statistics      26 bytes
Location        hdfs://namenode.xxx:9000/home/test/hive/warehouse/test.db/gja_test
Serde Library   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat     org.apache.hadoop.mapred.TextInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties      [field.delim=   , serialization.format= ]
Partition Provider      Catalog
Time taken: 0.642 seconds, Fetched 21 row(s)
```
Table `gja_test` contains three rows of data.
```
spark-sql> select * from gja_test;
a       A       ao
b       B       bo
"__     """__   "
Time taken: 0.665 seconds, Fetched 3 row(s)
```
Finally, I tested this function:
```
spark-sql> select * from gja_test where key like value escape '"';
"__     """__   "
Time taken: 0.687 seconds, Fetched 1 row(s)
```

Closes #25001 from beliefer/ansi-sql-like.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-12-06 00:07:38 -08:00
Terry Kim b86d4bb931 [SPARK-30001][SQL] ResolveRelations should handle both V1 and V2 tables
### What changes were proposed in this pull request?

This PR makes `Analyzer.ResolveRelations` responsible for looking up both v1 and v2 tables from the session catalog and create an appropriate relation.

### Why are the changes needed?

Currently there are two issues:
1. As described in [SPARK-29966](https://issues.apache.org/jira/browse/SPARK-29966), the logic for resolving relation can load a table twice, which is a perf regression (e.g., Hive metastore can be accessed twice).
2. As described in [SPARK-30001](https://issues.apache.org/jira/browse/SPARK-30001), if a catalog name is specified for v1 tables, the query fails:
```
scala> sql("create table t using csv as select 1 as i")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
org.apache.spark.sql.AnalysisException: Table or view not found: spark_catalog.t; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [spark_catalog, t]
```

### Does this PR introduce any user-facing change?

Yes. Now the catalog name is resolved correctly:
```
scala> sql("create table t using csv as select 1 as i")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
+---+
|  i|
+---+
|  1|
+---+
```

### How was this patch tested?

Added new tests.

Closes #26684 from imback82/resolve_relation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-06 15:45:13 +08:00
madianjun a5ccbced8c [SPARK-30067][CORE] Fix fragment offset comparison in getBlockHosts
### What changes were proposed in this pull request?

This fixes a bug in the getBlockHosts() function. In the case "The fragment ends at a position within this block", the end of the fragment should be before the end of the block, where the "end of block" means `b.getOffset + b.getLength`, not `b.getLength`.

### Why are the changes needed?

When comparing the fragment end and the block end, we should use the fragment's `offset + length` and compare it to the block's `b.getOffset + b.getLength`, not just the block's length.
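
A schematic of the corrected comparison (the names model the description above, not the exact Spark code):
```scala
// A block covers [blockOffset, blockOffset + blockLength); the fragment covers
// [fragmentOffset, fragmentOffset + fragmentLength).
def fragmentEndsInBlock(
    fragmentOffset: Long, fragmentLength: Long,
    blockOffset: Long, blockLength: Long): Boolean = {
  val fragmentEnd = fragmentOffset + fragmentLength
  // The fix: compare against blockOffset + blockLength, not blockLength alone.
  fragmentEnd > blockOffset && fragmentEnd <= blockOffset + blockLength
}
```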

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?
No test.

Closes #26650 from mdianjun/fix-getBlockHosts.

Authored-by: madianjun <madianjun@jd.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 23:39:49 -08:00
Jungtaek Lim (HeartSaVioR) 25431d79f7 [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
### What changes were proposed in this pull request?

This patch prevents the cleanup operation in FileStreamSource if the source files belong to the FileStreamSink. This is needed because the output of FileStreamSink can be read with multiple Spark queries and queries will read the files based on the metadata log, which won't reflect the cleanup.

To simplify the logic, the patch only takes care of the case of when the source path without glob pattern refers to the output directory of FileStreamSink, via checking FileStreamSource to see whether it leverages metadata directory or not to list the source files.

### Why are the changes needed?

Without this patch, if end users turn on cleanup option with the path which is the output of FileStreamSink, there may be out of sync between metadata and available files which may break other queries reading the path.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added UT.

Closes #26590 from HeartSaVioR/SPARK-29953.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-12-05 21:46:28 -08:00
Sean Owen 7782b61a31 [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings
TL;DR - this is more of the same change in https://github.com/apache/spark/pull/26748

I told you it'd be iterative!

Closes #26765 from srowen/SPARK-29392.3.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 13:48:29 -08:00
Kent Yao b9cae37750 [SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres
### What changes were proposed in this pull request?
Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy:
```scala
 /**
   * For [[Add]]:
   * 1. if both side are interval, stays the same;
   * 2. else if one side is interval, turns it to [[TimeAdd]];
   * 3. else if one side is date, turns it to [[DateAdd]] ;
   * 4. else stays the same.
   *
   * For [[Subtract]]:
   * 1. if both side are interval, stays the same;
   * 2. else if the right side is an interval, turns it to [[TimeSub]];
   * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
   * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
   * 5. else if the left side is date, turns it to [[DateSub]];
   * 6. else turns it to stays the same.
   *
   * For [[Multiply]]:
   * 1. If one side is interval, turns it to [[MultiplyInterval]];
   * 2. otherwise, stays the same.
   *
   * For [[Divide]]:
   * 1. If the left side is interval, turns it to [[DivideInterval]];
   * 2. otherwise, stays the same.
   */
```
Besides, we change datetime functions from implicitly cast types to strict ones; all available type coercions happen in the `DateTimeOperations` coercion rule.

### Why are the changes needed?

Feature Parity between PostgreSQL and Spark, and make the null semantic consistent with Spark.

### Does this PR introduce any user-facing change?

1. date_add/date_sub functions only accept int/tinyint/smallint as the second arg; double/string etc. are forbidden, as in Hive, since they produce weird results.

### How was this patch tested?

add ut

Closes #26412 from yaooqinn/SPARK-29774.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 22:03:44 +08:00