ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Maxim Gekk	4ca31b470f	[SPARK-30606][SQL] Fix the `like` function with 2 parameters ### What changes were proposed in this pull request? In the PR, I propose to add additional constructor in the `Like` expression. The constructor can be used on applying the `like` function with 2 parameters. ### Why are the changes needed? `FunctionRegistry` cannot find a constructor if the `like` function is applied to 2 parameters. ### Does this PR introduce any user-facing change? Yes, before: ```sql spark-sql> SELECT like('Spark', '_park'); Invalid arguments for function like; line 1 pos 7 org.apache.spark.sql.AnalysisException: Invalid arguments for function like; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1412) ``` After: ```sql spark-sql> SELECT like('Spark', '_park'); true ``` ### How was this patch tested? By running `check outputs of expression examples` from `SQLQuerySuite`. Closes #27323 from MaxGekk/fix-like-func. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-22 15:40:24 -08:00
jiake	6dfaa0783f	[SPARK-30549][SQL] Fix the subquery shown issue in UI When enable AQE ### What changes were proposed in this pull request? After [PR#25316](https://github.com/apache/spark/pull/25316) fixed the deadlock issue in [PR#25308](https://github.com/apache/spark/pull/25308), the subquery metrics cannot be shown in the UI, as in the following screenshot. ![image](https://user-images.githubusercontent.com/11972570/72891385-160ec980-3d4f-11ea-91fc-ccaad890f7dc.png) This PR fixes the subquery UI issue by adding a `SparkListenerSQLAdaptiveSQLMetricUpdates` event to update the subquery SQL metrics. With this PR, the subquery UI shows correctly, as in the following screenshot: ![image](https://user-images.githubusercontent.com/11972570/72893610-66d4f100-3d54-11ea-93c9-f444b2f31952.png) ### Why are the changes needed? To show the subquery metrics in the UI when AQE is enabled. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT Closes #27260 from JkSelf/fixSubqueryUI. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-22 09:02:34 -08:00
Kent Yao	8e280cebf2	[SPARK-30592][SQL] Interval support for csv and json functions ### What changes were proposed in this pull request? In this PR, I propose to fully support intervals in the CSV and JSON functions. On one hand, CSV and JSON records consist of string values, and the cast logic can now cast strings to and from intervals, so we can make those functions support intervals easily. Before this change, we could only use a cast as a workaround. ```sql SELECT cast(from_csv('1, 1 day', 'a INT, b string').b as interval) struct<CAST(from_csv(1, 1 day).b AS INTERVAL):interval> 1 days ``` On the other hand, we ban reading or writing intervals from CSV and JSON files. To read and write directly with external json/csv storage, you still need an explicit cast, e.g. ```scala spark.read.schema("a string").json("a.json").selectExpr("cast(a as interval)").show +------+ \| a\| +------+ \|1 days\| +------+ ``` ### Why are the changes needed? To future-proof the interval type. ### Does this PR introduce any user-facing change? Yes, the `to_json`/`from_json` functions can deal with intervals now. For `from_json` there is no such use case because we do not support `a interval`; for `to_json`, we can use interval values now. #### before ```sql SELECT to_json(map('a', interval 25 month 100 day 130 minute)); Error in query: cannot resolve 'to_json(map('a', INTERVAL '2 years 1 months 100 days 2 hours 10 minutes'))' due to data type mismatch: Unable to convert column a of type interval to JSON.; line 1 pos 7; 'Project [unresolvedalias(to_json(map(a, 2 years 1 months 100 days 2 hours 10 minutes), Some(Asia/Shanghai)), None)] +- OneRowRelation ``` #### after ```sql SELECT to_json(map('a', interval 25 month 100 day 130 minute)) {"a":"2 years 1 months 100 days 2 hours 10 minutes"} ``` ### How was this patch tested? add ut Closes #27317 from yaooqinn/SPARK-30592. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-23 00:41:46 +08:00
Wenchen Fan	b8cb52a8a7	[SPARK-30555][SQL] MERGE INTO insert action should only access columns from source table ### What changes were proposed in this pull request? When resolving the `Assignment` of the insert action in MERGE INTO, only resolve with the source table, to avoid an ambiguous attribute failure if there is a same-name column in the target table. ### Why are the changes needed? The insert action is used when NOT MATCHED, so it can't access the row from the target table anyway. ### Does this PR introduce any user-facing change? no ### How was this patch tested? new tests Closes #27265 from cloud-fan/merge. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-22 21:31:11 +08:00
Kent Yao	f2d71f5838	[SPARK-30591][SQL] Remove the nonstandard SET OWNER syntax for namespaces ### What changes were proposed in this pull request? This pr removes the nonstandard `SET OWNER` syntax for namespaces and changes the owner reserved properties from `ownerName` and `ownerType` to `owner`. ### Why are the changes needed? the `SET OWNER` syntax for namespaces is hive-specific and non-sql standard, we need a more future-proofing design before we implement user-facing changes for SQL security issues ### Does this PR introduce any user-facing change? no, just revert an unpublic syntax ### How was this patch tested? modified uts Closes #27300 from yaooqinn/SPARK-30591. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-22 16:00:05 +08:00
fuwhu	cfb1706eaa	[SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions ### What changes were proposed in this pull request? Add the optimizer rule PruneHiveTablePartitions, which prunes Hive table partitions based on filters on partition columns. After pruning, the total size of the remaining partitions may be small enough for a broadcast join in the JoinSelection strategy. ### Why are the changes needed? In the JoinSelection strategy, Spark uses "plan.stats.sizeInBytes" to decide whether the plan is suitable for a broadcast join. Currently, "plan.stats.sizeInBytes" does not take "pruned partitions" into account, so Spark may miss some broadcast joins and use sort-merge join instead, which hurts join performance. This PR aims at taking "pruned partitions" into account for Hive tables in "plan.stats.sizeInBytes" and thereby improving performance by using broadcast join where possible. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Added unit tests. This is based on #25919, credits should go to lianhuiwang and advancedxy. Closes #26805 from fuwhu/SPARK-15616. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 21:26:30 +08:00
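The effect described in SPARK-15616 can be illustrated with a small stand-alone model. This is only a sketch in plain Python, not Spark code: `Partition`, `pruned_size` and `can_broadcast` are hypothetical names, and the 10 MB constant mirrors the documented default of `spark.sql.autoBroadcastJoinThreshold`.

```python
from dataclasses import dataclass

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD = 10 * 1024 * 1024


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for one Hive table partition."""
    spec: dict           # partition column -> value, e.g. {"dt": "2020-01-01"}
    size_in_bytes: int   # size recorded for this partition


def pruned_size(partitions, predicate):
    """Sum the sizes of only those partitions that survive the partition filter."""
    return sum(p.size_in_bytes for p in partitions if predicate(p.spec))


def can_broadcast(size_in_bytes, threshold=BROADCAST_THRESHOLD):
    """A plan is a broadcast candidate when its estimated size fits the threshold."""
    return size_in_bytes <= threshold


MB = 1024 * 1024
table = [
    Partition({"dt": "2020-01-01"}, 4 * MB),
    Partition({"dt": "2020-01-02"}, 8 * MB),
    Partition({"dt": "2020-01-03"}, 6 * MB),
]

# Without pruning, the estimate is the whole table (18 MB): no broadcast join.
full_size = pruned_size(table, lambda spec: True)
# With the partition filter applied first, only 4 MB remain: broadcast is possible.
filtered_size = pruned_size(table, lambda spec: spec["dt"] == "2020-01-01")
```

In this model the same table flips from a sort-merge candidate to a broadcast candidate purely because the size estimate is computed after the partition filter.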
yi.wu	ff39c9271c	[SPARK-30252][SQL] Disallow negative scale of Decimal ### What changes were proposed in this pull request? This PR proposes to disallow negative `scale` of `Decimal` in Spark. It brings two behavior changes: 1) for literals like `1.23E4BD` or `1.23E4` (with `spark.sql.legacy.exponentLiteralAsDecimal.enabled`=true, see [SPARK-29956](https://issues.apache.org/jira/browse/SPARK-29956)), we set the `(precision, scale)` to (5, 0) rather than (3, -2); 2) a negative `scale` check is added inside the decimal methods that allow setting `scale` explicitly. If the check fails, an `AnalysisException` is thrown. Users can still use `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled` to restore the previous behavior. ### Why are the changes needed? According to the SQL standard, > 4.4.2 Characteristics of numbers An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer. the scale of Decimal should always be non-negative. Other mainstream databases, like Presto and PostgreSQL, also don't allow negative scale. Presto: ``` presto:default> create table t (i decimal(2, -1)); Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'.
Expecting: <integer>, <type> create table t (i decimal(2, -1)) ``` PostgreSQL: ``` postgres=# create table t(i decimal(2, -1)); ERROR: NUMERIC scale -1 must be between 0 and precision 2 LINE 1: create table t(i decimal(2, -1)); ^ ``` And, actually, Spark itself already doesn't allow creating a table with negative-scale decimal types using SQL: ``` scala> spark.sql("create table t(i decimal(2, -1))"); org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28) == SQL == create table t(i decimal(2, -1)) ----------------------------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 35 elided ``` However, it is still possible to create such a table or `DataFrame` using the Spark SQL programming API: ``` scala> val tb = CatalogTable( TableIdentifier("test", None), CatalogTableType.MANAGED, CatalogStorageFormat.empty, StructType(StructField("i", DecimalType(2, -1) ) :: Nil)) ``` ``` scala> spark.sql("SELECT 1.23E4BD") res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)] ``` These two different behaviors could confuse users. On the other hand, even if a user creates such a table or `DataFrame` with a negative-scale decimal type, the data can't be written out using formats like `parquet` or `orc`, because these formats have their own check for negative scale and fail on it.
``` scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet") 19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2 at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309) at org.apache.parquet.schema.Types$Builder.named(Types.java:290) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:99) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388) at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` So, I think it would be better to disallow negative scale totally and make behaviors above be consistent. ### Does this PR introduce any user-facing change? Yes, if `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false`, user couldn't create Decimal value with negative scale anymore. ### How was this patch tested? Added new tests in `ExpressionParserSuite` and `DecimalSuite`; Updated `SQLQueryTestSuite`. Closes #26881 from Ngone51/nonnegative-scale. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 21:09:48 +08:00
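The `(3, -2)` versus `(5, 0)` change for `1.23E4` in SPARK-30252 can be reproduced with Python's standard `decimal` module. This is a sketch of the arithmetic only, not Spark's implementation: the function names and the `allow_negative_scale` flag are hypothetical, with the flag standing in for the legacy config mentioned above. Only the negative-scale case is modeled.

```python
from decimal import Decimal


def precision_and_scale(literal: str, allow_negative_scale: bool = False):
    """Derive (precision, scale) for a finite decimal literal.

    With allow_negative_scale=False (the new behavior), a negative scale is
    folded into the precision so that the scale becomes 0.
    """
    _sign, digits, exponent = Decimal(literal).as_tuple()
    precision = len(digits)   # number of significant digits
    scale = -exponent         # digits to the right of the decimal point
    if scale < 0 and not allow_negative_scale:
        # 1.23E4 is 123 * 10^2: widen the precision by the two trailing zeros.
        precision, scale = precision - scale, 0
    return precision, scale


def check_scale(scale: int, allow_negative_scale: bool = False) -> None:
    """Reject an explicitly supplied negative scale unless legacy mode is on."""
    if scale < 0 and not allow_negative_scale:
        raise ValueError(f"Negative scale is not allowed: {scale}")


new_behavior = precision_and_scale("1.23E4")
legacy_behavior = precision_and_scale("1.23E4", allow_negative_scale=True)
```

Both results describe the same value, 12300; the new form simply spends two extra digits of precision instead of a negative scale, which is what file formats such as Parquet accept.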
Kent Yao	af705421db	[SPARK-30593][SQL] Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI and no round trip ### What changes were proposed in this pull request? This reverts https://github.com/apache/spark/pull/26418 and files a new ticket under https://issues.apache.org/jira/browse/SPARK-30546 to better track interval behavior. ### Why are the changes needed? Revert the interval ISO/ANSI SQL Standard output, since we decided not to follow ANSI and there is no round trip. ### Does this PR introduce any user-facing change? no, not released yet ### How was this patch tested? existing uts Closes #27304 from yaooqinn/SPARK-30593. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 20:51:10 +08:00
Guy Khazma	2d59ca464e	[SPARK-30475][SQL] File source V2: Push data filters for file listing ### What changes were proposed in this pull request? Follow up on [SPARK-30428](https://github.com/apache/spark/pull/27112) which added support for partition pruning in File source V2. This PR implements the necessary changes in order to pass the `dataFilters` to the `listFiles`. This enables having `FileIndex` implementations which use the `dataFilters` for further pruning the file listing (see the discussion [here](https://github.com/apache/spark/pull/27112#discussion_r364757217)). ### Why are the changes needed? Datasources such as `csv` and `json` do not implement the `SupportsPushDownFilters` trait. In order to support data skipping uniformly for all file based data sources, one can override the `listFiles` method in a `FileIndex` implementation, which consults external metadata and prunes the list of files. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modifying the unit tests for v2 file sources to verify the `dataFilters` are passed Closes #27157 from guykhazma/PushdataFiltersInFileListing. Authored-by: Guy Khazma <guykhag@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-01-20 20:20:37 -08:00
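The kind of `FileIndex` that SPARK-30475 enables, one that consults external metadata to skip files, can be modeled as follows. This is a plain-Python sketch under stated assumptions: `FileStats`, `GreaterThan`, `may_match` and `list_files` are hypothetical names, and per-file min/max statistics are only one possible form of external metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical external metadata: min/max of one column per file."""
    path: str
    min_value: int
    max_value: int


@dataclass(frozen=True)
class GreaterThan:
    """Hypothetical data filter of the form `column > value`."""
    value: int


def may_match(stats: FileStats, data_filter: GreaterThan) -> bool:
    """A file can contain matching rows only if its max exceeds the bound."""
    return stats.max_value > data_filter.value


def list_files(index, data_filters):
    """Return only the paths whose metadata does not rule out every data filter."""
    return [
        stats.path
        for stats in index
        if all(may_match(stats, f) for f in data_filters)
    ]


index = [
    FileStats("part-0.csv", min_value=0, max_value=9),
    FileStats("part-1.csv", min_value=10, max_value=19),
    FileStats("part-2.csv", min_value=20, max_value=29),
]

# Files whose value range cannot satisfy `column > 15` are never listed.
selected = list_files(index, [GreaterThan(15)])
```

The pruning is conservative: a listed file may still contain no matching rows, so the filter must still be evaluated on the rows that are read.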
Maxim Gekk	94284c8ecc	[SPARK-30587][SQL][TESTS] Add test suites for CSV and JSON v1 ### What changes were proposed in this pull request? In the PR, I propose to make `JsonSuite` and `CSVSuite` abstract classes, and add sub-classes that check JSON/CSV datasource v1 and v2. ### Why are the changes needed? To improve test coverage and test JSON/CSV v1 which is still supported, and can be enabled by users. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running new test suites `JsonV1Suite` and `CSVv1Suite`. Closes #27294 from MaxGekk/csv-json-v1-test-suites. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 11:38:05 +08:00
Kent Yao	0388b7a3ec	[SPARK-30568][SQL] Invalidate interval type as a field table schema ### What changes were proposed in this pull request? After commit `d67b98ea01`, we are able to create or alter a table with interval column types if the external catalog accepts them, which deviates from the interval type's purpose of internal usage. Given the original purpose of `d67b98ea01`, it should only work through the cast logic. Instead of adding a type checker for the interval type command by command to work across catalogs, it is much simpler to treat interval as an invalid data type that can be identified by cast only. ### Why are the changes needed? enhance interval internal usage purpose. ### Does this PR introduce any user-facing change? No. Additionally, this PR restores user behavior when using the interval type to create/alter a table schema, e.g. for the hive catalog in 2.4, ```java Caused by: org.apache.spark.sql.catalyst.parser.ParseException: DataType calendarinterval is not supported.(line 1, pos 0) ``` for master after `d67b98ea01` ```java Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval' but 'interval' is found. at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:862) ``` Now, with this PR, we restore the type checker on the Spark side. ### How was this patch tested? add more ut Closes #27277 from yaooqinn/SPARK-30568. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 11:14:26 +08:00
Kent Yao	24efa43826	[SPARK-30019][SQL] Add the owner property to v2 table ### What changes were proposed in this pull request? Add the `owner` property to v2 tables. It is reserved by `TableCatalog` and indicates the table's owner. ### Why are the changes needed? enhance ownership management of catalog API ### Does this PR introduce any user-facing change? Yes, it adds 1 reserved property, `owner`, which can no longer be used in OPTIONS/TBLPROPERTIES unless legacy mode is on. ### How was this patch tested? add uts Closes #27249 from yaooqinn/SPARK-30019. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 10:37:49 +08:00
Maxim Gekk	fd69533593	[SPARK-30482][CORE][SQL][TESTS][FOLLOW-UP] Output caller info in log appenders while reaching the limit ### What changes were proposed in this pull request? In the PR, I propose to output an additional message from the tests where a log appender is added. The message is printed as a part of `IllegalStateException` when the limit on the maximum number of logged events is reached. ### Why are the changes needed? If a log appender is not removed from the log4j appenders list, the caller message could help to investigate the problem and find the test which doesn't remove the log appender. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the modified test suites `AvroSuite`, `CSVSuite`, `ResolveHintsSuite`, etc. Closes #27296 from MaxGekk/assign-name-to-log-appender. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-21 10:19:07 +09:00
yi.wu	f5b345cf3d	[SPARK-30578][SQL][TEST] Explicitly set conf to use DSv2 for orc in OrcFilterSuite ### What changes were proposed in this pull request? Explicitly set conf to let orc use DSv2 in `OrcFilterSuite` in both v1.2 and v2.3. ### Why are the changes needed? Tests should not rely on the default conf when they intentionally test something specific, because they can fail when the conf changes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27285 from Ngone51/fix-orcfilter-test. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-20 21:42:33 +08:00
Terry Kim	b5cb9abdd5	[SPARK-30535][SQL] Migrate ALTER TABLE commands to the new framework ### What changes were proposed in this pull request? Use the new framework to resolve the ALTER TABLE commands. This PR also refactors ALTER TABLE logical plans such that they extend a base class `AlterTable`. Each plan now implements `def changes: Seq[TableChange]` for any table change operations. Additionally, `UnresolvedV2Relation` and its usage is completely removed. ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900). ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated existing tests Closes #27243 from imback82/v2commands_newframework. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-20 21:33:44 +08:00
Josh Rosen	d50f8df929	[SPARK-30413][SQL] Avoid WrappedArray roundtrip in GenericArrayData constructor, plus related optimization in ParquetMapConverter ### What changes were proposed in this pull request? This PR implements a tiny performance optimization for a `GenericArrayData` constructor, avoiding an unnecessary roundtrip through `WrappedArray` when the provided value is already an array of objects. It also fixes a related performance problem in `ParquetRowConverter`. ### Why are the changes needed? `GenericArrayData` has a `this(seqOrArray: Any)` constructor, which was originally added in #13138 for use in `RowEncoder` (where we may not know concrete types until runtime) but is also called (perhaps unintentionally) in a few other code paths. In this constructor's existing implementation, a call to `new WrappedArray(Array[Object](""))` is dispatched to the `this(seqOrArray: Any)` constructor, where we then call `this(array.toSeq)`: this wraps the provided array into a `WrappedArray`, which is subsequently unwrapped in a `this(seq.toArray)` call. For an interactive example, see https://scastie.scala-lang.org/7jOHydbNTaGSU677FWA8nA This PR changes the `this(seqOrArray: Any)` constructor so that it calls the primary `this(array: Array[Any])` constructor, allowing us to save a `.toSeq.toArray` call; this comes at the cost of one additional `case` in the `match` statement (but I believe this has a negligible performance impact relative to the other savings). As code cleanup, I also reverted the JVM 1.7 workaround from #14271. I also fixed a related performance problem in `ParquetRowConverter`: previously, this code called `ArrayBasedMapData.apply` which, in turn, called the `this(Any)` constructor for `GenericArrayData`: this PR's micro-benchmarks show that this is _significantly_ slower than calling the `this(Array[Any])` constructor (and I also observed time spent here during other Parquet scan benchmarking work). 
To fix this performance problem, I replaced the call to the `ArrayBasedMapData.apply` method with direct calls to the `ArrayBasedMapData` and `GenericArrayData` constructors. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I tested this by running code in a debugger and by running microbenchmarks (which I've added to a new `GenericArrayDataBenchmark` in this PR): - With JDK8 benchmarks: this PR's changes more than double the performance of calls to the `this(Any)` constructor. Even after improvements, however, calls to the `this(Array[Any])` constructor are still ~60x faster than calls to `this(Any)` when passing a non-primitive array (thereby motivating this patch's other change in `ParquetRowConverter`). - With JDK11 benchmarks: the changes more-or-less completely eliminate the performance penalty associated with the `this(Any)` constructor. Closes #27088 from JoshRosen/joshrosen/GenericArrayData-optimization. Authored-by: Josh Rosen <rosenville@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-19 19:12:19 -08:00
Maxim Gekk	d4c6ec6ba7	[SPARK-30530][SQL] Fix filter pushdown for bad CSV records ### What changes were proposed in this pull request? In the PR, I propose to fix the bug reported in SPARK-30530. CSV datasource returns invalid records in the case when `parsedSchema` is shorter than number of tokens returned by UniVocity parser. In the case `UnivocityParser.convert()` always throws `BadRecordException` independently from the result of applying filters. For the described case, I propose to save the exception in `badRecordException` and continue value conversion according to `parsedSchema`. If a bad record doesn't pass filters, `convert()` returns empty Seq otherwise throws `badRecordException`. ### Why are the changes needed? It fixes the bug reported in the JIRA ticket. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new test from the JIRA ticket. Closes #27239 from MaxGekk/spark-30530-csv-filter-is-null. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-19 17:22:38 +08:00
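The control flow described for SPARK-30530 (remember the error, keep converting according to the schema, and let the filters decide whether the error is ever raised) can be sketched in plain Python. This is a simplified model, not the `UnivocityParser` code: the function signature, the `(name, cast)` schema representation and the callable filters are assumptions made for illustration.

```python
class BadRecordException(Exception):
    """Stand-in for the exception raised for malformed CSV records."""


def convert(tokens, parsed_schema, filters):
    """Convert one tokenized CSV line into a list with zero or one row.

    parsed_schema is a list of (column_name, cast_function) pairs and
    filters is a list of predicates over the converted row.
    """
    bad_record_exception = None
    if len(tokens) != len(parsed_schema):
        # Remember the problem instead of raising immediately.
        bad_record_exception = BadRecordException(
            f"expected {len(parsed_schema)} tokens, got {len(tokens)}"
        )

    # Continue converting according to the schema; missing tokens become None.
    row = {}
    for i, (name, cast) in enumerate(parsed_schema):
        row[name] = cast(tokens[i]) if i < len(tokens) else None

    # A record that the filters reject is dropped, whether it is bad or not.
    if not all(predicate(row) for predicate in filters):
        return []
    # Only a bad record that passes the filters surfaces the saved exception.
    if bad_record_exception is not None:
        raise bad_record_exception
    return [row]


schema = [("a", int)]
dropped = convert(["1", "2"], schema, [lambda r: r["a"] > 5])
kept = convert(["1"], schema, [lambda r: r["a"] == 1])
```

Here `dropped` is empty because the malformed line fails the filter, while the same line with a filter it satisfies would raise `BadRecordException`.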
Kent Yao	17857f9b8b	[SPARK-30551][SQL] Disable comparison for interval type ### What changes were proposed in this pull request? As we are not going to follow ANSI to implement year-month and day-time interval types, it is odd to compare the year-month part to the day-time part in our current implementation of the interval type. Additionally, the current ordering logic comes from PostgreSQL, where the implementation of the interval is messy, and we are not aiming for PostgreSQL compliance at all. This PR reverts https://github.com/apache/spark/pull/26681 and https://github.com/apache/spark/pull/26337 ### Why are the changes needed? To make the interval type more future-proof. ### Does this PR introduce any user-facing change? No, these are new in 3.0. ### How was this patch tested? existing uts shall work Closes #27262 from yaooqinn/SPARK-30551. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-19 15:27:51 +08:00
jiake	0d99d7e3f2	[SPARK-30524] [SQL] follow up SPARK-30524 to resolve comments ### What changes were proposed in this pull request? Resolve the remaining comments in [PR#27226](https://github.com/apache/spark/pull/27226). ### Why are the changes needed? Resolve the comments. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #27253 from JkSelf/followup-skewjoinoptimization2. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-19 15:10:05 +08:00
HyukjinKwon	a6bdea3ad4	[SPARK-30539][PYTHON][SQL] Add DataFrame.tail in PySpark ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/26809 added `Dataset.tail` API. It should be good to have it in PySpark API as well. ### Why are the changes needed? To support consistent APIs. ### Does this PR introduce any user-facing change? No. It adds a new API. ### How was this patch tested? Manually tested and doctest was added. Closes #27251 from HyukjinKwon/SPARK-30539. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-18 00:18:12 -08:00
Gabor Somogyi	abf759a91e	[SPARK-29876][SS] Delete/archive file source completed files in separate thread ### What changes were proposed in this pull request? [SPARK-20568](https://issues.apache.org/jira/browse/SPARK-20568) added the possibility to clean up completed files in streaming query. Deleting/archiving uses the main thread which can slow down processing. In this PR I've created thread pool to handle file delete/archival. The number of threads can be configured with `spark.sql.streaming.fileSource.cleaner.numThreads`. ### Why are the changes needed? Do file delete/archival in separate thread. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #26502 from gaborgsomogyi/SPARK-29876. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2020-01-17 10:45:36 -08:00
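The idea behind SPARK-29876, handing file deletion to a pool of worker threads so the processing thread is not blocked, can be sketched with Python's standard library. This is an illustration of the pattern only, not the Spark file source: `FileCleaner` and its methods are hypothetical, and `num_threads` merely plays the role of `spark.sql.streaming.fileSource.cleaner.numThreads`.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class FileCleaner:
    """Deletes completed files on a thread pool instead of the calling thread."""

    def __init__(self, num_threads: int):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)

    def clean(self, path: str):
        # submit() returns immediately; the delete runs on a worker thread.
        return self._pool.submit(os.remove, path)

    def stop(self) -> None:
        # Wait for all pending deletions before shutting down.
        self._pool.shutdown(wait=True)


# Create a few "completed" files in a scratch directory.
work_dir = tempfile.mkdtemp()
completed = []
for i in range(3):
    path = os.path.join(work_dir, f"batch-{i}.txt")
    with open(path, "w") as handle:
        handle.write("done")
    completed.append(path)

cleaner = FileCleaner(num_threads=2)
futures = [cleaner.clean(path) for path in completed]
cleaner.stop()

remaining = [path for path in completed if os.path.exists(path)]
os.rmdir(work_dir)
```

Archiving instead of deleting would follow the same shape, with a move operation submitted to the pool in place of `os.remove`.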
Luca Canali	fd308ade52	[SPARK-30041][SQL][WEBUI] Add Codegen Stage Id to Stage DAG visualization in Web UI ### What changes were proposed in this pull request? SPARK-29894 provides information on the Codegen Stage Id in WEBUI for SQL Plan graphs. Similarly, this proposes to add Codegen Stage Id in the DAG visualization for Stage execution. DAGs for Stage execution are available in the WEBUI under the Jobs and Stages tabs. ### Why are the changes needed? This is proposed as an aid for drill-down analysis of complex SQL statement execution, as it is not always easy to match parts of the SQL Plan graph with the corresponding Stage DAG execution graph. Adding Codegen Stage Id for WholeStageCodegen operations makes this task easier. ### Does this PR introduce any user-facing change? Stage DAG visualization in the WEBUI will show codegen stage id for WholeStageCodegen operations, as in the example snippet from the WEBUI, Jobs tab (the query used in the example is TPCDS 2.4 q14a): ![](https://issues.apache.org/jira/secure/attachment/12987461/Snippet_StagesDags_with_CodegenId%20_annotated.png) ### How was this patch tested? Manually tested, see also example snippet. Closes #26675 from LucaCanali/addCodegenStageIdtoStageGraph. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-18 01:00:45 +08:00
Terry Kim	64fe192fef	[SPARK-30282][SQL] Migrate SHOW TBLPROPERTIES to new framework ### What changes were proposed in this pull request? Use the new framework to resolve the SHOW TBLPROPERTIES command. This PR along with #27243 should update all the existing V2 commands with `UnresolvedV2Relation`. ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900). ### Does this PR introduce any user-facing change? Yes, `SHOW TBLPROPERTIES temp_view` now fails with an `AnalysisException` with the message `temp_view is a temp view not table`. Previously, it returned an empty row. ### How was this patch tested? Existing tests Closes #26921 from imback82/consistnet_v2command. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-17 16:51:44 +08:00
Wenchen Fan	0bd7a3dfab	[SPARK-29572][SQL] add v1 read fallback API in DS v2 ### What changes were proposed in this pull request? Add a `V1Scan` interface, so that data source v1 implementations can migrate to DS v2 much easier. ### Why are the changes needed? It's a lot of work to migrate v1 sources to DS v2. The new API added here can allow v1 sources to go through v2 code paths without implementing all the Batch, Stream, PartitionReaderFactory, ... stuff. We already have a v1 write fallback API after https://github.com/apache/spark/pull/25348 ### Does this PR introduce any user-facing change? no ### How was this patch tested? new test suite Closes #26231 from cloud-fan/v1-read-fallback. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-17 12:40:51 +08:00
jiake	6e5b4bf113	[SPARK-30524][SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle ### What changes were proposed in this pull request? The `OptimizeSkewedJoin` rule changes the `outputPartitioning` after inserting `PartialShuffleReaderExec` or `SkewedPartitionReaderExec`, so an additional shuffle may need to be introduced to ensure the right result. This PR disables the `OptimizeSkewedJoin` rule when it would introduce an additional shuffle. ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a new UT. Closes #27226 from JkSelf/followup-skewedoptimization. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-16 22:52:00 +08:00
Kent Yao	82f25f5855	[SPARK-30507][SQL] TableCatalog reserved properties shouldn't be changed via options or tblproperties ### What changes were proposed in this pull request? TableCatalog reserves some properties, e.g. `provider` and `location`, for internal usage. Some of them are static once created, and some of them need specific syntax to modify. Instead of using `OPTIONS (k='v')` or `TBLPROPERTIES (k='v')`, if k is a reserved TableCatalog property, we should use its specific syntax to add/modify/delete it. For example, `provider` is a reserved property, so we should use the `USING` clause to specify it, and should not allow `ALTER TABLE ... UNSET TBLPROPERTIES('provider')` to delete it. Also, there are two paths for v1/v2 catalog tables to resolve these properties, e.g. the v1 session catalog tables will only use the `USING` clause to decide `provider`, but v2 tables will also look up OPTIONS/TBLPROPERTIES (although there is a bug that prohibits it). Additionally, `path` is not reserved but holds special meaning for `LOCATION`, and it is used in `CREATE/REPLACE TABLE`'s `OPTIONS` sub-clause. Currently, for the session catalog tables `path` is case insensitive, but for the non-session catalog tables it is case sensitive; we should make it case insensitive for both for disambiguation. ### Why are the changes needed? Prevent reserved properties from being modified unexpectedly. Unify the property resolution for v1/v2. Fix some bugs. ### Does this PR introduce any user-facing change? Yes. 1. `location` and `provider` (case sensitive) cannot be used in `CREATE/REPLACE TABLE ... OPTIONS/TBLPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES (...)`; if legacy mode is on, they will be ignored so that the command succeeds without side effects. 3. Previously, `path` in `CREATE/REPLACE TABLE ... 
OPTIONS` was case insensitive for v1 but case sensitive for v2; it is now case insensitive for both kinds of tables, so v2 tables will also fail if `LOCATION` and `OPTIONS('PaTh' ='abc')` are both specified, or will pick `PaTh`'s value as the table location if `LOCATION` is missing. 4. We now detect whether there are two or more `path` keys that differ only in case in `CREATE/REPLACE TABLE ... OPTIONS`; previously v1 applied an unexpected last-win policy, and v2 was case sensitive. ### How was this patch tested? Added UTs. Closes #27197 from yaooqinn/SPARK-30507. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-16 21:46:07 +08:00
Maxim Gekk	018bdcc53c	[SPARK-30521][SQL][TESTS] Eliminate deprecation warnings for ExpressionInfo ### What changes were proposed in this pull request? In the PR, I propose to use non-deprecated constructor of `ExpressionInfo` in `SparkSessionExtensionSuite`, and pass valid strings as `examples`, `note`, `since` and `deprecated` parameters. ### Why are the changes needed? Using another constructor allows to eliminate the following deprecation warnings while compiling Spark: ``` Warning:(335, 5) constructor ExpressionInfo in class ExpressionInfo is deprecated: see corresponding Javadoc for more information. new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended usage"), Warning:(732, 5) constructor ExpressionInfo in class ExpressionInfo is deprecated: see corresponding Javadoc for more information. new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended usage"), Warning:(751, 5) constructor ExpressionInfo in class ExpressionInfo is deprecated: see corresponding Javadoc for more information. new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended usage"), ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By compiling and running `SparkSessionExtensionSuite`. Closes #27221 from MaxGekk/eliminate-expr-info-warnings. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-16 13:36:28 +09:00
Maxim Gekk	4e50f0291f	[SPARK-30323][SQL] Support filters pushdown in CSV datasource ### What changes were proposed in this pull request? In the PR, I propose to support pushed down filters in CSV datasource. The reason for pushing a filter to `UnivocityParser` is to apply the filter as soon as all its attributes become available, i.e. converted from CSV fields to desired values according to the schema. This allows skipping conversions of other values if the filter returns `false`. This can improve performance when pushed filters are highly selective and conversion of CSV string fields to desired values is comparatively expensive (for example, conversion to `TIMESTAMP` values). Here are details of the implementation: - `UnivocityParser.convert()` converts parsed CSV tokens one-by-one sequentially starting from index 0 up to `parsedSchema.length - 1`. At current index `i`, it applies filters that refer to attributes at row fields indexes `0..i`. If any filter returns `false`, it skips conversions of other input tokens. - Pushed filters are converted to expressions. The expressions are bound to row positions according to `requiredSchema`. The expressions are compiled to predicates via generating Java code. - To be able to apply predicates to partially initialized rows, the predicates are grouped, and combined via the `And` expression. Final predicate at index `N` can refer to row fields at the positions `0..N`, and can be applied to a row even if other fields at the positions `N+1..requiredSchema.length-1` are not set. ### Why are the changes needed? 
The changes improve performance on synthetic benchmarks more than 9 times (on JDK 8 & 11): ``` OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2 Intel(R) Core(TM) i7-4850HQ CPU 2.30GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ w/o filters 11889 11945 52 0.0 118893.1 1.0X pushdown disabled 11790 11860 115 0.0 117902.3 1.0X w/ filters 1240 1278 33 0.1 12400.8 9.6X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added new test suite `CSVFiltersSuite` - Added tests to `CSVSuite` and `UnivocityParserSuite` Closes #26973 from MaxGekk/csv-filters-pushdown. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-16 13:10:08 +09:00
Takeshi Yamamuro	a3a42b30d0	[SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression ### What changes were proposed in this pull request? This pr intends to add filter information in the explain output of an aggregate (This is a follow-up of #26656). Without this pr: ``` scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").explain(true) == Parsed Logical Plan == 'Aggregate ['k], ['k, unresolvedalias('SUM('v, ('v > 3)), None)] +- 'UnresolvedRelation [t] == Analyzed Logical Plan == k: int, sum(v): bigint Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) AS sum(v)#3L] +- SubqueryAlias `default`.`t` +- Relation[k#0,v#1] parquet == Optimized Logical Plan == Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) AS sum(v)#3L] +- Relation[k#0,v#1] parquet == Physical Plan == HashAggregate(keys=[k#0], functions=[sum(cast(v#1 as bigint))], output=[k#0, sum(v)#3L]) +- Exchange hashpartitioning(k#0, 200), true, [id=#20] +- HashAggregate(keys=[k#0], functions=[partial_sum(cast(v#1 as bigint))], output=[k#0, sum#7L]) +- (1) ColumnarToRow +- FileScan parquet default.t[k#0,v#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int,v:int> scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").show() +---+------+ \| k\|sum(v)\| +---+------+ +---+------+ ``` With this pr: ``` scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").explain(true) == Parsed Logical Plan == 'Aggregate ['k], ['k, unresolvedalias('SUM('v, ('v > 3)), None)] +- 'UnresolvedRelation [t] == Analyzed Logical Plan == k: int, sum(v) FILTER (v > 3): bigint Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) filter (v#1 > 3) AS sum(v) FILTER (v > 3)#5L] +- SubqueryAlias `default`.`t` +- Relation[k#0,v#1] parquet == Optimized Logical Plan == Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) filter (v#1 
> 3) AS sum(v) FILTER (v > 3)#5L] +- Relation[k#0,v#1] parquet == Physical Plan == HashAggregate(keys=[k#0], functions=[sum(cast(v#1 as bigint))], output=[k#0, sum(v) FILTER (v > 3)#5L]) +- Exchange hashpartitioning(k#0, 200), true, [id=#20] +- HashAggregate(keys=[k#0], functions=[partial_sum(cast(v#1 as bigint)) filter (v#1 > 3)], output=[k#0, sum#9L]) +- (1) ColumnarToRow +- FileScan parquet default.t[k#0,v#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int,v:int> scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").show() +---+---------------------+ \| k\|sum(v) FILTER (v > 3)\| +---+---------------------+ +---+---------------------+ ``` ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually. Closes #27198 from maropu/SPARK-27986-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-01-16 11:11:36 +09:00
Wenchen Fan	883ae331c3	[SPARK-30497][SQL] migrate DESCRIBE TABLE to the new framework ### What changes were proposed in this pull request? Use the new framework to resolve the DESCRIBE TABLE command. The v1 DESCRIBE TABLE command supports both table and view. Checked with Hive and Presto, they don't have DESCRIBE TABLE syntax but only DESCRIBE, which supports both table and view: 1. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DescribeTable/View/MaterializedView/Column 2. https://prestodb.io/docs/current/sql/describe.html We should make it clear that DESCRIBE support both table and view, by renaming the command to `DescribeRelation`. This PR also tunes the framework a little bit to support the case that a command accepts both table and view. ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: SPARK-29900. Note that I make a separate PR here instead of #26921, as I need to update the framework to support a new use case: accept both table and view. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27187 from cloud-fan/describe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-15 17:38:52 -08:00
Jungtaek Lim (HeartSaVioR)	e751bc66a0	[SPARK-30479][SQL] Apply compaction of event log to SQL events ### What changes were proposed in this pull request? This patch adds an event filter to handle SQL-related events. It is the next task after SPARK-29779 (#27085); please refer to the description of PR #27085 for the overall rationale of this patch. The following functionality will be addressed in later parts: * integrate compaction into FsHistoryProvider * documentation about new configuration ### Why are the changes needed? One of the major goals of SPARK-28594 is to prevent the event logs from becoming too large, and SPARK-29779 achieves that goal. There was a prior approach, but it required models in both KVStore and live entities to guarantee compatibility, which they are not designed to do. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UTs. Closes #27164 from HeartSaVioR/SPARK-30479. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2020-01-15 10:47:31 -08:00
Takeshi Yamamuro	5f6cd61913	[SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated ### What changes were proposed in this pull request? This pr intends to fix wrong aggregated values in `GROUPING SETS` when there are duplicated grouping sets in a query (e.g., `GROUPING SETS ((k1),(k1))`). For example; ``` scala> spark.table("t").show() +---+---+---+ \| k1\| k2\| v\| +---+---+---+ \| 0\| 0\| 3\| +---+---+---+ scala> sql("""select grouping_id(), k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2))""").show() +-------------+---+----+------+ \|grouping_id()\| k1\| k2\|sum(v)\| +-------------+---+----+------+ \| 0\| 0\| 0\| 9\| <---- wrong aggregate value and the correct answer is `3` \| 1\| 0\|null\| 3\| +-------------+---+----+------+ // PostgreSQL case postgres=# select k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2)); k1 \| k2 \| sum ----+------+----- 0 \| 0 \| 3 0 \| 0 \| 3 0 \| 0 \| 3 0 \| NULL \| 3 (4 rows) // Hive case hive> select GROUPING__ID, k1, k2, sum(v) from t group by k1, k2 grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2)); 1 0 NULL 3 0 0 0 3 ``` [MS SQL Server has the same behaviour with PostgreSQL](https://github.com/apache/spark/pull/26961#issuecomment-573638442). This pr follows the behaviour of PostgreSQL/SQL server; it adds one more virtual attribute in `Expand` for avoiding wrongly grouping rows with the same grouping ID. ### Why are the changes needed? To fix bugs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The existing tests. Closes #26961 from maropu/SPARK-29708. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-01-15 22:02:16 +09:00
Ajith	0c6bd3bd0b	[SPARK-27142][SQL] Provide REST API for SQL information ### What changes were proposed in this pull request? Currently, SQL information for monitoring a Spark application is not available from REST, only via the UI. REST provides only applications, jobs, stages, and environment. This JIRA aims to provide a REST API so that SQL-level information can be retrieved. A single SQL query can result in multiple jobs, so for an end user of STS or spark-sql, the highest level to probe is the SQL statement that was executed. This information can be seen in the SQL tab. Attaching a sample. ![image](https://user-images.githubusercontent.com/22072336/54298729-5524a800-45df-11e9-8e4d-b99a8b882031.png) However, the same information cannot be accessed using the REST API exposed by Spark, and the user always has to rely on the jobs API, which may be difficult. So I intend to expose the information seen in the SQL tab of the UI via the REST API. Mainly: Id : Long - execution id of the sql status : String - possible values COMPLETED/RUNNING/FAILED description : String - executed SQL string planDescription : String - Plan representation metrics : Seq[Metrics] - `Metrics` contain `metricName: String, metricValue: String` submissionTime : String - formatted `Date` time of SQL submission duration : Long - total run time in milliseconds runningJobIds : Seq[Int] - sequence of running job ids failedJobIds : Seq[Int] - sequence of failed job ids successJobIds : Seq[Int] - sequence of success job ids * To fetch sql executions: /sql?details=boolean&offset=integer&length=integer * To fetch single execution: /sql/{executionID}?details=boolean \| parameter \| type \| remarks \| \| ------------- \|:-------------:\| -----\| \| details \| boolean \| Optional. Set true to get plan description and metrics information, defaults to false \| \| offset \| integer \| Optional. offset to fetch the executions, defaults to 0 \| \| length \| integer \| Optional. 
total number of executions to be fetched, defaults to 20 \| ### Why are the changes needed? To support users query SQL information via REST API ### Does this PR introduce any user-facing change? Yes. It provides a new monitoring URL for SQL ### How was this patch tested? Tested manually ![image](https://user-images.githubusercontent.com/22072336/54282168-6d85ca00-45c1-11e9-8935-7586ccf0efff.png) ![image](https://user-images.githubusercontent.com/22072336/54282191-7b3b4f80-45c1-11e9-941c-f0ec37026192.png) Closes #24076 from ajithme/restapi. Lead-authored-by: Ajith <ajith2489@gmail.com> Co-authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2020-01-14 10:05:47 -08:00
Erik Erlandson	176b69642e	[SPARK-30423][SQL] Deprecate UserDefinedAggregateFunction ### What changes were proposed in this pull request? * Annotate UserDefinedAggregateFunction as deprecated by SPARK-27296 * Update user doc examples to reflect new ability to register typed Aggregator[IN, BUF, OUT] as an untyped aggregating UDF ### Why are the changes needed? UserDefinedAggregateFunction is being deprecated ### Does this PR introduce any user-facing change? Changes are to user documentation, and deprecation annotations. ### How was this patch tested? Testing was via package build to verify doc generation, deprecation warnings, and successful example compilation. Closes #27193 from erikerlandson/spark-30423. Authored-by: Erik Erlandson <eerlands@redhat.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-14 22:07:13 +08:00
jiake	a2aa966ef6	[SPARK-29544][SQL] optimize skewed partition based on data size ### What changes were proposed in this pull request? Skewed joins are common and can severely degrade the performance of queries, especially those with joins. This PR aims to optimize skewed joins based on the runtime map output statistics by adding the "OptimizeSkewedPartitions" rule. The detailed design doc is [here](https://docs.google.com/document/d/1NkXN-ck8jUOS0COz3f8LUW5xzF8j9HFjoZXWGGX2HAg/edit). Currently the "Inner, Cross, LeftSemi, LeftAnti, LeftOuter, RightOuter" join types are supported. ### Why are the changes needed? To optimize skewed partitions at runtime based on AQE. ### Does this PR introduce any user-facing change? No ### How was this patch tested? UT Closes #26434 from JkSelf/skewedPartitionBasedSize. Lead-authored-by: jiake <ke.a.jia@intel.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: JiaKe <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-14 20:31:44 +08:00
root1	e0efd213eb	[SPARK-30292][SQL] Throw Exception when invalid string is cast to numeric type in ANSI mode ### What changes were proposed in this pull request? If `spark.sql.ansi.enabled` is set, throw an exception when a cast to any numeric type does not follow the ANSI SQL standard. ### Why are the changes needed? The ANSI SQL standard does not allow invalid strings to be cast to numeric types and requires an exception in that case. Currently Spark SQL returns NULL in such cases. Before: `select cast('str' as decimal) => NULL` After: `select cast('str' as decimal) => invalid input syntax for type numeric: str` These results are after setting `spark.sql.ansi.enabled=true` ### Does this PR introduce any user-facing change? Yes. When ANSI mode is on, users will now get an arithmetic exception for invalid strings. ### How was this patch tested? Unit tests added. Closes #26933 from iRakson/castDecimalANSI. Lead-authored-by: root1 <raksonrakesh@gmail.com> Co-authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-14 17:03:10 +08:00
Maxim Gekk	88fc8dbc09	[SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests ### What changes were proposed in this pull request? In the PR, I propose to define a sub-class of `AppenderSkeleton` in `SparkFunSuite` and reuse it from other tests. The class stores incoming `LoggingEvent` in an array which is available to tests for future analysis of logged events. ### Why are the changes needed? This eliminates code duplication in tests. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites - `CSVSuite`, `OptimizerLoggingSuite`, `JoinHintSuite`, `CodeGenerationSuite` and `SQLConfSuite`. Closes #27166 from MaxGekk/dedup-appender-skeleton. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-01-14 16:03:10 +09:00
Maxim Gekk	1846b0261b	[SPARK-30500][SPARK-30501][SQL] Remove SQL configs deprecated in Spark 2.1 and 2.3 ### What changes were proposed in this pull request? In the PR, I propose to remove already deprecated SQL configs: - `spark.sql.variable.substitute.depth` deprecated in Spark 2.1 - `spark.sql.parquet.int64AsTimestampMillis` deprecated in Spark 2.3 Also I moved `removedSQLConfigs` closer to `deprecatedSQLConfigs`. This will allow to have references to other config entries. ### Why are the changes needed? To improve code maintainability. ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? By existing test suites `ParquetQuerySuite` and `SQLConfSuite`. Closes #27169 from MaxGekk/remove-deprecated-conf-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-14 11:06:48 +09:00
HyukjinKwon	6646b3e13e	Revert "[SPARK-28670][SQL] create function should thrown Exception if the resource is not found" This reverts commit `16e5e79877`.	2020-01-14 10:40:35 +09:00
jiake	b389b8c5f0	[SPARK-30188][SQL] Resolve the failed unit tests when enable AQE ### What changes were proposed in this pull request? Fix all the failed tests when enable AQE. ### Why are the changes needed? Run more tests with AQE to catch bugs, and make it easier to enable AQE by default in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests Closes #26813 from JkSelf/enableAQEDefault. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-13 22:55:19 +08:00
Dongjoon Hyun	28fc0437ce	[SPARK-28152][SQL][FOLLOWUP] Add a legacy conf for old MsSqlServerDialect numeric mapping ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/25248 . ### Why are the changes needed? The new behavior cannot access the existing table which is created by old behavior. This PR provides a way to avoid new behavior for the existing users. ### Does this PR introduce any user-facing change? Yes. This will fix the broken behavior on the existing tables. ### How was this patch tested? Pass the Jenkins and manually run JDBC integration test. ``` build/mvn install -DskipTests build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test ``` Closes #27184 from dongjoon-hyun/SPARK-28152-CONF. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-12 23:03:34 -08:00
Dongjoon Hyun	361583d1f5	[SPARK-30409][TEST][FOLLOWUP][HOTFIX] Remove dangling JSONBenchmark-jdk11-results.txt ### What changes were proposed in this pull request? This PR removes a dangling test result, `JSONBenchmark-jdk11-results.txt`. This causes a case-sensitive issue on Mac. ``` $ git clone https://gitbox.apache.org/repos/asf/spark.git spark-gitbox Cloning into 'spark-gitbox'... remote: Counting objects: 671717, done. remote: Compressing objects: 100% (258021/258021), done. remote: Total 671717 (delta 329181), reused 560390 (delta 228097) Receiving objects: 100% (671717/671717), 149.69 MiB \| 950.00 KiB/s, done. Resolving deltas: 100% (329181/329181), done. Updating files: 100% (16090/16090), done. warning: the following paths have collided (e.g. case-sensitive paths on a case-insensitive filesystem) and only one from the same colliding group is in the working tree: 'sql/core/benchmarks/JSONBenchmark-jdk11-results.txt' 'sql/core/benchmarks/JsonBenchmark-jdk11-results.txt' ``` ### Why are the changes needed? Previously, since the file name didn't match with `object JSONBenchmark`, it made a confusion when we ran the benchmark. So, `4e0e4e51c4` renamed `JSONBenchmark` to `JsonBenchmark`. However, at the same time frame, https://github.com/apache/spark/pull/26003 regenerated this file. Recently, https://github.com/apache/spark/pull/27078 regenerates the results with the correct file name, `JsonBenchmark-jdk11-results.txt`. So, we can remove the old one. ### Does this PR introduce any user-facing change? No. This is a test result. ### How was this patch tested? Manually check the following correctly generated files in the master. And, check this PR removes the dangling one. - https://github.com/apache/spark/blob/master/sql/core/benchmarks/JsonBenchmark-results.txt - https://github.com/apache/spark/blob/master/sql/core/benchmarks/JsonBenchmark-jdk11-results.txt Closes #27180 from dongjoon-hyun/SPARK-REMOVE. 
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-12 23:45:31 +00:00
Maxim Gekk	f5118f81e3	[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks ### What changes were proposed in this pull request? In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`. ### Why are the changes needed? To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Re-run all modified benchmarks using Amazon EC2. \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/10 \| - Run `TPCDSQueryBenchmark` using instructions from the PR #26049 ``` # `spark-tpcds-datagen` needs this. (JDK8) $ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4 $ export SPARK_HOME=$PWD $ ./build/mvn clean package -DskipTests # Generate data. 
(JDK8) $ git clone git@github.com:maropu/spark-tpcds-datagen.git $ cd spark-tpcds-datagen/ $ build/mvn clean package $ mkdir -p /data/tpcds $ ./bin/dsdgen --output-location /data/tpcds/s1 // This needs `Spark 2.4` ``` - Other benchmarks were run by the script: ``` #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'], ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'], ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'], ['hive/test', 
'org.apache.spark.sql.hive.orc.OrcReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #27078 from MaxGekk/noop-in-benchmarks. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-12 13:18:19 -08:00
Erik Erlandson	1f50a5875b	[SPARK-27296][SQL] Allows Aggregator to be registered as a UDF ## What changes were proposed in this pull request? Defines a new subclass of UDF: `UserDefinedAggregator`. Also allows `Aggregator` to be registered as a udf. Under the hood, the implementation is based on the internal `TypedImperativeAggregate` class that spark's predefined aggregators make use of. The effect is that custom user defined aggregators are now serialized only on partition boundaries instead of being serialized and deserialized at each input row. The two new modes of using `Aggregator` are as follows: ```scala val agg: Aggregator[IN, BUF, OUT] = // typed aggregator val udaf1 = UserDefinedAggregator(agg) val udaf2 = spark.udf.register("agg", agg) ``` ## How was this patch tested? Unit testing has been added that corresponds to the testing suites for `UserDefinedAggregateFunction`. Additionally, unit tests explicitly count the number of aggregator ser/de cycles to ensure that it is governed only by the number of data partitions. To evaluate the performance impact, I did two comparisons. The code and REPL results are recorded on [this gist](https://gist.github.com/erikerlandson/b0e106a4dbaf7f80b4f4f3a21f05f892) To characterize its behavior I benchmarked both a relatively simple aggregator and then an aggregator with a complex structure (a t-digest). ### performance The following compares the new `Aggregator` based aggregation against UDAF. In this scenario, the new aggregation is about 100x faster. The difference in performance impact depends on the complexity of the aggregator. For very simple aggregators (e.g. implementing 'sum', etc), the performance impact is more like 25-30%. 
```scala scala> import scala.util.Random._, org.apache.spark.sql.Row, org.apache.spark.tdigest._ import scala.util.Random._ import org.apache.spark.sql.Row import org.apache.spark.tdigest._ scala> val data = sc.parallelize(Vector.fill(50000){(nextInt(2), nextGaussian, nextGaussian.toFloat)}, 5).toDF("cat", "x1", "x2") data: org.apache.spark.sql.DataFrame = [cat: int, x1: double ... 1 more field] scala> val udaf = TDigestUDAF(0.5, 0) udaf: org.apache.spark.tdigest.TDigestUDAF = TDigestUDAF(0.5,0) scala> val bs = Benchmark.sample(10) { data.agg(udaf($"x1"), udaf($"x2")).first } bs: Array[(Double, org.apache.spark.sql.Row)] = Array((16.523,[TDigestSQL(TDigest(0.5,0,130,TDigestMap(-4.9171836327285225 -> (1.0, 1.0), -3.9615949140987685 -> (1.0, 2.0), -3.792874086327091 -> (0.7500781537109753, 2.7500781537109753), -3.720534874164185 -> (1.796754196108008, 4.546832349818983), -3.702105588052377 -> (0.4531676501810167, 5.0), -3.665883591332569 -> (2.3434687534153142, 7.343468753415314), -3.649982231368131 -> (0.6565312465846858, 8.0), -3.5914188829817744 -> (4.0, 12.0), -3.530472305581248 -> (4.0, 16.0), -3.4060489584449467 -> (2.9372251939818383, 18.93722519398184), -3.3000694035428486 -> (8.12412890252889, 27.061354096510726), -3.2250016655261877 -> (8.30564453211017, 35.3669986286209), -3.180537395623448 -> (6.001782561137285, 41.3687811... 
scala> bs.map(_._1) res0: Array[Double] = Array(16.523, 17.138, 17.863, 17.801, 17.769, 17.786, 17.744, 17.8, 17.939, 17.854) scala> val agg = TDigestAggregator(0.5, 0) agg: org.apache.spark.tdigest.TDigestAggregator = TDigestAggregator(0.5,0) scala> val udaa = spark.udf.register("tdigest", agg) udaa: org.apache.spark.sql.expressions.UserDefinedAggregator[Double,org.apache.spark.tdigest.TDigestSQL,org.apache.spark.tdigest.TDigestSQL] = UserDefinedAggregator(TDigestAggregator(0.5,0),None,true,true) scala> val bs = Benchmark.sample(10) { data.agg(udaa($"x1"), udaa($"x2")).first } bs: Array[(Double, org.apache.spark.sql.Row)] = Array((0.313,[TDigestSQL(TDigest(0.5,0,130,TDigestMap(-4.9171836327285225 -> (1.0, 1.0), -3.9615949140987685 -> (1.0, 2.0), -3.792874086327091 -> (0.7500781537109753, 2.7500781537109753), -3.720534874164185 -> (1.796754196108008, 4.546832349818983), -3.702105588052377 -> (0.4531676501810167, 5.0), -3.665883591332569 -> (2.3434687534153142, 7.343468753415314), -3.649982231368131 -> (0.6565312465846858, 8.0), -3.5914188829817744 -> (4.0, 12.0), -3.530472305581248 -> (4.0, 16.0), -3.4060489584449467 -> (2.9372251939818383, 18.93722519398184), -3.3000694035428486 -> (8.12412890252889, 27.061354096510726), -3.2250016655261877 -> (8.30564453211017, 35.3669986286209), -3.180537395623448 -> (6.001782561137285, 41.36878118... scala> bs.map(_._1) res1: Array[Double] = Array(0.313, 0.193, 0.175, 0.185, 0.174, 0.176, 0.16, 0.186, 0.171, 0.179) scala> ``` Closes #25024 from erikerlandson/spark-27296. Authored-by: Erik Erlandson <eerlands@redhat.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-12 15:18:30 +08:00
Liang-Chi Hsieh	b04407169b	[SPARK-30312][SQL][FOLLOWUP] Use inequality check instead to be robust ### What changes were proposed in this pull request? This is a followup to fix a brittle assert in a test case. ### Why are the changes needed? Original assert assumes that default permission is `rwxr-xr-x`, but in jenkins [env](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/6/testReport/junit/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/SPARK_30312__truncate_table___keep_acl_permission/) it could be `rwxrwxr-x`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27175 from viirya/hot-fix. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-11 13:19:04 -08:00
Liang-Chi Hsieh	b5bc3e12a6	[SPARK-30312][SQL] Preserve path permission and acl when truncate table ### What changes were proposed in this pull request? This patch proposes to preserve existing permission/acls of paths when truncate table/partition. ### Why are the changes needed? When Spark SQL truncates table, it deletes the paths of table/partitions, then re-create new ones. If permission/acls were set on the paths, the existing permission/acls will be deleted. We should preserve the permission/acls if possible. ### Does this PR introduce any user-facing change? Yes. When truncate table/partition, Spark will keep permission/acls of paths. ### How was this patch tested? Unit test. Manual test: 1. Create a table. 2. Manually change it permission/acl 3. Truncate table 4. Check permission/acl ```scala val df = Seq(1, 2, 3).toDF df.write.mode("overwrite").saveAsTable("test.test_truncate_table") val testTable = spark.table("test.test_truncate_table") testTable.show() +-----+ \|value\| +-----+ \| 1\| \| 2\| \| 3\| +-----+ // hdfs dfs -setfacl ... // hdfs dfs -getfacl ... sql("truncate table test.test_truncate_table") // hdfs dfs -getfacl ... val testTable2 = spark.table("test.test_truncate_table") testTable2.show() +-----+ \|value\| +-----+ +-----+ ``` ![Screen Shot 2019-12-30 at 3 12 15 PM](https://user-images.githubusercontent.com/68855/71604577-c7875a00-2b17-11ea-913a-ba88096d20ab.jpg) Closes #26956 from viirya/truncate-table-permission. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-10 11:46:28 -08:00
Zhenhua Wang	2bd8731813	[SPARK-30468][SQL] Use multiple lines to display data columns for show create table command ### What changes were proposed in this pull request? Currently data columns are displayed in one line for show create table command, when the table has many columns (to make things even worse, columns may have long names or comments), the displayed result is really hard to read. To improve readability, we print each column in a separate line. Note that other systems like Hive/MySQL also display in this way. Also, for data columns, table properties and options, we put the right parenthesis to the end of the last column/property/option, instead of occupying a separate line. ### Why are the changes needed? for better readability ### Does this PR introduce any user-facing change? before the change: ``` spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet OPTIONS ( `bar` '2', `foo` '1' ) TBLPROPERTIES ( 'a' = 'x', 'b' = 'y' ) ``` after the change: ``` spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet OPTIONS ( `bar` '2', `foo` '1') TBLPROPERTIES ( 'a' = 'x', 'b' = 'y') ``` ### How was this patch tested? modified existing tests Closes #27147 from wzhfy/multi_line_columns. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-10 10:55:53 -06:00
root1	2a629e5d10	[SPARK-30234][SQL] ADD FILE cannot add directories from sql CLI ### What changes were proposed in this pull request? Now users can add directories from sql CLI as well using ADD FILE command and setting spark.sql.addDirectory.recursive to true. ### Why are the changes needed? In SPARK-4687, support was added for adding directories as resources. But sql users cannot use that feature from CLI. `ADD FILE /path/to/folder` gives the following error: `org.apache.spark.SparkException: Added file /path/to/folder is a directory and recursive is not turned on.` Users need to turn on `recursive` for adding directories. Thus a configuration was required which will allow users to turn on `recursive`. Also Hive allow users to add directories from their shell. ### Does this PR introduce any user-facing change? Yes. Users can set recursive using `spark.sql.addDirectory.recursive`. ### How was this patch tested? Manually. Will add test cases soon. SPARK SCREENSHOTS When `spark.sql.addDirectory.recursive` is not turned on. ![Screenshot from 2019-12-13 08-02-13](https://user-images.githubusercontent.com/15366835/70765124-c6b4a100-1d7f-11ea-9352-9c010af5b38b.png) After setting `spark.sql.addDirectory.recursive` to true. ![Screenshot from 2019-12-13 08-02-59](https://user-images.githubusercontent.com/15366835/70765118-be5c6600-1d7f-11ea-9faf-0b1c46ee299b.png) HIVE SCREENSHOT ![Screenshot from 2019-12-13 14-44-41](https://user-images.githubusercontent.com/15366835/70788979-17e08700-1db8-11ea-9c0c-b6d6f6e80a35.png) `RELEASE_NOTES.txt` is text file while `dummy` is a directory. Closes #26863 from iRakson/SPARK-30234. Lead-authored-by: root1 <raksonrakesh@gmail.com> Co-authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-10 22:36:45 +09:00
Peter Toth	418f7dc973	[SPARK-30447][SQL] Constant propagation nullability issue ## What changes were proposed in this pull request? This PR fixes the `ConstantPropagation` rule, as the current implementation produces incorrect results in some cases. E.g. ``` SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) ``` returns those rows where `c` is null because of the `1 + 1 = 1` propagation, but it shouldn't. ## Why are the changes needed? To fix a bug. ## Does this PR introduce any user-facing change? Yes, fixes a bug. ## How was this patch tested? New UTs. Closes #27119 from peter-toth/SPARK-30447. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-01-10 21:42:10 +09:00
Kent Yao	bcf07cbf5f	[SPARK-30018][SQL] Support ALTER DATABASE SET OWNER syntax ### What changes were proposed in this pull request? In this pull request, we are going to support `SET OWNER` syntax for databases and namespaces, ```sql ALTER (DATABASE\|SCHEMA\|NAMESPACE) database_name SET OWNER [USER\|ROLE\|GROUP] user_or_role_group; ``` Before commit `332e252a14`, we didn't care much about ownership of catalog objects. In `332e252a14`, we decided to use properties to store ownership information, and temporarily used `alter database ... set dbproperties ...` to support switching ownership of a database. This PR aims to replace that with the formal syntax. In hive, `ownerName/Type` are fields of the database objects, but they can also be normal properties. ``` create schema test1 with dbproperties('ownerName'='yaooqinn') ``` The create/alter database syntax will not change the owner to `yaooqinn` but will store it in parameters. e.g. ``` +----------+----------+---------------------------------------------------------------+-------------+-------------+-----------------------+--+ \| db_name \| comment \| location \| owner_name \| owner_type \| parameters \| +----------+----------+---------------------------------------------------------------+-------------+-------------+-----------------------+--+ \| test1 \| \| hdfs://quickstart.cloudera:8020/user/hive/warehouse/test1.db \| anonymous \| USER \| {ownerName=yaooqinn} \| +----------+----------+---------------------------------------------------------------+-------------+-------------+-----------------------+--+ ``` In this pull request, because we make `ownerName` reserved, it will neither change the owner nor be stored in dbproperties; it is just omitted silently. ## Why are the changes needed? Formal syntax support for changing database ownership ### Does this PR introduce any user-facing change? yes, add a new syntax ### How was this patch tested? add unit tests Closes #26775 from yaooqinn/SPARK-30018. 
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-10 16:47:08 +08:00
Wenchen Fan	0ec0355611	[SPARK-30439][SQL] Support non-nullable column in CREATE TABLE, ADD COLUMN and ALTER TABLE ### What changes were proposed in this pull request? Allow users to specify NOT NULL in CREATE TABLE and ADD COLUMN column definition, and add a new SQL syntax to alter column nullability: ALTER TABLE ... ALTER COLUMN SET/DROP NOT NULL. This is a SQL standard syntax: ``` <alter column definition> ::= ALTER [ COLUMN ] <column name> <alter column action> <alter column action> ::= <set column default clause> \| <drop column default clause> \| <set column not null clause> \| <drop column not null clause> \| ... <set column not null clause> ::= SET NOT NULL <drop column not null clause> ::= DROP NOT NULL ``` ### Why are the changes needed? Previously we don't support it because the table schema in hive catalog are always nullable. Since we have catalog plugin now, it makes more sense to support NOT NULL at spark side, and let catalog implementations to decide if they support it or not. ### Does this PR introduce any user-facing change? Yes, this is a new feature ### How was this patch tested? new tests Closes #27110 from cloud-fan/nullable. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-10 10:34:46 +09:00
Maxim Gekk	1ffa627ffb	[SPARK-30416][SQL] Log a warning for deprecated SQL config in `set()` and `unset()` ### What changes were proposed in this pull request? 1. Put all deprecated SQL configs into the map `SQLConf.deprecatedSQLConfigs` with extra info about when configs were deprecated and additional comments that explain why a config was deprecated and what a user can use instead. Here is the list of already deprecated configs: - spark.sql.hive.verifyPartitionPath - spark.sql.execution.pandas.respectSessionTimeZone - spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName - spark.sql.parquet.int64AsTimestampMillis - spark.sql.variable.substitute.depth - spark.sql.execution.arrow.enabled - spark.sql.execution.arrow.fallback.enabled 2. Output a warning in `set()` and `unset()` about deprecated SQL configs ### Why are the changes needed? This should improve UX with Spark SQL and notify users about already deprecated SQL configs. ### Does this PR introduce any user-facing change? Yes, before: ``` spark-sql> set spark.sql.hive.verifyPartitionPath=true; spark.sql.hive.verifyPartitionPath true ``` After: ``` spark-sql> set spark.sql.hive.verifyPartitionPath=true; 20/01/03 21:28:17 WARN RuntimeConfig: The SQL config 'spark.sql.hive.verifyPartitionPath' has been deprecated in Spark v3.0.0 and may be removed in the future. This config is replaced by spark.files.ignoreMissingFiles. spark.sql.hive.verifyPartitionPath true ``` ### How was this patch tested? Added a new test which registers a new log appender and catches all logging to check that `set()` and `unset()` log any warning. Closes #27092 from MaxGekk/group-deprecated-sql-configs. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-10 10:32:36 +09:00
shane knapp	4d23938893	[MINOR][SQL][TEST-HIVE1.2] Fix scalastyle error due to length line in hive-1.2 profile ### What changes were proposed in this pull request? fixing a broken build: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/3/console ### Why are the changes needed? the build is teh borked! ### Does this PR introduce any user-facing change? newp ### How was this patch tested? by the build system Closes #27156 from shaneknapp/fix-scala-style. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2020-01-09 15:28:45 -08:00
yi.wu	c0e9f9ffb1	[SPARK-30459][SQL] Fix ignoreMissingFiles/ignoreCorruptFiles in data source v2 ### What changes were proposed in this pull request? Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2: When `FilePartitionReader` finds a missing or corrupt file, it should skip it and continue reading the next file rather than stopping, as it currently does. ### Why are the changes needed? ignoreMissingFiles/ignoreCorruptFiles in DSv2 behaves incorrectly compared to DSv1. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Updated the existing test for `ignoreMissingFiles`. Note I didn't update tests for `ignoreCorruptFiles`, because various datasources have tests for `ignoreCorruptFiles`. So I'm not sure if it's worth touching all those tests, since the basic logic of `ignoreCorruptFiles` should be the same as that of `ignoreMissingFiles`. Closes #27136 from Ngone51/improve-missing-files. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-01-09 11:35:29 -08:00
Burak Yavuz	f8d59572b0	[SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider ### What changes were proposed in this pull request? This PR introduces `SupportsCatalogOptions` as an interface for `TableProvider`. Through `SupportsCatalogOptions`, V2 DataSources can implement the two methods `extractIdentifier` and `extractCatalog` to support the creation, and existence check of tables without requiring a formal TableCatalog implementation. We currently don't support all SaveModes for DataSourceV2 in DataFrameWriter.save. The idea here is that eventually File based tables can be written with `DataFrameWriter.save(path)` will create a PathIdentifier where the name is `path`, and the V2SessionCatalog will be able to perform FileSystem checks at `path` to support ErrorIfExists and Ignore SaveModes. ### Why are the changes needed? To support all Save modes for V2 data sources with DataFrameWriter. Since we can now support table creation, we will be able to provide partitioning information when first creating the table as well. ### Does this PR introduce any user-facing change? Introduces a new interface ### How was this patch tested? Will add tests once interface is vetted. Closes #26913 from brkyvz/catalogOptions. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-01-09 11:18:16 -08:00
Gengliang Wang	94fc0e3235	[SPARK-30428][SQL] File source V2: support partition pruning ### What changes were proposed in this pull request? File source V2: support partition pruning. Note: subquery predicates are not pushed down for partition pruning even after this PR, due to the limitation for the current data source V2 API and framework. The rule `PlanSubqueries` requires the subquery expression to be in the children or class parameters in `SparkPlan`, while the condition is not satisfied for `BatchScanExec`. ### Why are the changes needed? It's important for reading performance. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New unit tests for all the V2 file sources Closes #27112 from gengliangwang/PartitionPruningInFileScan. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-09 21:53:37 +08:00
Kent Yao	c37312342e	[SPARK-30183][SQL] Disallow to specify reserved properties in CREATE/ALTER NAMESPACE syntax ### What changes were proposed in this pull request? Currently, COMMENT and LOCATION are reserved properties for Datasource v2 namespaces. They can be set via specific clauses and via properties, and the ones specified in clauses take precedence over properties. Since they are reserved, they should not be set directly via properties. They should be used in COMMENT/LOCATION clauses ONLY. ### Why are the changes needed? To make reserved properties actually reserved. ### Does this PR introduce any user-facing change? yes, 'location' and 'comment' are not allowed in db properties ### How was this patch tested? UNIT tests. Closes #26806 from yaooqinn/SPARK-30183. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-09 10:52:36 +08:00
HyukjinKwon	ee8d661058	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package ### What changes were proposed in this pull request? This PR proposes to move pandas related functionalities into pandas package. Namely: ```bash pyspark/sql/pandas ├── __init__.py ├── conversion.py # Conversion between pandas <> PySpark DataFrames ├── functions.py # pandas_udf ├── group_ops.py # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply ├── map_ops.py # Map Iter UDF + mapInPandas ├── serializers.py # pandas <> PyArrow serializers ├── types.py # Type utils between pandas <> PyArrow └── utils.py # Version requirement checks ``` In order to separately locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` under `pandas` sub-package, I had to use a mix-in approach which Scala side uses often by `trait`, and also pandas itself uses this approach (see `IndexOpsMixin` as an example) to group related functionalities. Currently, you can think it's like Scala's self typed trait. See the structure below: ```python class PandasMapOpsMixin(object): def mapInPandas(self, ...): ... return ... # other Pandas <> PySpark APIs ``` ```python class DataFrame(PandasMapOpsMixin): # other DataFrame APIs equivalent to Scala side. ``` Yes, This is a big PR but they are mostly just moving around except one case `createDataFrame` which I had to split the methods. ### Why are the changes needed? There are pandas functionalities here and there and I myself gets lost where it was. Also, when you have to make a change commonly for all of pandas related features, it's almost impossible now. Also, after this change, `DataFrame` and `SparkSession` become more consistent with Scala side since pandas is specific to Python, and this change separates pandas-specific APIs away from `DataFrame` or `SparkSession`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? 
Existing tests should cover. Also, I manually built the PySpark API documentation and checked. Closes #27109 from HyukjinKwon/pandas-refactoring. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-09 10:22:50 +09:00
maryannxue	af2d3d0179	[SPARK-30315][SQL] Add adaptive execution context ### What changes were proposed in this pull request? This is a minor code refactoring PR. It creates an adaptive execution context class to wrap objects shared across main query and sub-queries. ### Why are the changes needed? This refactoring will improve code readability and reduce the number of parameters used to initialize `AdaptiveSparkPlanExec`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Passed existing UTs. Closes #26959 from maryannxue/aqe-context. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-08 16:11:46 -08:00
Jungtaek Lim (HeartSaVioR)	bd7510bcb7	[SPARK-30281][SS] Consider partitioned/recursive option while verifying archive path on FileStreamSource ### What changes were proposed in this pull request? This patch renews the verification logic of archive path for FileStreamSource, as we found the logic doesn't take partitioned/recursive options into account. Before the patch, it only requires the archive path to have depth more than 2 (two subdirectories from root), leveraging the fact that FileStreamSource normally reads the files where the parent directory matches the pattern or the file itself matches the pattern. Given that the 'archive' operation moves the files to the base archive path while retaining the full path, the archive path tends to be safe if the depth is more than 2, meaning FileStreamSource doesn't re-read archived files as new source files. With partitioned/recursive options, this no longer holds, as FileStreamSource can read any files at any depth of subdirectories for the source pattern. To deal with this correctly, we have to renew the verification logic, which may not be intuitive or simple but works for all cases. The new verification logic prevents both cases: 1) archive path matches with source pattern as "prefix" (the depth of archive path > the depth of source pattern) e.g. * source pattern: `/hello/spar?` * archive path: `/hello/spark/structured/streaming` Any files in archive path will match with source pattern when recursive option is enabled. 2) source pattern matches with archive path as "prefix" (the depth of source pattern > the depth of archive path) e.g. * source pattern: `/hello/spar?/structured/hello2` * archive path: `/hello/spark/structured` Some archive files will not match with source pattern, e.g. file path: `/hello/spark/structured/hello2`, then final archived path: `/hello/spark/structured/hello/spark/structured/hello2`. But some other archive files will still match with source pattern, e.g. 
file path: `/hello2/spark/structured/hello2`, then final archived path: `/hello/spark/structured/hello2/spark/structured/hello2` which matches with source pattern when recursive is enabled. Implicitly, it also prevents the archive path from fully matching the source pattern (same depth). We want to prevent any source files from being archived and then added to new source files again, so the patch takes the most restrictive approach to prevent the possible cases. ### Why are the changes needed? Without this patch, there's a chance archived files are included as new source files when partitioned/recursive option is enabled, as the current condition doesn't take these options into account. ### Does this PR introduce any user-facing change? Only for Spark 3.0.0-preview (only preview 1 for now, but possibly preview 2 as well) - end users are required to provide an archive path that satisfies somewhat more complicated conditions, instead of simply one with depth higher than 2. ### How was this patch tested? New UT. Closes #26920 from HeartSaVioR/SPARK-30281. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2020-01-08 09:15:41 -08:00
zhengruifeng	a93b996635	[MINOR][ML][INT] Array.fill(0) -> Array.ofDim; Array.empty -> Array.emptyIntArray ### What changes were proposed in this pull request? 1, for primitive types `Array.fill(n)(0)` -> `Array.ofDim(n)`; 2, for `AnyRef` types `Array.fill(n)(null)` -> `Array.ofDim(n)`; 3, for primitive types `Array.empty[XXX]` -> `Array.emptyXXXArray` ### Why are the changes needed? `Array.ofDim` avoids assignments; `Array.emptyXXXArray` avoids creating a new object. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #27133 from zhengruifeng/minor_fill_ofDim. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-09 00:07:42 +09:00
Zhenhua Wang	fa36966b1e	[SPARK-30410][SQL] Calculating size of table with large number of partitions causes flooding logs ### What changes were proposed in this pull request? For a partitioned table, if the number of partitions is very large, e.g. tens of thousands or even larger, calculating its total size floods the logs. The flooding happens in: 1. `calculateLocationSize` logs the start and end of calculating the location size, and it is called per partition; 2. `bulkListLeafFiles` prints all partition paths. This PR simplifies the logging when calculating the size of a partitioned table. ### How was this patch tested? not related Closes #27079 from wzhfy/improve_log. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-08 08:22:38 -06:00
fuwhu	047bff06c3	[SPARK-30215][SQL] Remove PrunedInMemoryFileIndex and merge its functionality into InMemoryFileIndex ### What changes were proposed in this pull request? Remove PrunedInMemoryFileIndex and merge its functionality into InMemoryFileIndex. ### Why are the changes needed? PrunedInMemoryFileIndex is only used in CatalogFileIndex.filterPartitions, and its name is kind of confusing, we can completely merge its functionality into InMemoryFileIndex and remove the class. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #26850 from fuwhu/SPARK-30215. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-08 20:28:15 +08:00
Terry Kim	b2ed6d0b88	[SPARK-30214][SQL][FOLLOWUP] Remove statement logical plans for namespace commands ### What changes were proposed in this pull request? This is a follow-up to address the following comment: https://github.com/apache/spark/pull/27095#discussion_r363152180 Currently, a SQL command string is parsed to a "statement" logical plan, converted to a logical plan with catalog/namespace, then finally converted to a physical plan. With the new resolution framework, there is no need to create a "statement" logical plan; a logical plan can contain `UnresolvedNamespace` which will be resolved to a `ResolvedNamespace`. This should simplify the code base and make it a bit easier to add a new command. ### Why are the changes needed? Clean up codebase. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests should cover the changes. Closes #27125 from imback82/SPARK-30214-followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-08 19:33:19 +08:00
Zhenhua Wang	9535776e28	[SPARK-30302][SQL] Complete info for show create table for views ### What changes were proposed in this pull request? Add table/column comments and table properties to the result of show create table of views. ### Does this PR introduce any user-facing change? After this patch, the result of show create table for views can contain table/column comments and table properties if they exist. ### How was this patch tested? add new tests Closes #26944 from wzhfy/complete_show_create_view. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-01-08 11:28:37 +09:00
Pablo Langa	9479887ba1	[SPARK-30039][SQL] CREATE FUNCTION should do multi-catalog resolution ### What changes were proposed in this pull request? Add CreateFunctionStatement and make CREATE FUNCTION go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing CREATE FUNCTION namespace.function ### Does this PR introduce any user-facing change? Yes. When running CREATE FUNCTION namespace.function Spark fails the command if the current catalog is set to a v2 catalog. ### How was this patch tested? Unit tests. Closes #26890 from planga82/feature/SPARK-30039_CreateFunctionV2Command. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-08 00:38:15 +08:00
Kent Yao	8c121b0827	[SPARK-30431][SQL] Update SqlBase.g4 to create commentSpec pattern like locationSpec ### What changes were proposed in this pull request? In `SqlBase.g4`, the `comment` clause is used as `COMMENT comment=STRING` and `COMMENT STRING` in many places. While the `location` clause often appears along with the `comment` clause with a pattern defined as ```sql locationSpec : LOCATION STRING ; ``` Then, we have to visit `locationSpec` as a `List` but comment as a single token. We defined `commentSpec` for the comment clause to simplify and unify the grammar and the invocations. ### Why are the changes needed? To simplify the grammar. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27102 from yaooqinn/SPARK-30431. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-07 22:12:09 +08:00
Terry Kim	314e70fe23	[SPARK-30214][SQL] V2 commands resolves namespaces with new resolution framework ### What changes were proposed in this pull request? #26847 introduced new framework for resolving catalog/namespaces. This PR proposes to integrate commands that need to resolve namespaces into the new framework. ### Why are the changes needed? This is one of the work items for moving into the new resolution framework. Resolving v1/v2 tables with the new framework will be followed up in different PRs. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests should cover the changes. Closes #27095 from imback82/unresolved_ns. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-07 21:32:08 +08:00
HyukjinKwon	866b7df348	[SPARK-30335][SQL][DOCS] Add a note first, last, collect_list and collect_set can be non-deterministic in SQL function docs as well ### What changes were proposed in this pull request? This PR adds a note that first and last can be non-deterministic to the SQL function docs as well. This is already documented in `functions.scala`. ### Why are the changes needed? Some people read only the SQL docs. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Jenkins will test. Closes #27099 from HyukjinKwon/SPARK-30335. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-07 14:31:59 +09:00
Josh Rosen	7a1a5db35f	[SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors ### What changes were proposed in this pull request? This PR implements multiple performance optimizations for `ParquetRowConverter`, achieving some modest constant-factor wins for all fields and larger wins for map and array fields: - Add `private[this]` to several `val`s (90cebf080a5d3857ea8cf2a89e8e060b8b5a2fbf) - Keep a `fieldUpdaters` array, saving two`.updater()` calls per field (7318785d350cc924198d7514e40973fd76d54ad5): I suspect that these are often megamorphic calls, so cutting these out seems like it could be a relatively large performance win. - Only call `currentRow.numFields` once per `start()` call (e05de150813b639929c18af1df09ec718d2d16fc): previously we'd call it once per field and this had a significant enough cost that it was visible during profiling. - Reuse buffers in array and map converters (c7d1534685fbad5d2280b082f37bed6d75848e76, 6d16f596ef6af9fd8946a062f79d0eeace9e1959): previously we would create a brand-new Scala `ArrayBuffer` for each field read, but this isn't actually necessary because the data is already copied into a fresh array when `end()` constructs a `GenericArrayData`. ### Why are the changes needed? To improve Parquet read performance; this is complementary to #26993's (orthogonal) improvements for nested struct read performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests, plus manual benchmarking with both synthetic and realistic schemas (similar to the ones in #26993). I've seen ~10%+ improvements in scan performance on certain real-world datasets. Closes #27089 from JoshRosen/joshrosen/more-ParquetRowConverter-optimizations. Lead-authored-by: Josh Rosen <rosenville@gmail.com> Co-authored-by: Josh Rosen <joshrosen@stripe.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-07 14:30:10 +09:00
Josh Rosen	93d3ab88cd	[SPARK-30338][SQL] Avoid unnecessary InternalRow copies in ParquetRowConverter ### What changes were proposed in this pull request? This PR modifies `ParquetRowConverter` to remove unnecessary `InternalRow.copy()` calls for structs that are directly nested in other structs. ### Why are the changes needed? These changes can significantly improve performance when reading Parquet files that contain deeply-nested structs with many fields. The `ParquetRowConverter` uses per-field `Converter`s for handling individual fields. Internally, these converters may have mutable state and may return mutable objects. In most cases, each `converter` is only invoked once per Parquet record (this is true for top-level fields, for example). However, arrays and maps may call their child element converters multiple times per Parquet record: in these cases we must be careful to copy any mutable outputs returned by child converters. In the existing code, `InternalRow`s are copied whenever they are stored into _any_ parent container (not just maps and arrays). This copying can be especially expensive for deeply-nested fields, since a deep copy is performed at every level of nesting. This PR modifies the code to avoid copies for structs that are directly nested in structs; see inline code comments for an argument for why this is safe. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Correctness: I added new test cases to `ParquetIOSuite` to increase coverage of nested structs, including structs nested in arrays: previously this suite didn't test that case, so we used to lack mutation coverage of this `copy()` code (the suite's tests still passed if I incorrectly removed the `.copy()` in all cases). I also added a test for maps with struct keys and modified the existing "map with struct values" test case include maps with two elements (since the incorrect omission of a `copy()` can only be detected if the map has multiple elements). 
Performance: I put together a simple local benchmark demonstrating the performance problems: First, construct a nested schema: ```scala case class Inner( f1: Int, f2: Long, f3: String, f4: Int, f5: Long, f6: String, f7: Int, f8: Long, f9: String, f10: Int ) case class Wrapper1(inner: Inner) case class Wrapper2(wrapper1: Wrapper1) case class Wrapper3(wrapper2: Wrapper2) ``` `Wrapper3`'s schema looks like: ``` root \|-- wrapper2: struct (nullable = true) \| \|-- wrapper1: struct (nullable = true) \| \| \|-- inner: struct (nullable = true) \| \| \| \|-- f1: integer (nullable = true) \| \| \| \|-- f2: long (nullable = true) \| \| \| \|-- f3: string (nullable = true) \| \| \| \|-- f4: integer (nullable = true) \| \| \| \|-- f5: long (nullable = true) \| \| \| \|-- f6: string (nullable = true) \| \| \| \|-- f7: integer (nullable = true) \| \| \| \|-- f8: long (nullable = true) \| \| \| \|-- f9: string (nullable = true) \| \| \| \|-- f10: integer (nullable = true) ``` Next, generate some fake data: ```scala val data = spark.range(1, 1000 * 1000 * 25, 1, 1).map { i => Wrapper3(Wrapper2(Wrapper1(Inner( i.toInt, i * 2, (i * 3).toString, (i * 4).toInt, i * 5, (i * 6).toString, (i * 7).toInt, i * 8, (i * 9).toString, (i * 10).toInt )))) } data.write.mode("overwrite").parquet("/tmp/parquet-test") ``` I then ran a simple benchmark consisting of ``` spark.read.parquet("/tmp/parquet-test").selectExpr("hash()").rdd.count() ``` where the `hash()` is designed to force decoding of all Parquet fields but avoids `RowEncoder` costs in the `.rdd.count()` stage. In the old code, expensive copying takes place at every level of nesting; this is apparent in the following flame graph: ![image](https://user-images.githubusercontent.com/50748/71389014-88a15380-25af-11ea-9537-3e87a2aef179.png) After this PR's changes, the above toy benchmark runs ~30% faster. Closes #26993 from JoshRosen/joshrosen/faster-parquet-nested-scan-by-avoiding-copies. 
Lead-authored-by: Josh Rosen <rosenville@gmail.com> Co-authored-by: Josh Rosen <joshrosen@stripe.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-07 13:01:37 +08:00
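The copy-safety argument in SPARK-30338 above can be illustrated with a minimal plain-Scala sketch. It does not use Spark or Parquet; `ReusingChild`, `collectElements` and `readNestedStruct` are hypothetical names invented for this illustration, standing in for a child converter that returns the same mutable object on every call.

```scala
import scala.collection.mutable.ArrayBuffer

object ConverterCopySketch {
  // Hypothetical stand-in for a child converter that reuses one mutable buffer.
  final class ReusingChild {
    private val buffer = Array(0)
    def convert(value: Int): Array[Int] = { buffer(0) = value; buffer }
  }

  // Array-like parent: the child is invoked once per element, so each result
  // must be copied before the next call overwrites the shared buffer.
  def collectElements(values: Seq[Int], copy: Boolean): Seq[Int] = {
    val child = new ReusingChild
    val out = ArrayBuffer.empty[Array[Int]]
    values.foreach { v =>
      val row = child.convert(v)
      out += (if (copy) row.clone() else row)
    }
    out.map(_(0)).toSeq
  }

  // Struct-like parent: the child is invoked once per record and its output is
  // consumed before the next record, so holding a reference is enough.
  def readNestedStruct(value: Int): Int = {
    val child = new ReusingChild
    child.convert(value)(0)
  }
}
```

With `copy = false`, every collected element aliases the same buffer, which is why the missing `copy()` only shows up when a container holds more than one element.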
Yuming Wang	17881a467a	[SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty ## What changes were proposed in this pull request? We invalidate the table relation once table data is changed by [SPARK-21237](https://issues.apache.org/jira/browse/SPARK-21237). But there is one situation in which we do not invalidate it (`spark.sql.statistics.size.autoUpdate.enabled=false` and `table.stats.isEmpty`): `07c4b9bd1f/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (L44-L54)` This causes several issues, e.g. [SPARK-19784](https://issues.apache.org/jira/browse/SPARK-19784), [SPARK-19845](https://issues.apache.org/jira/browse/SPARK-19845), [SPARK-25403](https://issues.apache.org/jira/browse/SPARK-25403), [SPARK-25332](https://issues.apache.org/jira/browse/SPARK-25332) and [SPARK-28413](https://issues.apache.org/jira/browse/SPARK-28413). This is an example to reproduce [SPARK-19784](https://issues.apache.org/jira/browse/SPARK-19784): ```scala val path = "/tmp/spark/parquet" spark.sql("CREATE TABLE t (a INT) USING parquet") spark.sql("INSERT INTO TABLE t VALUES (1)") spark.range(5).toDF("a").write.parquet(path) spark.sql(s"ALTER TABLE t SET LOCATION '${path}'") spark.table("t").count() // return 1 spark.sql("refresh table t") spark.table("t").count() // return 5 ``` This PR invalidates the table relation in this case (`spark.sql.statistics.size.autoUpdate.enabled=false` and `table.stats.isEmpty`) to fix this issue. ## How was this patch tested? unit tests Closes #22721 from wangyum/SPARK-25403. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-07 11:41:34 +08:00
Ximo Guanter	604d6799df	[SPARK-30226][SQL] Remove withXXX functions in WriteBuilder ### What changes were proposed in this pull request? Adding a `LogicalWriteInfo` interface as suggested by cloud-fan in https://github.com/apache/spark/pull/25990#issuecomment-555132991 ### Why are the changes needed? It provides compile-time guarantees where we previously had none, which will make it harder to introduce bugs in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Compiles and passes tests Closes #26678 from edrevo/add-logical-write-info. Lead-authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com> Co-authored-by: Ximo Guanter Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-06 23:53:45 +08:00
angerszhu	3eade744f8	[SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalarSubquery to optimize perf ### What changes were proposed in this pull request? Currently, Catalyst rewrites a non-correlated EXISTS subquery to a BroadcastNestedLoopJoin, which performs poorly. This PR rewrites a non-correlated EXISTS subquery to a ScalarSubquery instead to improve performance. We rewrite ``` WHERE EXISTS (SELECT A FROM TABLE B WHERE COL1 > 10) ``` to ``` WHERE (SELECT 1 FROM (SELECT A FROM TABLE B WHERE COL1 > 10) LIMIT 1) IS NOT NULL ``` so that no join has to be built to evaluate the EXISTS expression. ### Why are the changes needed? Optimize EXISTS performance. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested. Closes #26437 from AngersZhuuuu/SPARK-29800. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-06 22:54:37 +08:00
Wenchen Fan	be4faafee4	Revert "[SPARK-23264][SQL] Make INTERVAL keyword optional when ANSI enabled" ### What changes were proposed in this pull request? Revert https://github.com/apache/spark/pull/20433 . ### Why are the changes needed? According to the SQL standard, the INTERVAL prefix is required: ``` <interval literal> ::= INTERVAL [ <sign> ] <interval string> <interval qualifier> <interval string> ::= <quote> <unquoted interval string> <quote> ``` ### Does this PR introduce any user-facing change? yes, but omitting the INTERVAL prefix is a new feature in 3.0 ### How was this patch tested? existing tests Closes #27080 from cloud-fan/interval. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-03 12:51:10 -08:00
yi.wu	e38964c442	[SPARK-29768][SQL][FOLLOW-UP] Improve handling non-deterministic filter of ScanOperation ### What changes were proposed in this pull request? 1. For `ScanOperation`, if it collects more than one filter, then all filters must be deterministic. A filter can be non-deterministic iff it is the only collected filter. 2. `FileSourceStrategy` should filter out non-deterministic filters, because a partition-related non-deterministic filter would hit a "haven't initialized" exception. ### Why are the changes needed? Strictly follow `CombineFilters`'s behavior, which does not allow combining two filters when non-deterministic predicates exist, and avoid hitting the exception for file sources. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Test exists. Closes #27073 from Ngone51/SPARK-29768-FOLLOWUP. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 21:48:14 +08:00
Kent Yao	c49388a484	[SPARK-30214][SQL] A new framework to resolve v2 commands ### What changes were proposed in this pull request? Currently, we have a v2 adapter for v1 catalog (`V2SessionCatalog`), all the table/namespace commands can be implemented via v2 APIs. Usually, a command needs to know which catalog it needs to operate, but different commands have different requirements about what to resolve. A few examples: - `DROP NAMESPACE`: only need to know the name of the namespace. - `DESC NAMESPACE`: need to lookup the namespace and get metadata, but is done during execution - `DROP TABLE`: need to do lookup and make sure it's a table not (temp) view. - `DESC TABLE`: need to lookup the table and get metadata. For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do lookup during execution if needed. For tables, mostly commands need the analyzer to do lookup. Note that, table and namespace have a difference: `DESC NAMESPACE testcat` works and describes the root namespace under `testcat`, while `DESC TABLE testcat` fails if there is no table `testcat` under the current catalog. It's because namespaces can be named [], but tables can't. The commands should explicitly specify it needs to operate on namespace or table. In this Pull Request, we introduce a new framework to resolve v2 commands: 1. parser creates logical plans or commands with `UnresolvedNamespace`/`UnresolvedTable`/`UnresolvedView`/`UnresolvedRelation`. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations) 2. analyzer converts 2.1 `UnresolvedNamespace` to `ResolvesNamespace` (contains catalog and namespace identifier) 2.2 `UnresolvedTable` to `ResolvedTable` (contains catalog, identifier and `Table`) 2.3 `UnresolvedView` to `ResolvedView` (will be added later when we migrate view commands) 2.4 `UnresolvedRelation` to relation. 3. an extra analyzer rule to match commands with `V1Table` and converts them to corresponding v1 commands. 
This will be added later when we migrate existing commands. 4. planner matches commands and converts them to the corresponding physical nodes. We also introduce brand new v2 commands, the `comment` syntaxes, to illustrate how to work with the newly added framework. ```sql COMMENT ON (DATABASE\|SCHEMA\|NAMESPACE) ... IS ... COMMENT ON TABLE ... IS ... ``` Details about the `comment` syntaxes: In the new catalog v2 design, some properties become reserved, e.g. `location`, `comment`. We are going to disallow setting reserved properties directly through dbproperties or tblproperties, to avoid conflicts with their dedicated subclauses or commands. The syntaxes follow the practice of PostgreSQL and Presto. https://www.postgresql.org/docs/12/sql-comment.html https://prestosql.io/docs/current/sql/comment.html The basic ideas of the new framework mostly came from the discussion with cloud-fan at https://github.com/apache/spark/pull/26847#issuecomment-564510061. ### Why are the changes needed? To make it easier to add new v2 commands, and easier to unify the table relation behavior. ### Does this PR introduce any user-facing change? Yes, it adds new syntax. ### How was this patch tested? Added UTs. Closes #26847 from yaooqinn/SPARK-30214. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 16:09:06 +08:00
Maxim Gekk	51373467cc	[SPARK-30412][SQL][TESTS] Eliminate warnings in Java tests regarding to deprecated Spark SQL API ### What changes were proposed in this pull request? In the PR, I propose to add the `SuppressWarnings("deprecation")` annotation to Java tests for deprecated Spark SQL APIs. ### Why are the changes needed? This eliminates the following warnings: ``` sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java Warning:Warning:line (32)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (91)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (100)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (109)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (118)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated sql/core/src/test/java/test/org/apache/spark/sql/Java8DatasetAggregatorSuite.java Warning:Warning:line (28)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (37)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (46)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (55)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated Warning:Warning:line (64)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated 
sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java Warning:Warning:line (478)java: json(org.apache.spark.api.java.JavaRDD<java.lang.String>) in org.apache.spark.sql.DataFrameReader has been deprecated ``` and highlights warnings about real problems. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites `Java8DatasetAggregatorSuite.java`, `JavaDataFrameSuite.java` and `JavaDatasetAggregatorSuite.java`. Closes #27081 from MaxGekk/eliminate-warnings-part2. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-03 13:26:48 +09:00
Maxim Gekk	a469976e6e	[SPARK-29930][SQL][FOLLOW-UP] Allow only default value to be set for removed SQL configs ### What changes were proposed in this pull request? In the PR, I propose to throw `AnalysisException` when a removed SQL config is set to non-default value. The following SQL configs removed by #26559 are marked as removed: 1. `spark.sql.fromJsonForceNullableSchema` 2. `spark.sql.legacy.compareDateTimestampInTimestamp` 3. `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation` ### Why are the changes needed? To improve user experience with Spark SQL by notifying of removed SQL configs used by users. ### Does this PR introduce any user-facing change? Yes, before the `set` command was silently ignored: ```sql spark-sql> set spark.sql.fromJsonForceNullableSchema=false; spark.sql.fromJsonForceNullableSchema false ``` after the exception should be raised: ```sql spark-sql> set spark.sql.fromJsonForceNullableSchema=false; Error in query: The SQL config 'spark.sql.fromJsonForceNullableSchema' was removed in the version 3.0.0. It was removed to prevent errors like SPARK-23173 for non-default value.; ``` ### How was this patch tested? Added new tests into `SQLConfSuite` for both cases when removed SQL configs are set to default and non-default values. Closes #27057 from MaxGekk/remove-sql-configs-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-03 10:41:30 +09:00
Kent Yao	e04309cb1f	[SPARK-30341][SQL] Overflow check for interval arithmetic operations ### What changes were proposed in this pull request? 1. For the interval arithmetic functions, e.g. `add`/`subtract`/`negative`/`multiply`/`divide`, enable overflow check when `ANSI` is on. 2. For `multiply`/`divide`, throw an exception when an overflow happens, regardless of whether `ANSI` is on or off. 3. `add`/`subtract`/`negative` stay the same for backward compatibility. 4. `divide` by 0 throws `ArithmeticException` whether `ANSI` is on or not, the same as for numeric types. 5. These behaviors match the numeric type operations fully when ANSI is on. 6. These behaviors match the numeric type operations fully when ANSI is off, except 2 and 4. ### Why are the changes needed? 1. bug fix 2. `ANSI` support ### Does this PR introduce any user-facing change? When `ANSI` is on, interval `add`/`subtract`/`negative`/`multiply`/`divide` will throw an overflow exception if any field overflows. ### How was this patch tested? Added unit tests. Closes #26995 from yaooqinn/SPARK-30341. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 02:04:20 +08:00
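The rules listed in SPARK-30341 above can be sketched in plain Scala with the JDK's exact-arithmetic helpers. This is only an illustration under stated assumptions: `Interval` is a simplified stand-in, not Spark's interval class, the `ansi` flag is a hypothetical parameter, and `multiply` takes an integer factor for brevity.

```scala
object IntervalOverflowSketch {
  // Simplified stand-in for an interval value (assumption, not Spark's class).
  final case class Interval(months: Int, days: Int, micros: Long)

  // add: overflow is checked only when `ansi` is true; otherwise the fields
  // wrap around silently, which keeps the pre-existing behavior.
  def add(a: Interval, b: Interval, ansi: Boolean): Interval =
    if (ansi) {
      Interval(
        Math.addExact(a.months, b.months),
        Math.addExact(a.days, b.days),
        Math.addExact(a.micros, b.micros))
    } else {
      Interval(a.months + b.months, a.days + b.days, a.micros + b.micros)
    }

  // multiply: always checked, independent of any ANSI setting.
  // Math.multiplyExact throws ArithmeticException on overflow.
  def multiply(a: Interval, factor: Int): Interval =
    Interval(
      Math.multiplyExact(a.months, factor),
      Math.multiplyExact(a.days, factor),
      Math.multiplyExact(a.micros, factor.toLong))
}
```

`Math.addExact` and `Math.multiplyExact` throw `ArithmeticException` on overflow, so the checked paths fail loudly while the unchecked `add` path wraps around.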
Wenchen Fan	1743d5be7f	[SPARK-30284][SQL] CREATE VIEW should keep the current catalog and namespace ### What changes were proposed in this pull request? Update CREATE VIEW command to store the current catalog and namespace instead of current database in view metadata. Also update analyzer to leverage the catalog and namespace in view metastore to resolve relations inside views. Note that, this PR still keeps the way we resolve views, by recursively calling Analyzer. This is necessary because view text may contain CTE, window spec, etc. which needs rules outside of the main resolution batch (e.g. `CTESubstitution`) ### Why are the changes needed? To resolve relations inside view correctly. ### Does this PR introduce any user-facing change? Yes, fix a bug. Now tables referred by a view can be resolved correctly even if the current catalog/namespace has been updated. ### How was this patch tested? a new test Closes #26923 from cloud-fan/view. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-03 01:41:32 +08:00
jiake	6bd5494f34	[SPARK-30403][SQL] fix the NoSuchElementException when enable AQE with InSubquery expression ### What changes were proposed in this pull request? This PR aims to fix the `NoSuchElementException` thrown when AQE is enabled with an InSubquery expression. ### Why are the changes needed? Fix exception ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a new UT. Closes #27068 from JkSelf/fixSubqueryIssue. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-02 23:11:56 +08:00
jiake	05f7b57ddc	[SPARK-30407][SQL] fix the reset metric issue when enable AQE ### What changes were proposed in this pull request? While working on [PR#26813](https://github.com/apache/spark/pull/26813), we encountered an exception [here (the number of metrics(1) is 2, not 1)](`5d870ef0bc/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala (L120)`). This PR fixes that exception. ### Why are the changes needed? Fix exception ### Does this PR introduce any user-facing change? No ### How was this patch tested? [this test with AQE enabled](`5d870ef0bc/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala (L120)`) Closes #27074 from JkSelf/resetMetricsIssue. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-02 21:55:36 +08:00
Maxim Gekk	b316d37365	[SPARK-30401][SQL] Call `requireNonStaticConf()` only once in `set()` ### What changes were proposed in this pull request? Calls of `requireNonStaticConf()` are removed from the `set()` methods in RuntimeConfig because those methods invoke `def set(key: String, value: String): Unit` where `requireNonStaticConf()` is called as well. ### Why are the changes needed? To avoid unnecessary calls of `requireNonStaticConf()`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing tests from `SQLConfSuite` Closes #27062 from MaxGekk/call-requireNonStaticConf-once. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-02 09:56:50 +09:00
Jungtaek Lim (HeartSaVioR)	e054a0af6f	[SPARK-29348][SQL][FOLLOWUP] Fix slight bug on streaming example for Dataset.observe ### What changes were proposed in this pull request? This patch fixes a small bug in the example of streaming query, as the type of observable metrics is Java Map instead of Scala Map, so to use foreach it should be converted first. ### Why are the changes needed? Described above. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Ran below query via `spark-shell`: Streaming ```scala import scala.collection.JavaConverters._ import scala.util.Random import org.apache.spark.sql.streaming.StreamingQueryListener import org.apache.spark.sql.streaming.StreamingQueryListener._ spark.streams.addListener(new StreamingQueryListener() { override def onQueryProgress(event: QueryProgressEvent): Unit = { event.progress.observedMetrics.asScala.get("my_event").foreach { row => // Trigger if the number of errors exceeds 5 percent val num_rows = row.getAs[Long]("rc") val num_error_rows = row.getAs[Long]("erc") val ratio = num_error_rows.toDouble / num_rows if (ratio > 0.05) { // Trigger alert println(s"alert! error ratio: $ratio") } } } def onQueryStarted(event: QueryStartedEvent): Unit = {} def onQueryTerminated(event: QueryTerminatedEvent): Unit = {} }) val rates = spark .readStream .format("rate") .option("rowsPerSecond", 10) .load val rand = new Random() val df = rates.map { row => (row.getLong(1), if (row.getLong(1) % 2 == 0) "error" else null) }.toDF val ds = df.selectExpr("_1 AS id", "_2 AS error") // Observe row count (rc) and error row count (erc) in the batch Dataset val observed_ds = ds.observe("my_event", count(lit(1)).as("rc"), count($"error").as("erc")) observed_ds.writeStream.format("console").start() ``` Closes #27046 from HeartSaVioR/SPARK-29348-FOLLOWUP. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-31 01:08:25 +09:00
HyukjinKwon	7079e871a7	[SPARK-30185][SQL] Implement Dataset.tail API ### What changes were proposed in this pull request? This PR proposes a `tail` API. Namely, as below: ```scala scala> spark.range(10).head(5) res1: Array[Long] = Array(0, 1, 2, 3, 4) scala> spark.range(10).tail(5) res2: Array[Long] = Array(5, 6, 7, 8, 9) ``` Implementation details will be similar to `head` but reversed: 1. Run the job against the last partition and collect rows. If this is enough, return as is. 2. If this is not enough, calculate the number of additional partitions to select based upon `spark.sql.limit.scaleUpFactor`. 3. Run more jobs against as many more partitions as calculated in 2 (in reverse order compared to head). 4. Go to 2. Note that we don't guarantee the natural order in DataFrame in general: there are cases when it's deterministic and cases when it's not. We should probably write this down as a caveat separately. ### Why are the changes needed? Many other systems support a way to take data from the end, for instance, pandas[1] and Python[2][3]. Scala collections APIs also have head and tail. On the other hand, in Spark, we only provide a way to take data from the start (e.g., DataFrame.head). This has been requested multiple times in the Spark user mailing list[4], StackOverFlow[5][6], JIRA[7] and other third party projects such as Koalas[8]. In addition, this missing API is explicitly mentioned in comparisons to another system[9] from time to time. It seems we're missing a non-trivial use case in Spark, and this motivated me to propose this API. 
[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail [2] https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line [3] https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python [4] http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html [5] https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index [6] https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe [7] https://issues.apache.org/jira/browse/SPARK-26433 [8] https://github.com/databricks/koalas/issues/343 [9] https://medium.com/chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2 ### Does this PR introduce any user-facing change? No, (new API) ### How was this patch tested? Unit tests were added and manually tested. Closes #26809 from HyukjinKwon/wip-tail. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-31 01:07:09 +09:00
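The four-step procedure described in SPARK-30185 above can be sketched in plain Scala over an in-memory list of partitions. This is a simplified illustration, not the Spark implementation: `TailSketch.tail`, the `partitions` parameter and the growth rule (multiplying the scan width by `scaleUpFactor` on each round) are assumptions made for the example.

```scala
object TailSketch {
  // Collect the last `n` rows by scanning partitions from the end,
  // widening the scan each round until enough rows are gathered.
  def tail[T](partitions: IndexedSeq[Seq[T]], n: Int, scaleUpFactor: Int = 4): Seq[T] = {
    require(scaleUpFactor >= 1, "scaleUpFactor must be positive")
    var collected = Vector.empty[T] // rows gathered so far, in original order
    var scanned = 0                 // number of trailing partitions already read
    var numToScan = 1               // step 1: start with the last partition only
    while (collected.size < n && scanned < partitions.length) {
      val upper = partitions.length - scanned
      val lower = math.max(0, upper - numToScan)
      // Step 3: read the next batch of partitions, moving towards the front.
      val batch = partitions.slice(lower, upper).flatten
      collected = batch.toVector ++ collected
      scanned += upper - lower
      // Step 2: not enough rows yet, so scan more partitions next round.
      numToScan *= scaleUpFactor
    }
    collected.takeRight(n)
  }
}
```

For example, `TailSketch.tail(Vector(Seq(0, 1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)), 5)` first reads only the last partition, finds three rows, and then widens the scan to the remaining partitions.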
Xiao Li	919d551ddb	Revert "[SPARK-29390][SQL] Add the justify_days(), justify_hours() and justify_interval() functions" This reverts commit `f926809a1f`. Closes #27032 from gatorsmile/revertSPARK-29390. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-12-29 15:25:14 -08:00
sandeep katta	16e5e79877	[SPARK-28670][SQL] create function should throw Exception if the resource is not found ## What changes were proposed in this pull request? Creating a temporary or permanent function should throw `AnalysisException` if the resource is not found. The behavior needs to be consistent across permanent and temporary functions. ## How was this patch tested? Added a UT and also tested manually. Before the fix: if the UDF resource is not present, creating a temporary function throws `AnalysisException`, whereas creating a permanent function does not; the permanent function throws `AnalysisException` only once a select operation is performed. After the fix: the resource is checked for both temporary and permanent functions, and `AnalysisException` is thrown if the UDF resource is not found. ![rt](https://user-images.githubusercontent.com/35216143/62781519-d1131580-bad5-11e9-9d58-69e65be86c03.png) Closes #25399 from sandeep-katta/funcIssue. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-28 14:35:33 +09:00
lijunqing	a2de20c0e6	[SPARK-30036][SQL] Fix: REPARTITION hint does not work with order by ### Why are the changes needed? `EnsureRequirements` adds a `ShuffleExchangeExec` (RangePartitioning) after Sort if a `RoundRobinPartitioning` is behind it. This causes 2 shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`. Example SQL ``` SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a ``` BEFORE ``` == Physical Plan == *(1) Sort [a#0 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#11] +- Exchange RoundRobinPartitioning(5), false, [id=#9] +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1] ``` AFTER ``` == Physical Plan == *(1) Sort [a#0 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=#11] +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1] ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran the test suites and added a new test for this. Closes #26946 from stczwd/RoundRobinPartitioning. Lead-authored-by: lijunqing <lijunqing@baidu.com> Co-authored-by: stczwd <qcsd2011@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-27 11:52:39 +08:00
gengjiaan	d59e7195f6	[SPARK-27986][SQL] Support ANSI SQL filter clause for aggregate expression ### What changes were proposed in this pull request? The filter clause for aggregate expressions is part of `ANSI SQL`. ``` <aggregate function> ::= COUNT <left paren> <asterisk> <right paren> [ <filter clause> ] \| <general set function> [ <filter clause> ] \| <binary set function> [ <filter clause> ] \| <ordered set function> [ <filter clause> ] \| <array aggregate function> [ <filter clause> ] \| <row pattern count function> [ <filter clause> ] ``` Some mainstream databases support this syntax. PostgreSQL: https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES For example: ``` SELECT year, count(*) FILTER (WHERE gdp_per_capita >= 40000) FROM countries GROUP BY year ``` ``` SELECT year, code, gdp_per_capita, count(*) FILTER (WHERE gdp_per_capita >= 40000) OVER (PARTITION BY year) FROM countries ``` jOOQ: https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/ Notice: 1. This PR only supports the FILTER predicate without codegen. maropu will create another PR, related to SPARK-30027, to support codegen. 2. This PR only supports the FILTER predicate without DISTINCT. I will create another PR, related to SPARK-30276, to support this. 3. This PR only supports FILTER predicates that do not reference the outer query. I created ticket SPARK-30219 to support it. 4. This PR only supports FILTER predicates that do not use IN/EXISTS predicate sub-queries. I created ticket SPARK-30220 to support it. 5. Spark SQL does not support SQL with nested aggregates. I created ticket SPARK-30182 to support it. Here is some output from running the PR in my production environment. 
``` spark-sql> desc gja_test_partition; key string NULL value string NULL other string NULL col2 int NULL # Partition Information # col_name data_type comment col2 int NULL Time taken: 0.79 s ``` ``` spark-sql> select * from gja_test_partition; a A ao 1 b B bo 1 c C co 1 d D do 1 e E eo 2 g G go 2 h H ho 2 j J jo 2 f F fo 3 k K ko 3 l L lo 4 i I io 4 Time taken: 1.75 s ``` ``` spark-sql> select count(key), sum(col2) from gja_test_partition; 12 26 Time taken: 1.848 s ``` ``` spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition; 8 Time taken: 2.926 s ``` ``` spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition; 14 Time taken: 2.087 s ``` ``` spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition; 8 14 Time taken: 2.847 s ``` ``` spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition; 12 8 26 14 Time taken: 1.787 s ``` ``` spark-sql> desc student; id int NULL name string NULL sex string NULL class_id int NULL Time taken: 0.206 s ``` ``` spark-sql> select * from student; 1 张三 man 1 2 李四 man 1 3 王五 man 2 4 赵六 man 2 5 钱小花 woman 1 6 赵九红 woman 2 7 郭丽丽 woman 2 Time taken: 0.786 s ``` ``` spark-sql> select class_id, count(id), sum(id) from student group by class_id; 1 3 8 2 4 20 Time taken: 18.783 s ``` ``` spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id; 1 2 5 2 2 13 Time taken: 3.887 s ``` ### Why are the changes needed? Add new SQL feature. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Exists UT and new UT. Closes #26656 from beliefer/support-aggregate-clause. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-26 17:41:50 +08:00
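The semantics of the last `student` query in SPARK-27986 above can be mimicked with plain Scala collections. This is only a sketch of what `FILTER (WHERE ...)` computes per group, not how Spark evaluates it; `Student` and `filteredAggregates` are hypothetical names, and the rows are the ones shown in the `select * from student` output.

```scala
object FilterClauseSketch {
  final case class Student(id: Int, name: String, sex: String, classId: Int)

  // Rows copied from the `select * from student` output in the commit message.
  val students: Seq[Student] = Seq(
    Student(1, "张三", "man", 1),
    Student(2, "李四", "man", 1),
    Student(3, "王五", "man", 2),
    Student(4, "赵六", "man", 2),
    Student(5, "钱小花", "woman", 1),
    Student(6, "赵九红", "woman", 2),
    Student(7, "郭丽丽", "woman", 2))

  // Emulates:
  //   select class_id,
  //          count(id) filter (where sex = 'man'),
  //          sum(id)   filter (where sex = 'woman')
  //   from student group by class_id
  // Each aggregate sees only the rows of its group that pass its own filter.
  // Note: SQL sum over no rows yields NULL; this sketch yields 0 instead.
  def filteredAggregates(rows: Seq[Student]): Map[Int, (Int, Int)] =
    rows.groupBy(_.classId).map { case (classId, group) =>
      val menCount = group.count(_.sex == "man")
      val womenIdSum = group.filter(_.sex == "woman").map(_.id).sum
      (classId, (menCount, womenIdSum))
    }
}
```

Calling `FilterClauseSketch.filteredAggregates(FilterClauseSketch.students)` gives `(2, 5)` for class 1 and `(2, 13)` for class 2, matching the rows `1 2 5` and `2 2 13` in the output above.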
wenfang	4d58cd77f9	[SPARK-30330][SQL] Support single quotes json parsing for get_json_object and json_tuple ### What changes were proposed in this pull request? I execute some query as` select get_json_object(ytag, '$.y1') AS y1 from t4`; SparkSQL return null but Hive return correct results. In my production environment, ytag is a json wrapped by single quotes,as follows ``` {'y1': 'shuma', 'y2': 'shuma:shouji'} {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'} {'y1': 'yule', 'y2': 'yule:mingxing'} ``` Then l realized some functions including get_json_object and json_tuple does not support single quotes json parsing. So l provide this PR to resolve the question. ### Why are the changes needed? Enabled for Hive compatibility ### Does this PR introduce any user-facing change? NO ### How was this patch tested? NEW TESTS Closes #26965 from wenfang6/enableSingleQuotesJsonForSparkSQL. Authored-by: wenfang <wenfang@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-26 11:45:31 +09:00
Kent Yao	da65a955ed	[SPARK-30266][SQL] Avoid match error and int overflow in ApproximatePercentile and Percentile ### What changes were proposed in this pull request? accuracyExpression can accept Long which may cause overflow error. accuracyExpression can accept fractions which are implicitly floored. accuracyExpression can accept null which is implicitly changed to 0. percentageExpression can accept null but cause MatchError. percentageExpression can accept ArrayType(_, nullable=true) in which the nulls are implicitly changed to zeros. ##### cases ```sql select percentile_approx(10.0, 0.5, 2147483648); -- overflow and fail select percentile_approx(10.0, 0.5, 4294967297); -- overflow but success select percentile_approx(10.0, 0.5, null); -- null cast to 0 select percentile_approx(10.0, 0.5, 1.2); -- 1.2 cast to 1 select percentile_approx(10.0, null, 1); -- scala.MatchError select percentile_approx(10.0, array(0.2, 0.4, null), 1); -- null cast to zero. ``` ##### behavior before ```sql +select percentile_approx(10.0, 0.5, 2147483648) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(2147483648L AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = -2147483648); line 1 pos 7 + +select percentile_approx(10.0, 0.5, 4294967297) +10.0 + +select percentile_approx(10.0, 0.5, null) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(NULL AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = 0); line 1 pos 7 + +select percentile_approx(10.0, 0.5, 1.2) +10.0 + +select percentile_approx(10.0, null, 1) +scala.MatchError +null + + +select percentile_approx(10.0, array(0.2, 0.4, null), 1) +[10.0,10.0,10.0] ``` ##### behavior after ```sql +select percentile_approx(10.0, 0.5, 2147483648) +10.0 + +select percentile_approx(10.0, 0.5, 4294967297) +10.0 + 
+select percentile_approx(10.0, 0.5, null) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, 0.5BD, NULL)' due to data type mismatch: argument 3 requires integral type, however, 'NULL' is of null type.; line 1 pos 7 + +select percentile_approx(10.0, 0.5, 1.2) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, 0.5BD, 1.2BD)' due to data type mismatch: argument 3 requires integral type, however, '1.2BD' is of decimal(2,1) type.; line 1 pos 7 + +select percentile_approx(10.0, null, 1) +java.lang.IllegalArgumentException +The value of percentage must be be between 0.0 and 1.0, but got null + +select percentile_approx(10.0, array(0.2, 0.4, null), 1) +java.lang.IllegalArgumentException +Each value of the percentage array must be be between 0.0 and 1.0, but got [0.2,0.4,null] ``` ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? yes, fix some improper usages of percentile_approx as cases list above ### How was this patch tested? add ut Closes #26905 from yaooqinn/SPARK-30266. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-25 20:03:26 +08:00
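The two overflow cases in the commit above ("overflow and fail" versus "overflow but success") follow from narrowing a 64-bit value to 32 bits. The sketch below is plain Python that mimics that narrowing; `to_int32` is a hypothetical helper for illustration, not a Spark function.

```python
def to_int32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed two's-complement int."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


# 2147483648 is Int.MaxValue + 1: it wraps to a negative accuracy,
# which the old check rejected as "not a positive integer literal".
print(to_int32(2147483648))

# 4294967297 is 2**32 + 1: it wraps to a small positive accuracy,
# so the old code silently accepted it.
print(to_int32(4294967297))
```

This matches the "current value = -2147483648" text in the old error message, and explains why the second query returned a result despite the overflow.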
manuzhang	ef6f9e9668	[SPARK-30331][SQL] Set isFinalPlan to true before posting the final AdaptiveSparkPlan event ### What changes were proposed in this pull request? Set `isFinalPlan=true` before posting the final AdaptiveSparkPlan event (`SparkListenerSQLAdaptiveExecutionUpdate`). ### Why are the changes needed? Otherwise, any attempt to listen for the final event by pattern matching on `isFinalPlan=true` would fail. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? All tests in `AdaptiveQueryExecSuite` are extended with a verification that a `SparkListenerSQLAdaptiveExecutionUpdate` event with `isFinalPlan=true` exists. Closes #26983 from manuzhang/spark-30331. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-25 19:08:24 +08:00
Pavithra Ramachandran	57ca95246c	[SPARK-29505][SQL] Make DESC EXTENDED <table name> <column name> case insensitive ### What changes were proposed in this pull request? When querying with `DESC`, if the column name is not entered exactly as it was given at table creation, the column stats are wrong. Fetching of column stats has been made case insensitive. ### Why are the changes needed? Commands such as ANALYZE support case-insensitive retrieval of column data. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test has been rewritten and tested. Closes #26927 from PavithraRamachandran/desc_caseinsensitive. Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-25 08:57:34 +09:00
wangguangxin.cn	640dcc435b	[SPARK-28332][SQL] Reserve init value -1 only when do min max statistics in SQLMetrics ### What changes were proposed in this pull request? This is an alternative solution to https://github.com/apache/spark/pull/25095. SQLMetrics uses -1 as the init value as a workaround for [SPARK-11013](https://issues.apache.org/jira/browse/SPARK-11013). However, it may cause some bad cases, as reported in https://github.com/apache/spark/pull/26726. In fact, we only need to reserve -1 when doing min max statistics in `SQLMetrics.stringValue`, so that we can filter out accumulators that were never initialized. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs Closes #26899 from WangGuangxin/sqlmetrics. Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-23 13:13:35 +08:00
HyukjinKwon	e5abbab0ed	[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC ### What changes were proposed in this pull request? This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation. - `recursiveFileLookup` at file sources: https://github.com/apache/spark/pull/24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627)) - `pathGlobFilter` at file sources: https://github.com/apache/spark/pull/24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990)) - `mergeSchema` at ORC: https://github.com/apache/spark/pull/24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412)) Note that `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete. ### Why are the changes needed? To document available options in sources properly. ### Does this PR introduce any user-facing change? In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text\|orc\|parquet\|json\|csv)` and `DataStreamReader.(text\|orc\|parquet\|json\|csv)`. ### How was this patch tested? Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only: ```bash $ ls -al tmp ... -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc ``` ```python >>> spark.read.text("tmp", pathGlobFilter="*c").show() ``` ``` +-----+ \|value\| +-----+ \| ac\| \| cc\| +-----+ ``` Closes #26958 from HyukjinKwon/doc-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-23 09:57:42 +09:00
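The manual test in the commit above keeps only the files whose names match `*c`. The sketch below reproduces that selection in plain Python with the standard `fnmatch` module. This is an assumption made for illustration: Spark applies its own glob matching to the file names, and `glob_filter` is a hypothetical helper, not part of PySpark.

```python
from fnmatch import fnmatchcase

# File names from the `ls -al tmp` listing in the commit message.
files = ["aa", "ab", "ac", "cc"]


def glob_filter(names, pattern):
    """Keep the names that match the glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(glob_filter(files, "*c"))
```

The two names that survive, `ac` and `cc`, are the two rows shown in the `show()` output of the commit message.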
gengjiaan	a38bf7e051	[SPARK-28083][SQL][TEST][FOLLOW-UP] Enable LIKE ... ESCAPE test cases ### What changes were proposed in this pull request? This PR is a follow-up to https://github.com/apache/spark/pull/25001 ### Why are the changes needed? No ### Does this PR introduce any user-facing change? No ### How was this patch tested? Passes Jenkins with the newly updated test files. Closes #26949 from beliefer/uncomment-like-escape-tests. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-21 14:40:07 -08:00
Kazuaki Ishizaki	f31d9a629b	[MINOR][DOC][SQL][CORE] Fix typo in document and comments ### What changes were proposed in this pull request? Fixed typo in `docs` directory and in other directories 1. Find typo in `docs` and apply fixes to files in all directories 2. Fix `the the` -> `the` ### Why are the changes needed? Better readability of documents ### Does this PR introduce any user-facing change? No ### How was this patch tested? No test needed Closes #26976 from kiszk/typo_20191221. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-21 14:08:58 -08:00
Wenchen Fan	cd84400271	[SPARK-29906][SQL][FOLLOWUP] Update the final plan in UI for AQE ### What changes were proposed in this pull request? a followup of https://github.com/apache/spark/pull/26576, which mistakenly removes the UI update of the final plan. ### Why are the changes needed? fix mistake. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26968 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-21 09:56:15 -08:00
Sean Owen	7dff3b125d	[SPARK-30272][SQL][CORE] Remove usage of Guava that breaks in 27; replace with workalikes ### What changes were proposed in this pull request? Remove usages of Guava that no longer work in Guava 27, and replace with workalikes. I'll comment on key types of changes below. ### Why are the changes needed? Hadoop 3.2.1 uses Guava 27, so this helps us avoid problems running on Hadoop 3.2.1+ and generally lowers our exposure to Guava. ### Does this PR introduce any user-facing change? Should not be, but see notes below on hash codes and toString. ### How was this patch tested? Existing tests will verify whether these changes break anything for Guava 14. I manually built with an updated version and it compiles with Guava 27; tests running manually locally now. Closes #26911 from srowen/SPARK-30272. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-20 08:55:04 -06:00
Prakhar Jain	07b04c4c72	[SPARK-29938][SQL] Add batching support in Alter table add partition flow ### What changes were proposed in this pull request? Add batching support in Alter table add partition flow. Also calculate new partition sizes faster by doing listing in parallel. ### Why are the changes needed? This PR splits the single createPartitions() call in the AlterTableAddPartition flow into smaller batches, which could prevent - SocketTimeoutException: Adding thousands of partitions in the Hive metastore itself takes a lot of time. Because of this, the Hive client fails with SocketTimeoutException. - Hive metastore from OOM (caused by millions of partitions). It will also try to gather stats (total size of all files in all new partitions) faster by listing the new partition paths in parallel. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Also tested on a cluster in HDI with 15000 partitions with remote metastore server. Without batching, the operation fails with SocketTimeoutException; with batching, it finishes in 25 mins. Closes #26569 from prakharjain09/add_partition_batching_r1. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-20 08:54:14 -06:00
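The batching idea in the commit above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the Spark implementation: `batches` and `create_partitions` are hypothetical names, and the batch size of 100 is an arbitrary value chosen for the example.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


calls = []  # records the size of each simulated metastore call


def create_partitions(partition_specs):
    """Hypothetical stand-in for a single metastore createPartitions() call."""
    calls.append(len(partition_specs))


# 15000 partitions, as in the cluster test described in the commit message.
specs = [f"part={i}" for i in range(15000)]
for batch in batches(specs, 100):
    create_partitions(batch)

print(len(calls), max(calls))
```

Each call now carries a bounded number of partitions, so the duration and memory footprint of a single metastore request no longer grow with the total number of partitions being added.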
Niranjan Artal	0d2ef3ae2b	[SPARK-30300][SQL][WEB-UI] Fix updating the UI max value string when driver updates the same metric id as the tasks ### What changes were proposed in this pull request? In this PR, for a given metric id we check whether the driver-side accumulator's value is greater than the max of all stages' values. If it is, we remove that entry from the HashMap. As a result, "driver" is displayed on the UI for this metric (as the driver would have the maximum value). ### Why are the changes needed? This PR fixes https://issues.apache.org/jira/browse/SPARK-30300. Currently the driver's metric value is not compared while calculating the max. ### Does this PR introduce any user-facing change? For the metrics where the driver's value is greater than the max of all stages, this is the change. Previous : (min, median, max (stageId 0( attemptId 1): taskId 2)) Now: (min, median, max (driver)) ### How was this patch tested? Ran unit tests. Closes #26941 from nartal1/SPARK-30300. Authored-by: Niranjan Artal <nartal@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-12-20 07:29:28 -06:00
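The comparison described in the commit above can be sketched as follows. This is a plain-Python illustration, not the Spark UI code: `max_label` is a hypothetical helper, and the label text only approximates the strings quoted in the commit message.

```python
def max_label(stage_max, driver_value):
    """Pick the label shown next to the max value of a metric.

    stage_max: (value, stage_id, attempt_id, task_id) for the largest task-side
               update across all stages, or None if no task reported the metric.
    driver_value: value reported by the driver for the same metric id, or None.
    """
    if driver_value is not None and (stage_max is None or driver_value > stage_max[0]):
        # The driver holds the maximum, so the task coordinates are dropped.
        return "driver"
    if stage_max is None:
        return None
    _, stage_id, attempt_id, task_id = stage_max
    return f"stageId {stage_id} (attemptId {attempt_id}): taskId {task_id}"


print(max_label((10, 0, 1, 2), 5))
print(max_label((10, 0, 1, 2), 20))
```

Before the fix the driver's value was never part of this comparison, so the UI could attribute the max to a task even when the driver had reported a larger value.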
Kent Yao	12249fcdc7	[SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types ### What changes were proposed in this pull request? When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using `toString`. This makes the result wrong before the default transition ’1582-10-15‘. https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712 cases to reproduce: ```sql +-- !query 47 +select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null) +-- !query 47 schema +struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>> +-- !query 47 output +[1582-10-03,1582-10-04,1582-10-15,null] + + +-- !query 48 +select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15' +-- !query 48 schema +struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date> +-- !query 48 output +1582-10-13 1582-10-14 1582-10-15 ``` other refencences https://github.com/h2database/h2database/issues/831 ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? yes, complex types containing datetimes in `spark-sql `script and thrift server can result same as self-contained spark app or `spark-shell` script ### How was this patch tested? add uts Closes #26942 from yaooqinn/SPARK-30301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-20 19:21:43 +08:00
jiake	a296d15235	[SPARK-30291] catch the exception when doing materialize in AQE ### What changes were proposed in this pull request? AQE need catch the exception when doing materialize. And then user can get more information about the exception when enable AQE. ### Why are the changes needed? provide more cause about the exception when doing materialize ### Does this PR introduce any user-facing change? Before this PR, the error in the added unit test is java.lang.RuntimeException: Invalid bucket file file:///${SPARK_HOME}/assembly/spark-warehouse/org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite/bucketed_table/part-00000-3551343c-d003-4ada-82c8-45c712a72efe-c000.snappy.parquet After this PR, the error in the added unit test is: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. ### How was this patch tested? Add a new ut Closes #26931 from JkSelf/catchMoreException. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-12-20 00:23:26 -08:00
Wenchen Fan	18e8d1d5b2	[SPARK-30307][SQL] remove ReusedQueryStageExec ### What changes were proposed in this pull request? When we reuse exchanges in AQE, what we produce is `ReuseQueryStage(QueryStage(Exchange))`. This PR changes it to `QueryStage(ReusedExchange(Exchange))`. This PR also fixes an issue in `LocalShuffleReaderExec.outputPartitioning`. We can only preserve the partitioning if we read one mapper per task. ### Why are the changes needed? `QueryStage` is light-weighted and we don't need to reuse its instance. What we really care is to reuse the exchange instance, which has heavy states (e.g. broadcasted valued, submitted map stage). To simplify the framework, we should use the existing `ReusedExchange` node to do the reuse work, instead of creating a new node. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26952 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-12-19 20:56:06 -08:00
Aman Omer	726f6d3e3c	[SPARK-30184][SQL] Implement a helper method for aliasing functions ### What changes were proposed in this pull request? This PR is to use `expressionWithAlias` for remaining functions for which alias name can be used. Remaining functions are: `Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp` PR https://github.com/apache/spark/pull/26712 introduced `expressionWithAlias` ### Why are the changes needed? Error message is wrong when alias name is used for above mentioned functions. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually Closes #26808 from amanomer/fncAlias. Lead-authored-by: Aman Omer <amanomer1996@gmail.com> Co-authored-by: Aman Omer <40591404+amanomer@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-20 12:49:16 +08:00
Jungtaek Lim (HeartSaVioR)	ab87bfd087	[SPARK-29450][SS] Measure the number of output rows for streaming aggregation with append mode ### What changes were proposed in this pull request? This patch addresses missing metric, the number of output rows for streaming aggregation with append mode. Other modes are correctly measuring it. ### Why are the changes needed? Without the patch, the value for such metric is always 0. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test added. Also manually tested with below query: > query ``` import spark.implicits._ spark.conf.set("spark.sql.shuffle.partitions", "5") val df = spark.readStream .format("rate") .option("rowsPerSecond", 1000) .load() .withWatermark("timestamp", "5 seconds") .selectExpr("timestamp", "mod(value, 100) as mod", "value") .groupBy(window($"timestamp", "10 seconds"), $"mod") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) val query = df .writeStream .format("memory") .option("queryName", "test") .outputMode("append") .start() query.awaitTermination() ``` > before the patch ![screenshot-before-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023217-58d7bc80-0a01-11ea-8cac-40f1cced6d16.png) > after the patch ![screenshot-after-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023221-5c6b4380-0a01-11ea-8a66-7bf1b7d09fc7.png) Closes #26104 from HeartSaVioR/SPARK-29450. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-19 18:20:41 +09:00
Xingbo Jiang	2af5237fe8	[SPARK-29918][SQL][FOLLOWUP][TEST] Fix arrayOffset in `RecordBinaryComparatorSuite` ### What changes were proposed in this pull request? As mentioned in https://github.com/apache/spark/pull/26548#pullrequestreview-334345333, some test cases in `RecordBinaryComparatorSuite` use a fixed arrayOffset when writing to long arrays, this could lead to weird stuff including crashing with a SIGSEGV. This PR fix the problem by computing the arrayOffset based on `Platform.LONG_ARRAY_OFFSET`. ### How was this patch tested? Tested locally. Previously, when we try to add `System.gc()` between write into long array and compare by RecordBinaryComparator, there is a chance to hit JVM crash with SIGSEGV like: ``` # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007efc66970bcb, pid=11831, tid=0x00007efc0f9f9700 # # JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) # Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops) # Problematic frame: # V [libjvm.so+0x5fbbcb] # # Core dump written. Default location: /home/jenkins/workspace/sql/core/core or core.11831 # # An error report file with more information is saved as: # /home/jenkins/workspace/sql/core/hs_err_pid11831.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # ``` After the fix those test cases didn't crash the JVM anymore. Closes #26939 from jiangxb1987/rbc. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-19 17:01:40 +08:00
Gengliang Wang	ab8eb86a77	Revert "[SPARK-29629][SQL] Support typed integer literal expression" This reverts commit `8e667db5d8`. Closes #26940 from gengliangwang/revert_Spark_29629. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-19 16:34:27 +09:00
Jalpan Randeri	f15eee18cc	[SPARK-29493][SQL] Arrow MapType support ### What changes were proposed in this pull request? This pull request adds support for Arrow MapType to Spark SQL. ### Why are the changes needed? Without this change, users of Spark are not able to query data in Spark if one of the columns is stored as a map and the Apache Arrow execution mode is preferred by the user. More info: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-29493 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Introduced a few unit tests around map type in the existing Arrow test suite. Closes #26512 from jalpan-randeri/feature-arrow-java-map-type. Authored-by: Jalpan Randeri <randerij@amazon.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-18 23:59:27 +09:00
Kent Yao	d38f816748	[MINOR][SQL][DOC] Fix some format issues in Dataset API Doc ### What changes were proposed in this pull request? fix listing up format issues in Dataset API Doc (scala & java) ### Why are the changes needed? improve doc ### Does this PR introduce any user-facing change? yes, API doc changing ### How was this patch tested? no Closes #26922 from yaooqinn/datasetdoc. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-18 15:25:40 +09:00
Aman Omer	297f406425	[SPARK-29600][SQL] ArrayContains function may return incorrect result for DecimalType ### What changes were proposed in this pull request? Use `TypeCoercion.findWiderTypeForTwo()` instead of `TypeCoercion.findTightestCommonType()` while preprocessing `inputTypes` in `ArrayContains`. ### Why are the changes needed? `TypeCoercion.findWiderTypeForTwo()` also handles cases for DecimalType. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Test cases to be added. Closes #26811 from amanomer/29600. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-18 01:30:28 +08:00
Zhenhua Wang	18431c7baa	[SPARK-30269][SQL] Should use old partition stats to decide whether to update stats when analyzing partition ### What changes were proposed in this pull request? It's an obvious bug: currently when analyzing partition stats, we use old table stats to compare with newly computed stats to decide whether it should update stats or not. ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? add new tests Closes #26908 from wzhfy/failto_update_part_stats. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-17 22:21:26 +09:00
Kent Yao	bf7215c510	[SPARK-30066][SQL][FOLLOWUP] Remove size field for interval column cache ### What changes were proposed in this pull request? A followup for #26699, clear the size field for interval column cache, which is needless and can reduce the memory cost. ### Why are the changes needed? followup ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing ut. Closes #26906 from yaooqinn/SPARK-30066-f. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-17 15:36:21 +09:00
Terry Kim	e75d9afb2f	[SPARK-30094][SQL] Apply current namespace for the single-part table name ### What changes were proposed in this pull request? This PR applies the current namespace for the single-part table name if the current catalog is a non-session catalog. Note that the reason the current namespace is not applied for the session catalog is that the single-part name could be referencing a temp view which doesn't belong to any namespaces. The empty namespace for a table inside the session catalog is resolved by the session catalog implementation. ### Why are the changes needed? It's fixing the following bug where the current namespace is not respected: ``` sql("CREATE TABLE testcat.ns.t USING foo AS SELECT 1 AS id") sql("USE testcat.ns") sql("SHOW CURRENT NAMESPACE").show +-------+---------+ \|catalog\|namespace\| +-------+---------+ \|testcat\| ns\| +-------+---------+ // `t` is not resolved since the current namespace `ns` is not used. sql("DESCRIBE t").show Failed to analyze query: org.apache.spark.sql.AnalysisException: Table not found: t;; ``` ### Does this PR introduce any user-facing change? Yes, the above `DESCRIBE` command will succeed. ### How was this patch tested? Added tests. Closes #26894 from imback82/current_namespace. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-17 11:13:27 +08:00
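The resolution rule described in the commit above can be sketched as follows. This is a plain-Python illustration of the rule as stated in the message, not the Spark analyzer: `qualify` is a hypothetical helper, and the session catalog name `spark_catalog` is an assumption made for the example.

```python
SESSION_CATALOG = "spark_catalog"  # assumed name of the session catalog


def qualify(name_parts, current_catalog, current_namespace):
    """Return (catalog, namespace, table) for a single-part table name."""
    if len(name_parts) != 1:
        raise ValueError("this sketch only covers single-part names")
    table = name_parts[0]
    if current_catalog == SESSION_CATALOG:
        # The name may refer to a temp view, which belongs to no namespace,
        # so the namespace stays empty and the session catalog resolves it.
        return (current_catalog, [], table)
    # Non-session catalog: apply the current namespace.
    return (current_catalog, list(current_namespace), table)


# After `USE testcat.ns`, `DESCRIBE t` should look up testcat.ns.t.
print(qualify(["t"], "testcat", ["ns"]))
```

With this rule the `DESCRIBE t` call from the commit message resolves to the table created as `testcat.ns.t` instead of failing with "Table not found".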
Yuming Wang	696288f623	[INFRA] Reverts commit `56dcd79` and `c216ef1` ### What changes were proposed in this pull request? 1. Revert "Preparing development version 3.0.1-SNAPSHOT": `56dcd79` 2. Revert "Preparing Spark release v3.0.0-preview2-rc2": `c216ef1` ### Why are the changes needed? Shouldn't change master. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? manual test: https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master Closes #26915 from wangyum/revert-master. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-12-16 19:57:44 -07:00
Yuming Wang	56dcd79992	Preparing development version 3.0.1-SNAPSHOT	2019-12-17 01:57:27 +00:00
Yuming Wang	c216ef1d03	Preparing Spark release v3.0.0-preview2-rc2	2019-12-17 01:57:21 +00:00
Maxim Gekk	b03ce63c05	[SPARK-30258][TESTS] Eliminate warnings of deprecated Spark APIs in tests ### What changes were proposed in this pull request? In the PR, I propose to move all tests that use deprecated Spark APIs to separate test classes, and add the annotation: ```scala deprecated("This test suite will be removed.", "3.0.0") ``` The annotation suppress warnings from already deprecated methods and classes. ### Why are the changes needed? The warnings about deprecated Spark APIs in tests does not indicate any issues because the tests use such APIs intentionally. Eliminating the warnings allows to highlight other warnings that could show real problems. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites and by - DeprecatedAvroFunctionsSuite - DeprecatedDateFunctionsSuite - DeprecatedDatasetAggregatorSuite - DeprecatedStreamingAggregationSuite - DeprecatedWholeStageCodegenSuite Closes #26885 from MaxGekk/eliminate-deprecate-warnings. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-16 18:24:32 -06:00
Niranjan Artal	dddfeca175	[SPARK-30209][SQL][WEB-UI] Display stageId, attemptId and taskId for max metrics in Spark UI ### What changes were proposed in this pull request? SPARK-30209 discusses about adding additional metrics such as stageId, attempId and taskId for max metrics. We have the data required to display in LiveStageMetrics. Need to capture and pass these metrics to display on the UI. To minimize memory used for variables, we are saving maximum of each metric id per stage. So per stage additional memory usage is (#metrics * 4 * sizeof(Long)). Then max is calculated for each metric id among all stages which is passed in the stringValue method. Memory used is minimal. Ran the benchmark for runtime. Stage.Proc time has increased to around 1.5-2.5x but the Aggregate time has decreased. ### Why are the changes needed? These additional metrics stageId, attemptId and taskId could help in debugging the jobs quicker. For a given operator, it will be easy to identify the task which is taking maximum time to complete from the SQL tab itself. ### Does this PR introduce any user-facing change? Yes. stageId, attemptId and taskId is shown only for executor side metrics. For driver metrics, "(driver)" is displayed on UI. ![image (3)](https://user-images.githubusercontent.com/50492963/70763041-929d9980-1d07-11ea-940f-88ac6bdce9b5.png) "Driver" ![image (4)](https://user-images.githubusercontent.com/50492963/70763043-94675d00-1d07-11ea-95ab-3478728cb435.png) ### How was this patch tested? Manually tested, ran benchmark script for runtime. Closes #26843 from nartal1/SPARK-30209. Authored-by: Niranjan Artal <nartal@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-12-16 15:27:34 -06:00
HyukjinKwon	23b1312324	[SPARK-30200][DOCS][FOLLOW-UP] Add documentation for explain(mode: String) ### What changes were proposed in this pull request? This PR adds the documentation of the new `mode` added to `Dataset.explain`. ### Why are the changes needed? To let users know the new modes. ### Does this PR introduce any user-facing change? No (doc-only change). ### How was this patch tested? Manually built the doc: ![Screen Shot 2019-12-16 at 3 34 28 PM](https://user-images.githubusercontent.com/6477701/70884617-d64f1680-2019-11ea-9336-247ade7f8768.png) Closes #26903 from HyukjinKwon/SPARK-30200-doc. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-16 21:35:37 +09:00
Wenchen Fan	fdcd0e71b9	[SPARK-30192][SQL] support column position in DS v2 ### What changes were proposed in this pull request? update DS v2 API to support add/alter column with column position ### Why are the changes needed? We have a parser rule for column position, but we fail the query if it's specified, because the builtin catalog can't support add/alter column with column position. Since we have the catalog plugin API now, we should let the catalog implementation to decide if it supports column position or not. ### Does this PR introduce any user-facing change? not yet ### How was this patch tested? new tests Closes #26817 from cloud-fan/parser. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-16 18:55:17 +08:00
Boris Boutkov	3bf5498b4a	[MINOR][DOCS] Fix documentation for slide function ### What changes were proposed in this pull request? This PR proposes to fix documentation for slide function. Fixed the spacing issue and added some parameter related info. ### Why are the changes needed? Documentation improvement ### Does this PR introduce any user-facing change? No (doc-only change). ### How was this patch tested? Manually tested by documentation build. Closes #26896 from bboutkov/pyspark_doc_fix. Authored-by: Boris Boutkov <boris.boutkov@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-16 16:29:09 +09:00
HyukjinKwon	0a2afcec7d	[SPARK-30200][SQL][FOLLOW-UP] Expose only explain(mode: String) in Scala side, and clean up related codes ### What changes were proposed in this pull request? This PR mainly targets: 1. Expose only explain(mode: String) on the Scala side 2. Clean up related code - Hide `ExplainMode` under the private `execution` package. No particular reason other than that `ExplainUtils` exists there - Use the `case object` + `trait` pattern in `ExplainMode` to mirror `ParseMode`. - Move `Dataset.toExplainString` to `QueryExecution.explainString` to mirror `QueryExecution.simpleString`, and deduplicate the code in `ExplainCommand`. - Use `ExplainMode` in `ExplainCommand` too. - Add `explainString` to `PythonSQLUtils` to avoid unexpected PySpark test failures while refactoring the Scala side. ### Why are the changes needed? To minimise exposed APIs, deduplicate, and clean up. ### Does this PR introduce any user-facing change? `Dataset.explain(mode: ExplainMode)` will be removed (which only exists in master). ### How was this patch tested? Manually tested, and existing tests should cover it. Closes #26898 from HyukjinKwon/SPARK-30200-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-16 14:42:35 +09:00
Maxim Gekk	67b644c3d7	[SPARK-30166][SQL] Eliminate compilation warnings in JSONOptions ### What changes were proposed in this pull request? In the PR, I propose to replace `setJacksonOptions()` in `JSONOptions` by `buildJsonFactory()` which builds `JsonFactory` using `JsonFactoryBuilder`. This allows to avoid using deprecated feature configurations from `JsonParser.Feature`. ### Why are the changes needed? - The changes eliminate the following compilation warnings in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala`: ``` Warning:Warning:line (137)Java enum ALLOW_NUMERIC_LEADING_ZEROS in Java enum Feature is deprecated: see corresponding Javadoc for more information. factory.configure(JsonParser.Feature.ALLOW_NUMERIC_LEADING_ZEROS, allowNumericLeadingZeros) Warning:Warning:line (138)Java enum ALLOW_NON_NUMERIC_NUMBERS in Java enum Feature is deprecated: see corresponding Javadoc for more information. factory.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, allowNonNumericNumbers) Warning:Warning:line (139)Java enum ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER in Java enum Feature is deprecated: see corresponding Javadoc for more information. factory.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, Warning:Warning:line (141)Java enum ALLOW_UNQUOTED_CONTROL_CHARS in Java enum Feature is deprecated: see corresponding Javadoc for more information. factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars) ``` - This put together building JsonFactory and set options from JSONOptions. So, we will not forget to call `setJacksonOptions` in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By `JsonSuite`, `JsonFunctionsSuite`, `JsonExpressionsSuite`. Closes #26797 from MaxGekk/eliminate-warning. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-15 08:45:57 -06:00
fuwhu	4cbef8988e	[SPARK-30259][SQL] Fix CREATE TABLE behavior when session catalog is specified explicitly ### What changes were proposed in this pull request? Fix a bug: CREATE TABLE throws an error when the session catalog is specified explicitly. ### Why are the changes needed? Currently, Spark throws an error when the session catalog is specified explicitly in the "CREATE TABLE" and "CREATE TABLE AS SELECT" commands, e.g. > CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i; the error message looks like this: > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl > 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog > 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog > 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException > Error in query: Database 'spark_catalog' not found; ### Does this PR introduce any user-facing change? Yes, after this PR, "CREATE TABLE" and "CREATE TABLE AS SELECT" complete successfully when the session catalog "spark_catalog" is specified explicitly. ### How was this patch tested? New unit tests added. Closes #26887 from fuwhu/SPARK-30259. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-14 15:36:14 -08:00
Takeshi Yamamuro	f483a13d4a	[SPARK-30231][SQL][PYTHON][FOLLOWUP] Make error messages clear in PySpark df.explain ### What changes were proposed in this pull request? This pr is a followup of #26861 to address minor comments from viirya. ### Why are the changes needed? For better error messages. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested. Closes #26886 from maropu/SPARK-30231-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-14 14:26:50 -08:00
Kent Yao	d3ec8b1735	[SPARK-30066][SQL] Support columnar execution on interval types ### What changes were proposed in this pull request? Columnar execution support for interval types. ### Why are the changes needed? To support caching tables with interval columns; it improves performance too. ### Does this PR introduce any user-facing change? Yes, cache table will accept interval columns. ### How was this patch tested? Added UT. Closes #26699 from yaooqinn/SPARK-30066. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-14 13:10:46 -08:00
Burak Yavuz	4c37a8a3f4	[SPARK-30143][SS] Add a timeout on stopping a streaming query ### What changes were proposed in this pull request? Add a timeout configuration for StreamingQuery.stop() ### Why are the changes needed? The stop() method on a Streaming Query awaits the termination of the stream execution thread. However, the stream execution thread may block forever depending on the streaming source implementation (like in Kafka, which runs UninterruptibleThreads). This causes control flow applications to hang indefinitely as well. We'd like to introduce a timeout to stop the execution thread, so that the control flow thread can decide to do an action if a timeout is hit. ### Does this PR introduce any user-facing change? By default, no. If the timeout configuration is set, then a TimeoutException will be thrown if a stream cannot be stopped within the given timeout. ### How was this patch tested? Unit tests Closes #26771 from brkyvz/stopTimeout. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-12-13 15:16:00 -08:00
Terry Kim	ac9b1881a2	[SPARK-30248][SQL] Fix DROP TABLE behavior when session catalog name is provided in the identifier ### What changes were proposed in this pull request? If a table name is qualified with session catalog name `spark_catalog`, the `DROP TABLE` command fails. For example, the following ``` sql("CREATE TABLE tbl USING json AS SELECT 1 AS i") sql("DROP TABLE spark_catalog.tbl") ``` fails with: ``` org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog' not found; at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42) at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40) at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:45) at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:336) ``` This PR correctly resolves `spark_catalog` as a catalog. ### Why are the changes needed? It's fixing a bug. ### Does this PR introduce any user-facing change? Yes, now, the `spark_catalog.tbl` in the above example is dropped as expected. ### How was this patch tested? Added a test. Closes #26878 from imback82/fix_drop_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-13 21:45:35 +08:00
Takeshi Yamamuro	64c7b94d64	[SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain ### What changes were proposed in this pull request? This pr intends to support explain modes implemented in #26829 for PySpark. ### Why are the changes needed? For better debugging info. in PySpark dataframes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UTs. Closes #26861 from maropu/ExplainModeInPython. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-13 17:44:23 +09:00
Jungtaek Lim (HeartSaVioR)	94eb66593a	[SPARK-30227][SQL] Add close() on DataWriter interface ### What changes were proposed in this pull request? This patch adds close() method to the DataWriter interface, which will become the place to cleanup the resource. ### Why are the changes needed? The lifecycle of DataWriter instance ends at either commit() or abort(). That makes datasource implementors to feel they can place resource cleanup in both sides, but abort() can be called when commit() fails; so they have to ensure they don't do double-cleanup if cleanup is not idempotent. ### Does this PR introduce any user-facing change? Depends on the definition of user; if they're developers of custom DSv2 source, they have to add close() in their DataWriter implementations. It's OK to just add close() with empty content as they should have already dealt with resource cleanup in commit/abort, but they would love to migrate the resource cleanup logic to close() as it avoids double cleanup. If they're just end users using the provided DSv2 source (regardless of built-in/3rd party), no change. ### How was this patch tested? Existing tests. Closes #26855 from HeartSaVioR/SPARK-30227. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-13 16:12:41 +08:00
Pablo Langa	cb6d2b3f83	[SPARK-30040][SQL] DROP FUNCTION should do multi-catalog resolution ### What changes were proposed in this pull request? Add DropFunctionStatement and make DROP FUNCTION go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing DROP FUNCTION namespace.function ### Does this PR introduce any user-facing change? Yes. When running DROP FUNCTION namespace.function Spark fails the command if the current catalog is set to a v2 catalog. ### How was this patch tested? Unit tests. Closes #26854 from planga82/feature/SPARK-30040_DropFunctionV2Catalog. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-12 15:15:54 -08:00
Anton Okolnychyi	5114389aef	[SPARK-30107][SQL] Expose nested schema pruning to all V2 sources ### What changes were proposed in this pull request? This PR exposes the existing logic for nested schema pruning to all sources, which is in line with the description of `SupportsPushDownRequiredColumns` . Right now, `SchemaPruning` (rule, not helper utility) is applied in the optimizer directly on certain instances of `Table` ignoring `SupportsPushDownRequiredColumns` that is part of `ScanBuilder`. I think it would be cleaner to perform schema pruning and filter push-down in one place. Therefore, this PR moves all the logic into `V2ScanRelationPushDown`. ### Why are the changes needed? This change allows all V2 data sources to benefit from nested column pruning (if they support it). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This PR mostly relies on existing tests. On top, it adds one test to verify that top-level schema pruning works as well as one test for predicates with subqueries. Closes #26751 from aokolnychyi/nested-schema-pruning-ds-v2. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-12-12 13:40:46 -08:00
HyukjinKwon	cc087a3ac5	[SPARK-30162][SQL] Add PushedFilters to metadata in Parquet DSv2 implementation ### What changes were proposed in this pull request? This PR proposes to add `PushedFilters` into metadata to show the pushed filters in Parquet DSv2 implementation. In case of ORC, it is already added at https://github.com/apache/spark/pull/24719/files#diff-0fc82694b20da3cd2cbb07206920eef7R62-R64 ### Why are the changes needed? In order for users to be able to debug, and to match with ORC. ### Does this PR introduce any user-facing change? ```scala spark.range(10).write.mode("overwrite").parquet("/tmp/foo") spark.read.parquet("/tmp/foo").filter("5 > id").explain() ``` Before: ``` == Physical Plan == (1) Project [id#20L] +- (1) Filter (isnotnull(id#20L) AND (5 > id#20L)) +- (1) ColumnarToRow +- BatchScan[id#20L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint> ``` After:* ``` == Physical Plan == (1) Project [id#13L] +- (1) Filter (isnotnull(id#13L) AND (5 > id#13L)) +- *(1) ColumnarToRow +- BatchScan[id#13L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint>, PushedFilters: [IsNotNull(id), LessThan(id,5)] ``` ### How was this patch tested? Unittest were added and manually tested. Closes #26857 from HyukjinKwon/SPARK-30162. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-12 08:33:33 -08:00
Aaron Lau	fd39b6db34	[SQL] Typo in HashedRelation error ### What changes were proposed in this pull request? Fixed typo in exception message of HashedRelations ### Why are the changes needed? Better exception messages ### Does this PR introduce any user-facing change? No ### How was this patch tested? No tests needed Closes #26822 from aaron-lau/master. Authored-by: Aaron Lau <aaron.lau@datadoghq.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-12 08:42:18 -06:00
root1	2936507f94	[SPARK-30150][SQL] ADD FILE, ADD JAR, LIST FILE & LIST JAR Command do not accept quoted path ### What changes were proposed in this pull request? `add file "abc.txt"` and `add file 'abc.txt'` are not supported. For these two spark sql gives `FileNotFoundException`. Only `add file abc.txt` is supported currently. After these changes path can be given as quoted text for ADD FILE, ADD JAR, LIST FILE, LIST JAR commands in spark-sql ### Why are the changes needed? In many of the spark-sql commands (like create table ,etc )we write path in quoted format only. To maintain this consistency we should support quoted format with this command as well. ### Does this PR introduce any user-facing change? Yes. Now users can write path with quotes. ### How was this patch tested? Manually tested. Closes #26779 from iRakson/SPARK-30150. Authored-by: root1 <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-12 17:11:21 +08:00
Terry Kim	3741a36ebf	[SPARK-30104][SQL][FOLLOWUP] V2 catalog named 'global_temp' should always be masked ### What changes were proposed in this pull request? This is a follow up to #26741 to address the following: 1. V2 catalog named `global_temp` should always be masked. 2. #26741 introduces `CatalogAndIdentifier` that supersedes `CatalogObjectIdentifier`. This PR removes `CatalogObjectIdentifier` and its usages and replaces them with `CatalogAndIdentifier`. 3. `CatalogObjectIdentifier(catalog, ident) if !isSessionCatalog(catalog)` and `CatalogObjectIdentifier(catalog, ident) if isSessionCatalog(catalog)` are replaced with `NonSessionCatalogAndIdentifier` and `SessionCatalogAndIdentifier` respectively. ### Why are the changes needed? To fix an existing issue with handling a v2 catalog named `global_temp` and to simplify the code base. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new tests. Closes #26853 from imback82/lookup_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-12 14:47:20 +08:00
jiake	1ced6c1544	[SPARK-30213][SQL] Remove the mutable status in ShuffleQueryStageExec ### What changes were proposed in this pull request? Currently `ShuffleQueryStageExec` contains mutable state, e.g. the `mapOutputStatisticsFuture` variable, so it is not easy to pass along when we copy `ShuffleQueryStageExec`. This PR moves the `mapOutputStatisticsFuture` variable from `ShuffleQueryStageExec` to `ShuffleExchangeExec`, so that we can pass its value when copying. ### Why are the changes needed? In order to remove the mutable state in `ShuffleQueryStageExec`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #26846 from JkSelf/removeMutableVariable. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-11 19:39:31 -08:00
Pablo Langa	9cf9304e17	[SPARK-30038][SQL] DESCRIBE FUNCTION should do multi-catalog resolution ### What changes were proposed in this pull request? Add DescribeFunctionsStatement and make DESCRIBE FUNCTIONS go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing DESCRIBE FUNCTIONS namespace.function ### Does this PR introduce any user-facing change? Yes. When running DESCRIBE FUNCTIONS namespace.function Spark fails the command if the current catalog is set to a v2 catalog. ### How was this patch tested? Unit tests. Closes #26840 from planga82/feature/SPARK-30038_DescribeFunction_V2Catalog. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-11 14:02:58 -08:00
Sean Owen	33f53cb2d5	[SPARK-30195][SQL][CORE][ML] Change some function, import definitions to work with stricter compiler in Scala 2.13 ### What changes were proposed in this pull request? See https://issues.apache.org/jira/browse/SPARK-30195 for the background; I won't repeat it here. This is sort of a grab-bag of related issues. ### Why are the changes needed? To cross-compile with Scala 2.13 later. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests for 2.12. I've been manually checking that this actually resolves the compile problems in 2.13 separately. Closes #26826 from srowen/SPARK-30195. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-11 12:33:58 -08:00
Maxim Gekk	e933539cdd	[SPARK-29864][SPARK-29920][SQL] Strict parsing of day-time strings to intervals ### What changes were proposed in this pull request? In the PR, I propose a new implementation of `fromDayTimeString` which strictly parses strings in day-time formats to intervals. The new implementation accepts only strings that match a pattern defined by the `from` and `to` bounds. Here is the mapping of user's bounds and patterns: - `[+\|-]D+ H[H]:m[m]:s[s][.SSSSSSSSS]` for DAY TO SECOND - `[+\|-]D+ H[H]:m[m]` for DAY TO MINUTE - `[+\|-]D+ H[H]` for DAY TO HOUR - `[+\|-]H[H]:m[m]:s[s][.SSSSSSSSS]` for HOUR TO SECOND - `[+\|-]H[H]:m[m]` for HOUR TO MINUTE - `[+\|-]m[m]:s[s][.SSSSSSSSS]` for MINUTE TO SECOND Closes #26327 Closes #26358 ### Why are the changes needed? - Improve user experience with Spark SQL, and respect the bounds specified by users. - Behave the same as other broadly used DBMS - Oracle and MySQL. ### Does this PR introduce any user-facing change? Yes, before: ```sql spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE; interval 1 weeks 3 days 11 hours 12 minutes ``` After: ```sql spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE; Error in query: requirement failed: Interval string must match day-time format of '^(?<sign>[+\|-])?(?<hour>\d{1,2}):(?<minute>\d{1,2})$': 10 11:12:13.123(line 1, pos 16) == SQL == SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE ----------------^^^ ``` ### How was this patch tested? - Added tests to `IntervalUtilsSuite` - By `ExpressionParserSuite` - Updated `literals.sql` Closes #26473 from MaxGekk/strict-from-daytime-string. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-12 01:08:53 +08:00
Takeshi Yamamuro	a59cb13cda	[SPARK-30200][SQL][FOLLOWUP] Fix typo in ExplainMode ### What changes were proposed in this pull request? This pr is a follow-up of #26829 to fix typos in ExplainMode. ### Why are the changes needed? For better docs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #26851 from maropu/SPARK-30200-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-11 08:17:53 -08:00
Terry Kim	beae14d5ed	[SPARK-30104][SQL] Fix catalog resolution for 'global_temp' ### What changes were proposed in this pull request? `global_temp` is used as a database name to access global temp views. The current catalog lookup logic considers only the first element of multi-part name when it resolves a catalog. This results in using the session catalog even `global_temp` is used as a table name under v2 catalog. This PR addresses this by making sure multi-part name has two elements before using the session catalog. ### Why are the changes needed? Currently, 'global_temp' can be used as a table name in certain commands (CREATE) but not in others (DESCRIBE): ``` // Assume "spark.sql.globalTempDatabase" is set to "global_temp". sql(s"CREATE TABLE testcat.t (id bigint, data string) USING foo") sql(s"CREATE TABLE testcat.global_temp (id bigint, data string) USING foo") sql("USE testcat") sql(s"DESCRIBE TABLE t").show +---------------+---------+-------+ \| col_name\|data_type\|comment\| +---------------+---------+-------+ \| id\| bigint\| \| \| data\| string\| \| \| \| \| \| \| # Partitioning\| \| \| \|Not partitioned\| \| \| +---------------+---------+-------+ sql(s"DESCRIBE TABLE global_temp").show org.apache.spark.sql.AnalysisException: Table not found: global_temp;; 'DescribeTable 'UnresolvedV2Relation [global_temp], org.apache.spark.sql.connector.InMemoryTableSessionCatalog2f1af64f, `global_temp`, false at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122) ``` ### Does this PR introduce any user-facing change? 
Yes, `sql(s"DESCRIBE TABLE global_temp").show` in the above example now displays: ``` +---------------+---------+-------+ \| col_name\|data_type\|comment\| +---------------+---------+-------+ \| id\| bigint\| \| \| data\| string\| \| \| \| \| \| \| # Partitioning\| \| \| \|Not partitioned\| \| \| +---------------+---------+-------+ ``` instead of throwing an exception. ### How was this patch tested? Added new tests. Closes #26741 from imback82/global_temp. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-11 16:56:42 +08:00
Sean Owen	3cc55f6a0a	[SPARK-29392][CORE][SQL][FOLLOWUP] More removal of 'foo Symbol syntax for Scala 2.13 ### What changes were proposed in this pull request? Another continuation of https://github.com/apache/spark/pull/26748 ### Why are the changes needed? To cleanly cross compile with Scala 2.13. ### Does this PR introduce any user-facing change? None. ### How was this patch tested? Existing tests Closes #26842 from srowen/SPARK-29392.4. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-10 19:41:24 -08:00
Kent Yao	8f0eb7dc86	[SPARK-29587][SQL] Support SQL Standard type real as float(4) numeric as decimal ### What changes were proposed in this pull request? The types decimal and numeric are equivalent. Both types are part of the SQL standard. the real type is 4 bytes, variable-precision, inexact, 6 decimal digits precision, same as our float, part of the SQL standard. ### Why are the changes needed? improve sql standard support other dbs https://www.postgresql.org/docs/9.3/datatype-numeric.html https://prestodb.io/docs/current/language/types.html#floating-point http://www.sqlservertutorial.net/sql-server-basics/sql-server-data-types/ MySQL treats REAL as a synonym for DOUBLE PRECISION (a nonstandard variation), unless the REAL_AS_FLOAT SQL mode is enabled. In MySQL, NUMERIC is implemented as DECIMAL, so the following remarks about DECIMAL apply equally to NUMERIC. ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut Closes #26537 from yaooqinn/SPARK-29587. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-11 02:22:08 +08:00
Takeshi Yamamuro	6103cf1960	[SPARK-30200][SQL] Add ExplainMode for Dataset.explain ### What changes were proposed in this pull request? This pr intends to add `ExplainMode` for explaining `Dataset/DataFrame` with a given format mode (`ExplainMode`). `ExplainMode` has five types, matching the SQL EXPLAIN command: `Simple`, `Extended`, `Codegen`, `Cost`, and `Formatted`. For example, this pr enables users to explain DataFrame/Dataset with the `FORMATTED` format implemented in #24759; ``` scala> spark.range(10).groupBy("id").count().explain(ExplainMode.Formatted) == Physical Plan == * HashAggregate (3) +- * HashAggregate (2) +- * Range (1) (1) Range [codegen id : 1] Output: [id#0L] (2) HashAggregate [codegen id : 1] Input: [id#0L] (3) HashAggregate [codegen id : 1] Input: [id#0L, count#8L] ``` This comes from [the cloud-fan suggestion.](https://github.com/apache/spark/pull/24759#issuecomment-560211270) ### Why are the changes needed? To follow the SQL EXPLAIN command. ### Does this PR introduce any user-facing change? No, this is just for a new API in Dataset. ### How was this patch tested? Add tests in `ExplainSuite`. Closes #26829 from maropu/DatasetExplain. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-10 09:51:29 -08:00
Yuanjian Li	d9b3069412	[SPARK-30125][SQL] Remove PostgreSQL dialect ### What changes were proposed in this pull request? Reprocess all PostgreSQL dialect related PRs, listed in order: - #25158: PostgreSQL integral division support [revert] - #25170: UT changes for the integral division support [revert] - #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert] - #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert] - #26112: Date subtraction support [keep the ANSI-compliant part] - #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert] - #26463: Cast to boolean support for PostgreSQL dialect [revert] - #26584: Make the behavior of the PostgreSQL dialect independent of ansi mode config [keep the ANSI-compliant part] ### Why are the changes needed? As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons: 1. The current approach makes the codebase complicated and hard to maintain. 2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now. ### Does this PR introduce any user-facing change? Yes, the config `spark.sql.dialect` will be removed. ### How was this patch tested? Existing UT. Closes #26763 from xuanyuanking/SPARK-30125. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-11 01:22:34 +08:00
Anton Okolnychyi	a9f1809a2a	[SPARK-30206][SQL] Rename normalizeFilters in DataSourceStrategy to be generic ### What changes were proposed in this pull request? This PR renames `normalizeFilters` in `DataSourceStrategy` to be more generic as the logic is not specific to filters. ### Why are the changes needed? These changes are needed to support PR #26751. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26830 from aokolnychyi/rename-normalize-exprs. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-10 07:49:22 -08:00
yi.wu	aa9da9365f	[SPARK-30151][SQL] Issue better error message when user-specified schema mismatched ### What changes were proposed in this pull request? Issue a better error message when the user-specified schema does not match the relation schema. ### Why are the changes needed? Inspired by https://github.com/apache/spark/pull/25248#issuecomment-559594305, a user could get a weird error message when the type mapping behavior changes between the Spark schema and the datasource schema (e.g. JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we should tell the user what is really happening so that the error is clearer. ### Does this PR introduce any user-facing change? Yes, users will see error message changes. ### How was this patch tested? Updated existing tests. Closes #26781 from Ngone51/dev-mismatch-schema. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-10 20:56:21 +08:00
Sean Owen	36fa1980c2	[SPARK-30158][SQL][CORE] Seq -> Array for sc.parallelize for 2.13 compatibility; remove WrappedArray ### What changes were proposed in this pull request? Use Seq instead of Array in sc.parallelize, with reference types. Remove usage of WrappedArray. ### Why are the changes needed? These both enable building on Scala 2.13. ### Does this PR introduce any user-facing change? None ### How was this patch tested? Existing tests Closes #26787 from srowen/SPARK-30158. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-09 14:41:48 -06:00
Jungtaek Lim (HeartSaVioR)	538b8d101c	[SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary imports ### What changes were proposed in this pull request? This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (Github Action caught it and I can reproduce it locally). Looks like Jenkins build may have different policy on checking Java style check or less accurate. ### Why are the changes needed? Java linter starts complaining. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? lint-java passed locally This closes #26819 Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-09 08:57:20 -08:00
Gengliang Wang	a717d219a6	[SPARK-30159][SQL][TESTS] Fix the method calls of `QueryTest.checkAnswer` ### What changes were proposed in this pull request? Before this PR, the method `checkAnswer` in Object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen. The actual exceptions are thrown in the trait `QueryTest`. However, there are some test suites(`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers. After this PR: 1. the method `checkAnswer` in Object `QueryTest` will fail tests on errors or unexpected results. 2. add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. There are some test suites use this one to customize the test failure message. 3. for the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from Object `QueryTest`. ### Why are the changes needed? We should fix these method calls to perform actual validations in test suites. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #26788 from gengliangwang/fixCheckAnswer. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-09 22:19:08 +09:00
Aman Omer	dcea7a4c9a	[SPARK-29883][SQL] Implement a helper method for aliasing bool_and() and bool_or() ### What changes were proposed in this pull request? This PR introduces a method `expressionWithAlias` in class `FunctionRegistry` which is used to register function's constructor. Currently, `expressionWithAlias` is used to register `BoolAnd` & `BoolOr`. ### Why are the changes needed? Error message is wrong when alias name is used for `BoolAnd` & `BoolOr`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested manually. For query, `select every('true');` Output before this PR, > Error in query: cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 'bool_and' should have been boolean, but it's [string].; line 1 pos 7; After this PR, > Error in query: cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7; Closes #26712 from amanomer/29883. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-09 13:23:16 +08:00
Pablo Langa	bca9de6684	[SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution ### What changes were proposed in this pull request? Add `ShowFunctionsStatement` and make SHOW FUNCTIONS go through the same catalog/table resolution framework as v2 commands. The catalog does not yet have the method needed to implement a v2 command: * catalog.listFunctions ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing behavior for `SHOW FUNCTIONS LIKE namespace.function` ### Does this PR introduce any user-facing change? Yes. When running SHOW FUNCTIONS LIKE namespace.function, Spark fails the command if the current catalog is set to a v2 catalog. ### How was this patch tested? Unit tests. Closes #26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-12-08 20:15:09 -08:00
Kent Yao	e88d74052b	[SPARK-30147][SQL] Trim the string when cast string type to booleans ### What changes were proposed in this pull request? Now, we trim the string when casting string value to those `canCast` types values, e.g. int, double, decimal, interval, date, timestamps, except for boolean. This behavior makes type cast and coercion inconsistency in Spark. Not fitting ANSI SQL standard either. ``` If TD is boolean, then Case: a) If SD is character string, then SV is replaced by TRIM ( BOTH ' ' FROM VE ) Case: i) If the rules for literal in Subclause 5.3, “literal”, can be applied to SV to determine a valid value of the data type TD, then let TV be that value. ii) Otherwise, an exception condition is raised: data exception — invalid character value for cast. b) If SD is boolean, then TV is SV ``` In this pull request, we trim all the whitespaces from both ends of the string before converting it to a bool value. This behavior is as same as others, but a bit different from sql standard, which trim only spaces. ### Why are the changes needed? Type cast/coercion consistency ### Does this PR introduce any user-facing change? yes, string with whitespaces in both ends will be trimmed before converted to booleans. e.g. `select cast('\t true' as boolean)` results `true` now, before this pr it's `null` ### How was this patch tested? add unit tests Closes #26776 from yaooqinn/SPARK-30147. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-07 15:03:51 +09:00
Aman Omer	51aa7a920e	[SPARK-30148][SQL] Optimize writing plans if there is an analysis exception ### What changes were proposed in this pull request? Optimized `QueryExecution.scala#writePlans()`. ### Why are the changes needed? If a query fails in the analysis phase with an `AnalysisException`, there is no need to execute the later phases, since they will return the same result, i.e., the `AnalysisException`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually. Closes #26778 from amanomer/optExplain. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-07 10:58:02 +09:00
Sean Owen	a30ec19a73	[SPARK-30155][SQL] Rename parse() to parseString() to avoid conflict in Scala 2.13 ### What changes were proposed in this pull request? Rename internal method LegacyTypeStringParser.parse() to parseString(). ### Why are the changes needed? In Scala 2.13, the parse() definition clashes with supertype declarations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #26784 from srowen/SPARK-30155. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-06 16:16:28 -08:00
wuyi	58be82ad4b	[SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax ### What changes were proposed in this pull request? In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for `CREATE TABLE` syntax instead of `hive` in Spark 3.0. To help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`. ### Why are the changes needed? 1. Currently, the `CREATE TABLE` syntax uses the hive provider to create a table, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent. 2. Users may get confused in some cases. For example: ``` CREATE TABLE t1 (c1 INT) USING PARQUET; CREATE TABLE t2 (c1 INT); ``` In these two DDLs, users may think that `t2` should also use parquet as the default provider, since Spark always advertises parquet as the default format. However, it's hive in this case. On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`: ``` CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE; CREATE TABLE t4 AS SELECT 1 AS VALUE; ``` These two cases together can be really confusing. 3. Spark SQL is now very independent and popular. We do not need to be fully consistent with Hive's behavior. ### Does this PR introduce any user-facing change? Yes. Before this PR, the `CREATE TABLE` syntax used the hive provider. Now it uses the value of `spark.sql.sources.default` as its provider. ### How was this patch tested? Added tests in `DDLParserSuite` and `HiveDDLSuite`. Closes #26736 from Ngone51/dev-create-table-using-parquet-by-default. Lead-authored-by: wuyi <yi.wu@databricks.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-07 02:15:25 +08:00
Liang-Chi Hsieh	c1a5f94973	[SPARK-30112][SQL] Allow insert overwrite same table if using dynamic partition overwrite ### What changes were proposed in this pull request? This patch proposes to allow insert overwrite into the same table when using dynamic partition overwrite. ### Why are the changes needed? Currently, insert overwrite cannot write to the same table it reads from, even with dynamic partition overwrite. But for dynamic partition overwrite, we do not delete partition directories ahead of time. We write to staging directories and then move the data to the final partition directories, so insert overwrite into the same table should be possible. This enables users to read data from a table and insert overwrite into that same table by using dynamic partition overwrite. Because this is not allowed for now, users need to write to another temporary location and move the data back to the table. ### Does this PR introduce any user-facing change? Yes. Users can insert overwrite into the same table when using dynamic partition overwrite. ### How was this patch tested? Unit test. Closes #26752 from viirya/dynamic-overwrite-same-table. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-06 09:22:16 -08:00
gengjiaan	187f3c1773	[SPARK-28083][SQL] Support LIKE ... ESCAPE syntax ## What changes were proposed in this pull request? The syntax 'LIKE predicate: ESCAPE clause' is a ANSI SQL. For example: ``` select 'abcSpark_13sd' LIKE '%Spark\\_%'; //true select 'abcSpark_13sd' LIKE '%Spark/_%'; //false select 'abcSpark_13sd' LIKE '%Spark"_%'; //false select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/'; //true select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"'; //true select 'abcSpark%13sd' LIKE '%Spark\\%%'; //true select 'abcSpark%13sd' LIKE '%Spark/%%'; //false select 'abcSpark%13sd' LIKE '%Spark"%%'; //false select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/'; //true select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"'; //true select 'abcSpark\\13sd' LIKE '%Spark\\\\_%'; //true select 'abcSpark/13sd' LIKE '%Spark//_%'; //false select 'abcSpark"13sd' LIKE '%Spark""_%'; //false select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/'; //true select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"'; //true ``` But Spark SQL only supports 'LIKE predicate'. Note: If the input string or pattern string is null, then the result is null too. There are some mainstream database support the syntax. PostgreSQL: https://www.postgresql.org/docs/11/functions-matching.html Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm?zoom_highlight=like%20escape MySQL: https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdbc/JDBC-reference-information.html#GUID-5D371A5B-D7F6-42EB-8C0D-D317F3C53708 https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-0779657B-06A8-441F-90C5-044B47862A0A ## How was this patch tested? Exists UT and new UT. 
This PR merged to my production environment and runs above sql: ``` spark-sql> select 'abcSpark_13sd' LIKE '%Spark\\_%'; true Time taken: 0.119 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%'; false Time taken: 0.103 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%'; false Time taken: 0.096 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/'; true Time taken: 0.096 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"'; true Time taken: 0.092 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark\\%%'; true Time taken: 0.109 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%'; false Time taken: 0.1 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%'; false Time taken: 0.081 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/'; true Time taken: 0.095 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"'; true Time taken: 0.113 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark\\13sd' LIKE '%Spark\\\\_%'; true Time taken: 0.078 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%'; false Time taken: 0.067 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%'; false Time taken: 0.084 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/'; true Time taken: 0.091 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"'; true Time taken: 0.091 seconds, Fetched 1 row(s) ``` I create a table and its schema is: ``` spark-sql> desc formatted gja_test; key string NULL value string NULL other string NULL # Detailed Table Information Database test Table gja_test Owner test Created Time Wed Apr 10 11:06:15 CST 2019 Last Access Thu Jan 01 08:00:00 CST 1970 Created By Spark 2.4.1-SNAPSHOT Type MANAGED 
Provider hive Table Properties [transient_lastDdlTime=1563443838] Statistics 26 bytes Location hdfs://namenode.xxx:9000/home/test/hive/warehouse/test.db/gja_test Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties [field.delim= , serialization.format= ] Partition Provider Catalog Time taken: 0.642 seconds, Fetched 21 row(s) ``` Table `gja_test` exists three rows of data. ``` spark-sql> select * from gja_test; a A ao b B bo "__ """__ " Time taken: 0.665 seconds, Fetched 3 row(s) ``` At finally, I test this function: ``` spark-sql> select * from gja_test where key like value escape '"'; "__ """__ " Time taken: 0.687 seconds, Fetched 1 row(s) ``` Closes #25001 from beliefer/ansi-sql-like. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-12-06 00:07:38 -08:00
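The `LIKE ... ESCAPE` examples in the entry above can be reproduced with a small pattern translator. The following is a minimal Python sketch, not Spark's implementation: the function name `like` is hypothetical, and unlike a stricter engine it accepts the escape character in front of any character rather than only before `_`, `%`, or the escape character itself.

```python
import re


def like(value, pattern, escape="\\"):
    """Evaluate a SQL LIKE pattern with a configurable escape character.

    Hypothetical helper for illustration only; it is not Spark's code.
    """
    if value is None or pattern is None:
        return None  # null input or pattern yields null, as the entry notes
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape:
            if i + 1 >= len(pattern):
                raise ValueError("pattern must not end with the escape character")
            # The escaped character is matched literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "_":
            parts.append(".")      # exactly one arbitrary character
        elif ch == "%":
            parts.append(".*")     # zero or more arbitrary characters
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None
```

With the default escape, `like("abcSpark_13sd", "%Spark/_%")` treats `/` as an ordinary character and does not match, while passing `escape="/"` makes `/_` stand for a literal underscore.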
Terry Kim	b86d4bb931	[SPARK-30001][SQL] ResolveRelations should handle both V1 and V2 tables ### What changes were proposed in this pull request? This PR makes `Analyzer.ResolveRelations` responsible for looking up both v1 and v2 tables from the session catalog and creating an appropriate relation. ### Why are the changes needed? Currently there are two issues: 1. As described in [SPARK-29966](https://issues.apache.org/jira/browse/SPARK-29966), the logic for resolving a relation can load a table twice, which is a perf regression (e.g., the Hive metastore can be accessed twice). 2. As described in [SPARK-30001](https://issues.apache.org/jira/browse/SPARK-30001), if a catalog name is specified for v1 tables, the query fails: ``` scala> sql("create table t using csv as select 1 as i") res2: org.apache.spark.sql.DataFrame = [] scala> sql("select * from t").show +---+ \| i\| +---+ \| 1\| +---+ scala> sql("select * from spark_catalog.t").show org.apache.spark.sql.AnalysisException: Table or view not found: spark_catalog.t; line 1 pos 14; 'Project [] +- 'UnresolvedRelation [spark_catalog, t] ``` ### Does this PR introduce any user-facing change? Yes. Now the catalog name is resolved correctly: ``` scala> sql("create table t using csv as select 1 as i") res0: org.apache.spark.sql.DataFrame = [] scala> sql("select * from t").show +---+ \| i\| +---+ \| 1\| +---+ scala> sql("select * from spark_catalog.t").show +---+ \| i\| +---+ \| 1\| +---+ ``` ### How was this patch tested? Added new tests. Closes #26684 from imback82/resolve_relation. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-06 15:45:13 +08:00
madianjun	a5ccbced8c	[SPARK-30067][CORE] Fix fragment offset comparison in getBlockHosts ### What changes were proposed in this pull request? Fixes a bug in the `getBlockHosts()` function. In the case "The fragment ends at a position within this block", the end of the fragment should be before the end of the block, where the "end of block" means `b.getOffset + b.getLength`, not `b.getLength`. ### Why are the changes needed? When comparing the fragment end and the block end, we should take the fragment's `offset + length` and compare it to the block's `b.getOffset + b.getLength`, not to the block's length. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? No test. Closes #26650 from mdianjun/fix-getBlockHosts. Authored-by: madianjun <madianjun@jd.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-05 23:39:49 -08:00
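The comparison described in the entry above is easy to get wrong because a block's length and a block's end position are different quantities. The Python sketch below illustrates only that comparison, under assumed names: `Block`, `fragment_ends_within`, and `fragment_ends_within_buggy` are hypothetical and are not taken from Spark's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a file block with an offset and a length."""
    offset: int
    length: int

    @property
    def end(self):
        # The end position of the block within the file.
        return self.offset + self.length


def fragment_ends_within(block, frag_offset, frag_length):
    """True if the fragment's end falls strictly inside the block."""
    frag_end = frag_offset + frag_length
    return block.offset < frag_end < block.end


def fragment_ends_within_buggy(block, frag_offset, frag_length):
    """Illustrates the described mistake: comparing against the block length."""
    frag_end = frag_offset + frag_length
    return block.offset < frag_end < block.length
```

For a block covering bytes 128 to 256 and a fragment covering bytes 100 to 200, the corrected check reports that the fragment ends inside the block, while the length-based check misses it for every block that does not start at offset 0.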
Jungtaek Lim (HeartSaVioR)	25431d79f7	[SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink ### What changes were proposed in this pull request? This patch prevents the cleanup operation in FileStreamSource if the source files belong to the FileStreamSink. This is needed because the output of FileStreamSink can be read with multiple Spark queries and queries will read the files based on the metadata log, which won't reflect the cleanup. To simplify the logic, the patch only takes care of the case of when the source path without glob pattern refers to the output directory of FileStreamSink, via checking FileStreamSource to see whether it leverages metadata directory or not to list the source files. ### Why are the changes needed? Without this patch, if end users turn on cleanup option with the path which is the output of FileStreamSink, there may be out of sync between metadata and available files which may break other queries reading the path. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT. Closes #26590 from HeartSaVioR/SPARK-29953. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-12-05 21:46:28 -08:00
Sean Owen	7782b61a31	[SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings TL;DR - this is more of the same change in https://github.com/apache/spark/pull/26748 I told you it'd be iterative! Closes #26765 from srowen/SPARK-29392.3. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-05 13:48:29 -08:00
Kent Yao	b9cae37750	[SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres ### What changes were proposed in this pull request? Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy: ```scala /** * For [[Add]]: * 1. if both side are interval, stays the same; * 2. else if one side is interval, turns it to [[TimeAdd]]; * 3. else if one side is date, turns it to [[DateAdd]]; * 4. else stays the same. * * For [[Subtract]]: * 1. if both side are interval, stays the same; * 2. else if the right side is an interval, turns it to [[TimeSub]]; * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]]; * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]]; * 5. else if the left side is date, turns it to [[DateSub]]; * 6. else stays the same. * * For [[Multiply]]: * 1. If one side is interval, turns it to [[MultiplyInterval]]; * 2. otherwise, stays the same. * * For [[Divide]]: * 1. If the left side is interval, turns it to [[DivideInterval]]; * 2. otherwise, stays the same. */ ``` Besides, we change datetime functions from implicit cast types to strict ones; all available type coercions happen in the `DateTimeOperations` coercion rule. ### Why are the changes needed? Feature parity between PostgreSQL and Spark, and to make the null semantics consistent within Spark. ### Does this PR introduce any user-facing change? 1. date_add/date_sub functions only accept int/tinyint/smallint as the second arg; double/string etc. are forbidden like hive, which produce weird results. ### How was this patch tested? Added unit tests. Closes #26412 from yaooqinn/SPARK-29774. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 22:03:44 +08:00
Kent Yao	332e252a14	[SPARK-29425][SQL] The ownership of a database should be respected ### What changes were proposed in this pull request? Keep the owner of a database when executing ALTER DATABASE commands. ### Why are the changes needed? Spark inadvertently deletes the owner of a database when executing database DDLs. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added and modified unit tests. Closes #26080 from yaooqinn/SPARK-29425. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 16:14:27 +08:00
turbofei	0ab922c1eb	[SPARK-29860][SQL] Fix dataType mismatch issue for InSubquery ### What changes were proposed in this pull request? There is an issue for InSubquery expression. For example, there are two tables `ta` and `tb` created by the below statements. ``` sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") ``` This statement below would thrown dataType mismatch exception. ``` sql("select * from ta where id in (select id from tb)").show() ``` However, this similar statement could execute successfully. ``` sql("select * from ta where id in ((select id from tb))").show() ``` The root cause is that, for `InSubquery` expression, it does not find a common type for two decimalType like `In` expression. Besides that, for `InSubquery` expression, it also does not find a common type for DecimalType and double/float/bigInt. In this PR, I fix this issue by finding widerType for `InSubquery` expression when DecimalType is involved. ### Why are the changes needed? Some InSubquery would throw dataType mismatch exception. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #26485 from turboFei/SPARK-29860-in-subquery. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 16:00:16 +08:00
Aman Omer	0bd8b995d6	[SPARK-30093][SQL] Improve error message for creating view ### What changes were proposed in this pull request? Improved the error message shown while creating views. ### Why are the changes needed? The error message should suggest using the TEMPORARY keyword when the permanent view being created references a temporary view. https://github.com/apache/spark/pull/26317#discussion_r352377363 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated test case. Closes #26731 from amanomer/imp_err_msg. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 15:28:07 +08:00
Sean Owen	2ceed6f32c	[SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings ### What changes were proposed in this pull request? Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent. ### Why are the changes needed? Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13. The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings. While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`. ### Does this PR introduce any user-facing change? Should not change behavior. ### How was this patch tested? Existing tests. Closes #26748 from srowen/SPARK-29392.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-04 15:03:26 -08:00
07ARB	a2102c81ee	[SPARK-29453][WEBUI] Improve tooltips information for SQL tab ### What changes were proposed in this pull request? Adding tooltip to SQL tab for better usability. ### Why are the changes needed? There are a few common points of confusion in the UI that could be clarified with tooltips. We should add tooltips to explain. ### Does this PR introduce any user-facing change? yes. ![Screenshot 2019-11-23 at 9 47 41 AM](https://user-images.githubusercontent.com/8948111/69472963-aaec5980-0dd6-11ea-881a-fe6266171054.png) ### How was this patch tested? Manual test. Closes #26641 from 07ARB/SPARK-29453. Authored-by: 07ARB <ankitrajboudh@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-12-04 12:33:43 -06:00
Aman Omer	55132ae9c9	[SPARK-30099][SQL] Improve Analyzed Logical Plan ### What changes were proposed in this pull request? Avoid duplicate error message in Analyzed Logical plan. ### Why are the changes needed? Currently, when any query throws `AnalysisException`, same error message will be repeated because of following code segment. `04a5b8f5f8/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (L157-L166)` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually. Result of `explain extended select * from wrong;` BEFORE > == Parsed Logical Plan == > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > AFTER > == Parsed Logical Plan == > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > Closes #26734 from amanomer/cor_APlan. 
Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-04 13:51:40 +08:00
xiaodeshan	196ea936c3	[SPARK-30106][SQL][TEST] Fix the test of DynamicPartitionPruningSuite ### What changes were proposed in this pull request? Changed the test `DPP triggers only for certain types of query` in `DynamicPartitionPruningSuite`. ### Why are the changes needed? The SQL has no partition key, so the description "no predicate on the dimension table" is not right. This PR fixes it. ``` Given("no predicate on the dimension table") withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") { val df = sql( """ \|SELECT * FROM fact_sk f \|JOIN dim_store s \|ON f.date_id = s.store_id """.stripMargin) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated UT Closes #26744 from deshanxiao/30106. Authored-by: xiaodeshan <xiaodeshan@xiaomi.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-03 14:27:48 -08:00
Sean Owen	4193d2f4cc	[SPARK-30012][CORE][SQL] Change classes extending scala collection classes to work with 2.13 ### What changes were proposed in this pull request? Move some classes extending Scala collections into parallel source trees, to support 2.13; other minor collection-related modifications. Modify some classes extending Scala collections to work with 2.13 as well as 2.12. In many cases, this means introducing parallel source trees, as the type hierarchy changed in ways that one class can't support both. ### Why are the changes needed? To support building for Scala 2.13 in the future. ### Does this PR introduce any user-facing change? There should be no behavior change. ### How was this patch tested? Existing tests. Note that the 2.13 changes are not tested by the PR builder, of course. They compile in 2.13 but can't even be tested locally. Later, once the project can be compiled for 2.13, thus tested, it's possible the 2.13 implementations will need updates. Closes #26728 from srowen/SPARK-30012. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-03 08:59:43 -08:00
John Ayad	8c2849a695	[SPARK-30082][SQL] Do not replace Zeros when replacing NaNs ### What changes were proposed in this pull request? Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. This is because casting `NaN` to those types results in a `0` which erroneously replaces `0`s while only `NaN`s should be replaced. ### Why are the changes needed? This Scala code snippet: ``` import scala.math; println(Double.NaN.toLong) ``` returns `0` which is problematic as if you run the following Spark code, `0`s get replaced as well: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 0\| \| 0.0\| 3\| \| NaN\| 0\| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 2\| \| 0.0\| 3\| \| 2.0\| 2\| +-----+-----+ ``` ### Does this PR introduce any user-facing change? Yes, after the PR, running the same above code snippet returns the correct expected results: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 0\| \| 0.0\| 3\| \| NaN\| 0\| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 0\| \| 0.0\| 3\| \| 2.0\| 0\| +-----+-----+ ``` ### How was this patch tested? Added unit tests to verify replacing `NaN` only affects columns of type `Float` and `Double` Closes #26738 from johnhany97/SPARK-30082. Lead-authored-by: John Ayad <johnhany97@gmail.com> Co-authored-by: John Ayad <jayad@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-04 00:04:55 +08:00
Kent Yao	65552a81d1	[SPARK-30083][SQL] visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking ### What changes were proposed in this pull request? `UnaryPositive` only accepts numeric and interval as we defined, but what we do for this in `AstBuider.visitArithmeticUnary` is just bypassing it. This should not be omitted for the type checking requirement. ### Why are the changes needed? bug fix, you can find a pre-discussion here https://github.com/apache/spark/pull/26578#discussion_r347350398 ### Does this PR introduce any user-facing change? yes, +non-numeric-or-interval is now invalid. ``` -- !query 14 select +date '1900-01-01' -- !query 14 schema struct<DATE '1900-01-01':date> -- !query 14 output 1900-01-01 -- !query 15 select +timestamp '1900-01-01' -- !query 15 schema struct<TIMESTAMP '1900-01-01 00:00:00':timestamp> -- !query 15 output 1900-01-01 00:00:00 -- !query 16 select +map(1, 2) -- !query 16 schema struct<map(1, 2):map<int,int>> -- !query 16 output {1:2} -- !query 17 select +array(1,2) -- !query 17 schema struct<array(1, 2):array<int>> -- !query 17 output [1,2] -- !query 18 select -'1' -- !query 18 schema struct<(- CAST(1 AS DOUBLE)):double> -- !query 18 output -1.0 -- !query 19 select -X'1' -- !query 19 schema struct<> -- !query 19 output org.apache.spark.sql.AnalysisException cannot resolve '(- X'01')' due to data type mismatch: argument 1 requires (numeric or interval) type, however, 'X'01'' is of binary type.; line 1 pos 7 -- !query 20 select +X'1' -- !query 20 schema struct<X'01':binary> -- !query 20 output ``` ### How was this patch tested? add ut check Closes #26716 from yaooqinn/SPARK-30083. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-03 23:42:21 +08:00
Kent Yao	39291cff95	[SPARK-30048][SQL] Enable aggregates with interval type values for RelationalGroupedDataset ### What changes were proposed in this pull request? Now that min/max/sum/avg are supported for intervals, we should also enable them in `RelationalGroupedDataset`. ### Why are the changes needed? API consistency improvement. ### Does this PR introduce any user-facing change? Yes, Dataset supports min/max/sum/avg(mean) on intervals. ### How was this patch tested? Added unit tests. Closes #26681 from yaooqinn/SPARK-30048. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-03 18:40:14 +08:00
herman	d7b268ab32	[SPARK-29348][SQL] Add observable Metrics for Streaming queries ### What changes were proposed in this pull request? Observable metrics are named arbitrary aggregate functions that can be defined on a query (Dataframe). As soon as the execution of a Dataframe reaches a completion point (e.g. finishes batch query or reaches streaming epoch) a named event is emitted that contains the metrics for the data processed since the last completion point. A user can observe these metrics by attaching a listener to spark session, it depends on the execution mode which listener to attach: - Batch: `QueryExecutionListener`. This will be called when the query completes. A user can access the metrics by using the `QueryExecution.observedMetrics` map. - (Micro-batch) Streaming: `StreamingQueryListener`. This will be called when the streaming query completes an epoch. A user can access the metrics by using the `StreamingQueryProgress.observedMetrics` map. Please note that we currently do not support continuous execution streaming. ### Why are the changes needed? This enabled observable metrics. ### Does this PR introduce any user-facing change? Yes. It adds the `observe` method to `Dataset`. ### How was this patch tested? - Added unit tests for the `CollectMetrics` logical node to the `AnalysisSuite`. - Added unit tests for `StreamingProgress` JSON serialization to the `StreamingQueryStatusAndProgressSuite`. - Added integration tests for streaming to the `StreamingQueryListenerSuite`. - Added integration tests for batch to the `DataFrameCallbackSuite`. Closes #26127 from hvanhovell/SPARK-29348. Authored-by: herman <herman@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-12-03 11:25:49 +01:00
wuyi	075ae1eeaf	[SPARK-29537][SQL] throw exception when user defined a wrong base path ### What changes were proposed in this pull request? When a user defines a base path that is not an ancestor directory of all the input paths, throw an exception immediately. ### Why are the changes needed? Assume that we have a DataFrame[c1, c2] written out in parquet and partitioned by c1. When using `spark.read.parquet("/path/to/data/c1=1")` to read the data, we'll have a DataFrame with column c2 only. But if we use `spark.read.option("basePath", "/path/from").parquet("/path/to/data/c1=1")` to read the data, we'll have a DataFrame with columns c1 and c2. This happens because a wrong base path does not actually work in `parsePartition()`, so parsing would continue until it reaches a directory without "=". The result of the second way of reading doesn't make sense. ### Does this PR introduce any user-facing change? Yes, with this change, the user would hit `IllegalArgumentException` when given a wrong base path, whereas the previous behavior did not throw. ### How was this patch tested? Added UT. Closes #26195 from Ngone51/dev-wrong-basePath. Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-03 17:02:50 +08:00
Ali Afroozeh	68034a8056	[SPARK-30072][SQL] Create dedicated planner for subqueries ### What changes were proposed in this pull request? This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Previously, we were creating a `QueryExecution` instance for subqueries to get the executedPlan. This would re-run analysis and optimization on the subquery's plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example `DecimalPrecision`, are not idempotent. As an example, consider the expression `1.7 * avg(a)` which after applying the `DecimalPrecision` rule becomes: ``` promote_precision(1.7) * promote_precision(avg(a)) ``` After the optimization, more specifically the constant folding rule, this expression becomes: ``` 1.7 * promote_precision(avg(a)) ``` Now if we run the analyzer on this optimized query again, we will get: ``` promote_precision(1.7) * promote_precision(promote_precision(avg(a))) ``` Which will later be optimized as: ``` 1.7 * promote_precision(promote_precision(avg(a))) ``` As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_precision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan. We opted to introduce dedicated planners for subqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced. ### How was this patch tested? Unit tests Closes #26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-12-02 20:56:40 +01:00
Jungtaek Lim (HeartSaVioR)	54edaee586	[MINOR][SS] Add implementation note on overriding serialize/deserialize in HDFSMetadataLog methods' scaladoc ### What changes were proposed in this pull request? The patch adds scaladoc on `HDFSMetadataLog.serialize` and `HDFSMetadataLog.deserialize` for adding implementation note when overriding - HDFSMetadataLog calls `serialize` and `deserialize` inside try-finally and caller will do the resource (input stream, output stream) cleanup, so resource cleanup should not be performed in these methods, but there's no note on this (only code comment, not scaladoc) which is easy to be missed. ### Why are the changes needed? Contributors who are unfamiliar with the intention seem to think it as a bug if the resource is not cleaned up in serialize/deserialize of subclass of HDFSMetadataLog, and they couldn't know about the intention without reading the code of HDFSMetadataLog. Adding the note as scaladoc would expand the visibility. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Just a doc change. Closes #26732 from HeartSaVioR/MINOR-SS-HDFSMetadataLog-serde-scaladoc. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: dz <953396112@qq.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-12-02 09:01:45 -06:00
Wenchen Fan	e271664a01	[MINOR][SQL] Rename config name to spark.sql.analyzer.failAmbiguousSelfJoin.enabled ### What changes were proposed in this pull request? add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`. ### Why are the changes needed? to follow the existing naming style ### Does this PR introduce any user-facing change? no ### How was this patch tested? not needed Closes #26694 from cloud-fan/conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 21:05:06 +08:00
Kent Yao	4e073f3c50	[SPARK-30047][SQL] Support interval types in UnsafeRow ### What changes were proposed in this pull request? Optimize aggregates on interval values from sort-based to hash-based, and we can use the `org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for better performance. ### Why are the changes needed? improve aggregates ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut and existing ones Closes #26680 from yaooqinn/SPARK-30047. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 20:47:23 +08:00
LantaoJin	04a5b8f5f8	[SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE ### What changes were proposed in this pull request? In SPARK-29421 (#26097), we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`. Hive supports the `STORED AS` new file format syntax: ```sql CREATE TABLE tbl(a int) STORED AS TEXTFILE; CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; ``` For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`. ### Why are the changes needed? See https://github.com/apache/spark/pull/26097#issue-327424759 ### Does this PR introduce any user-facing change? Add a new syntax based on current CTL: CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat]; ### How was this patch tested? Add UTs. Closes #26466 from LantaoJin/SPARK-29839. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 16:11:58 +08:00
Yuanjian Li	d1465a1b0d	[SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey reducePostShufflePartitions enabled ### What changes were proposed in this pull request? 1. Make maxNumPostShufflePartitions config obey reducePostShufflePartitions config. 2. Update the description for all the SQLConf affected by `spark.sql.adaptive.enabled`. ### Why are the changes needed? Make the relation between these confs clearer. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #26664 from xuanyuanking/SPARK-9853-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 12:37:06 +08:00
Terry Kim	5a1896adcb	[SPARK-30065][SQL] DataFrameNaFunctions.drop should handle duplicate columns ### What changes were proposed in this pull request? `DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.drop("any").show ``` produces ``` root \|-- col1: string (nullable = true) \|-- col2: string (nullable = true) \|-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) ``` The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now all the rows with nulls are dropped in the above example: ``` scala> df.na.drop("any").show +----+----+----+ \|col1\|col2\|col2\| +----+----+----+ +----+----+----+ ``` ### How was this patch tested? Added new unit tests. Closes #26700 from imback82/na_drop. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 12:25:28 +08:00
wuyi	87ebfaf003	[SPARK-29956][SQL] A literal number with an exponent should be parsed to Double ### What changes were proposed in this pull request? For a literal number with an exponent(e.g. 1e-45, 1E2), we'd parse it to Double by default rather than Decimal. And user could still use `spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to previous behavior. ### Why are the changes needed? According to ANSI standard of SQL, we see that the (part of) definition of `literal` : ``` <approximate numeric literal> ::= <mantissa> E <exponent> ``` which indicates that a literal number with an exponent should be approximate numeric(e.g. Double) rather than exact numeric(e.g. Decimal). And when we test Presto, we found that Presto also conforms to this standard: ``` presto:default> select typeof(1E2); _col0 -------- double (1 row) ``` ``` presto:default> select typeof(1.2); _col0 -------------- decimal(2,1) (1 row) ``` We also find that, actually, literals like `1E2` are parsed as Double before Spark2.1, but changed to Decimal after #14828 due to The difference between the two confuses most users as it said. But we also see support(from DB2 test) of original behavior at #14828 (comment). Although, we also see that PostgreSQL has its own implementation: ``` postgres=# select pg_typeof(1E2); pg_typeof ----------- numeric (1 row) postgres=# select pg_typeof(1.2); pg_typeof ----------- numeric (1 row) ``` We still think that Spark should also conform to this standard while considering SQL standard and Spark own history and majority DBMS and also user experience. ### Does this PR introduce any user-facing change? Yes. 
For `1E2`, before this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)] ``` After this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [100.0: double] ``` And for `1E-45`, before this PR: ``` org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38 == SQL == select 1E-45 at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided ``` After this PR: ``` scala> spark.sql("select 1E-45"); res1: org.apache.spark.sql.DataFrame = [1.0E-45: double] ``` Before this PR, users may have found it surprising that `select 1e40` works but `select 1e-40` fails. Now, both of them work. ### How was this patch tested? updated `literals.sql.out` and `ansi/literals.sql.out` Closes #26595 from Ngone51/SPARK-29956. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 11:34:56 +08:00
Yuming Wang	708ab57f37	[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. 
Spark SQL: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ \| CAST(1 AS DECIMAL(38,18)) \| +----------------------------+--+ \| 1.000000000000000000 \| +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \|1.000000000000000000 \| +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \| 1.000000000000000000\| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \| 1.000000000000000000\| +-------------------------+ ``` PostgreSQL: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` Presto: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #26697 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-02 09:02:39 +09:00
shahid	b182ed83f6	[SPARK-29724][SPARK-29726][WEBUI][SQL] Support JDBC/ODBC tab for HistoryServer WebUI ### What changes were proposed in this pull request? Support JDBC/ODBC tab for HistoryServer WebUI. Currently from Historyserver we can't access the JDBC/ODBC tab for thrift server applications. In this PR, I am doing 2 main changes 1. Refactor existing thrift server listener to support kvstore 2. Add history server plugin for thrift server listener and tab. ### Why are the changes needed? Users can access Thriftserver tab from History server for both running and finished applications, ### Does this PR introduce any user-facing change? Support for JDBC/ODBC tab for the WEBUI from History server ### How was this patch tested? Add UT and Manual tests 1. Start Thriftserver and Historyserver ``` sbin/stop-thriftserver.sh sbin/stop-historyserver.sh sbin/start-thriftserver.sh sbin/start-historyserver.sh ``` 2. Launch beeline `bin/beeline -u jdbc:hive2://localhost:10000` 3. Run queries Go to the JDBC/ODBC page of the WebUI from History server ![image](https://user-images.githubusercontent.com/23054875/68365501-cf013700-0156-11ea-84b4-fda8008c92c4.png) Closes #26378 from shahidki31/ThriftKVStore. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-11-29 19:44:31 -08:00
Dongjoon Hyun	9cd174a7c9	Revert "[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column" This reverts commit `19af1fe3a2`.	2019-11-27 11:07:08 -08:00
fuwhu	16da714ea5	[SPARK-29979][SQL][FOLLOW-UP] improve the output of DescribeTableExec ### What changes were proposed in this pull request? Refine the output of the "DESC TABLE" command. After this PR, the output of the "DESC TABLE" command looks like the following: ``` id bigint data string # Partitioning Part 0 id # Detailed Table Information Name testca.table_name Comment this is a test table Location /tmp/testcat/table_name Provider foo Table Properties [bar=baz] ``` ### Why are the changes needed? Currently, "DESC TABLE" shows reserved properties (e.g. location, comment) in the "Table Property" section. Since reserved properties are different from common properties, it is reasonable to display reserved properties together with the other detailed table information and to display the remaining properties in a single field. This is also consistent with Hive and with the behavior of DescribeTableCommand. ### Does this PR introduce any user-facing change? Yes, the output of the "DESC TABLE" command is refined as above. ### How was this patch tested? Update existing unit tests. Closes #26677 from fuwhu/SPARK-29979-FOLLOWUP-1. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-27 23:16:53 +08:00
Yuming Wang	19af1fe3a2	[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. 
Spark SQL: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ \| CAST(1 AS DECIMAL(38,18)) \| +----------------------------+--+ \| 1.000000000000000000 \| +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \|1.000000000000000000 \| +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \| 1.000000000000000000\| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \| 1.000000000000000000\| +-------------------------+ ``` PostgreSQL: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` Presto: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #25214 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-27 18:13:33 +09:00
wuyi	a58d91b159	[SPARK-29768][SQL] Column pruning through nondeterministic expressions ### What changes were proposed in this pull request? Support column pruning through non-deterministic expressions. ### Why are the changes needed? In some cases, columns can still be pruned even though non-deterministic expressions appear. For example, for the plan `Filter('a = 1, Project(Seq('a, rand() as 'r), LogicalRelation('a, 'b)))`, we should still prune column b even though a non-deterministic expression appears. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a new test file: `ScanOperationSuite`. Added test in `FileSourceStrategySuite` to verify the right prune behavior for both DS v1 and v2. Closes #26629 from Ngone51/SPARK-29768. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-27 15:37:01 +08:00
Kent Yao	4fd585d2c5	[SPARK-30008][SQL] The dataType of collect_list/collect_set aggs should be ArrayType(_, false) ### What changes were proposed in this pull request? ```scala // Do not allow null values. We follow the semantics of Hive's collect_list/collect_set here. // See: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMkCollectionEvaluator ``` These two functions do not allow null values as they are defined, so their elements should not contain null. ### Why are the changes needed? Casting collect_list(a) to ArrayType(_, false) fails before this fix. ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut Closes #26651 from yaooqinn/SPARK-30008. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-26 20:40:21 -08:00
Jungtaek Lim (HeartSaVioR)	5b628f8b17	Revert "[SPARK-26081][SPARK-29999]" ### What changes were proposed in this pull request? This reverts commit `31c4fab` (#23052) to make sure the partition calling `ManifestFileCommitProtocol.newTaskTempFile` creates actual file. This also reverts part of commit `0d3d46d` (#26639) since the commit fixes the issue raised from `31c4fab` and we're reverting back. The reason of partial revert is that we found the UT be worth to keep as it is, preventing regression - given the UT can detect the issue on empty partition -> no actual file. This makes one more change to UT; moved intentionally to test both DSv1 and DSv2. ### Why are the changes needed? After the changes in SPARK-26081 (commit `31c4fab` / #23052), CSV/JSON/TEXT don't create actual file if the partition is empty. This optimization causes a problem in `ManifestFileCommitProtocol`: the API `newTaskTempFile` is called without actual file creation. Then `fs.getFileStatus` throws `FileNotFoundException` since the file is not created. SPARK-29999 (commit `0d3d46d` / #26639) fixes the problem. But it is too costly to check file existence on each task commit. We should simply restore the behavior before SPARK-26081. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Jenkins build will follow. Closes #26671 from HeartSaVioR/revert-SPARK-26081-SPARK-29999. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-11-26 18:36:08 -08:00
Kent Yao	ed0c33fdd4	[SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string ### What changes were proposed in this pull request? We are now able to handle whitespaces for integral and fractional types, and the leading or trailing whitespaces for interval, date, and timestamps. But the current interval parser is not able to identify whitespaces as separators as PostgreSQL can do. This PR makes the whitespace handling consistent for interval values. Typed interval literal, multi-unit representation, and casting from strings are all supported. ```sql postgres=# select interval E'1 \t day'; interval ---------- 1 day (1 row) postgres=# select interval E'1\t' day; interval ---------- 1 day (1 row) ``` ### Why are the changes needed? Whitespace handling should be consistent for interval values, and across different types in Spark. PostgreSQL feature parity. ### Does this PR introduce any user-facing change? Yes, interval strings with multi-unit values that are separated by whitespaces are now valid. ### How was this patch tested? add ut. Closes #26662 from yaooqinn/SPARK-30026. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-27 01:20:38 +08:00
Huaxin Gao	373c2c3f44	[SPARK-29862][SQL] CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add CreateViewStatement and make CREATE VIEW go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC v // success and describe the view v from my_catalog CREATE VIEW v AS SELECT 1 // report view not found as there is no view v in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running CREATE VIEW ... Spark fails the command if the current catalog is set to a v2 catalog, or the view name specified a v2 catalog. ### How was this patch tested? unit tests Closes #26649 from huaxingao/spark-29862. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-26 14:10:46 +08:00
Dongjoon Hyun	1466863cee	[SPARK-30015][BUILD] Move hive-storage-api dependency from `hive-2.3` to `sql/core` # What changes were proposed in this pull request? This PR aims to relocate the following internal dependencies to compile `sql/core` without `-Phive-2.3` profile. 1. Move the `hive-storage-api` to `sql/core` which is using `hive-storage-api` really. BEFORE (sql/core compilation) ``` $ ./build/mvn -DskipTests --pl sql/core --am compile ... [ERROR] [Error] /Users/dongjoon/APACHE/spark/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala:21: object hive is not a member of package org.apache.hadoop ... [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ ``` AFTER (sql/core compilation) ``` $ ./build/mvn -DskipTests --pl sql/core --am compile ... [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 02:04 min [INFO] Finished at: 2019-11-25T00:20:11-08:00 [INFO] ------------------------------------------------------------------------ ``` 2. For (1), add `commons-lang:commons-lang` test dependency to `spark-core` module to manage the dependency explicitly. Without this, `core` module fails to build the test classes. ``` $ ./build/mvn -DskipTests --pl core --am package -Phadoop-3.2 ... [INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first) spark-core_2.12 --- [INFO] Using incremental compilation using Mixed compile order [INFO] Compiler bridge file: /Users/dongjoon/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar [INFO] Compiling 271 Scala sources and 26 Java sources to /spark/core/target/scala-2.12/test-classes ... 
[ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23: object lang is not a member of package org.apache.commons [ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49: not found: value SerializationUtils [ERROR] two errors found ``` BEFORE (commons-lang:commons-lang) The following is the previous `core` module's `commons-lang:commons-lang` dependency. 1. branch-2.4 ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) spark-core_2.11 --- [INFO] org.apache.spark:spark-core_2.11:jar:2.4.5-SNAPSHOT [INFO] \- org.spark-project.hive:hive-exec:jar:1.2.1.spark2:provided [INFO] \- commons-lang:commons-lang:jar:2.6:compile ``` 2. v3.0.0-preview (-Phadoop-3.2) ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang -Phadoop-3.2 [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 --- [INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview [INFO] \- org.apache.hive:hive-storage-api:jar:2.6.0:compile [INFO] \- commons-lang:commons-lang:jar:2.6:compile ``` 3. 
v3.0.0-preview(default) ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 --- [INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview [INFO] \- org.apache.hadoop:hadoop-client:jar:2.7.4:compile [INFO] \- org.apache.hadoop:hadoop-common:jar:2.7.4:compile [INFO] \- commons-lang:commons-lang:jar:2.6:compile ``` AFTER (commons-lang:commons-lang) ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 --- [INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT [INFO] \- commons-lang:commons-lang:jar:2.6:test ``` Since we wanted to verify that this PR doesn't change `hive-1.2` profile, we merged [SPARK-30005 Update `test-dependencies.sh` to check `hive-1.2/2.3` profile](`a1706e2fa7`) before this PR. ### Why are the changes needed? - Apache Spark 2.4's `sql/core` is using `Apache ORC (nohive)` jars including shaded `hive-storage-api` to access ORC data sources. - Apache Spark 3.0's `sql/core` is using `Apache Hive` jars directly. Previously, `-Phadoop-3.2` hid this `hive-storage-api` dependency. Now, we are using `-Phive-2.3` instead. As I mentioned [previously](https://github.com/apache/spark/pull/26619#issuecomment-556926064), this PR is required to compile `sql/core` module without `-Phive-2.3`. - For `sql/hive` and `sql/hive-thriftserver`, it's natural that we need `-Phive-1.2` or `-Phive-2.3`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This will pass the Jenkins (with the dependency check and unit tests). We need to check manually with `./build/mvn -DskipTests --pl sql/core --am compile`. This closes #26657 . Closes #26658 from dongjoon-hyun/SPARK-30015. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-25 10:54:14 -08:00
fuwhu	29ebd9336c	[SPARK-29979][SQL] Add basic/reserved property key constants in TableCatalog and SupportsNamespaces ### What changes were proposed in this pull request? Add "comment" and "location" property key constants in TableCatalog and SupportsNamespaces, and update the implementation classes to use these constants instead of hard-coded strings. ### Why are the changes needed? Currently, some basic/reserved keys (e.g. "location", "comment") of table and namespace properties are hard-coded or defined in specific logical plan implementation classes. These keys can be centralized in the TableCatalog and SupportsNamespaces interfaces and shared across different implementation classes. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Existing unit test Closes #26617 from fuwhu/SPARK-29979. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-26 01:24:43 +08:00
Terry Kim	f09c1a36c4	[SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns ### What changes were proposed in this pull request? `DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.fill("hello").show ``` produces ``` root \|-- col1: string (nullable = true) \|-- col2: string (nullable = true) \|-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221) at org.apache.spark.sql.Dataset.col(Dataset.scala:1268) ``` The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now the above example displays the following: ``` +----+-----+-----+ \|col1\| col2\| col2\| +----+-----+-----+ \| 1\|hello\| 2\| \| 3\| 4\|hello\| +----+-----+-----+ ``` ### How was this patch tested? Added new unit tests. Closes #26593 from imback82/na_fill. 
Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-26 00:06:19 +08:00
Wenchen Fan	bd9ce83063	[SPARK-29975][SQL][FOLLOWUP] document --CONFIG_DIM ### What changes were proposed in this pull request? add document to address https://github.com/apache/spark/pull/26612#discussion_r349844327 ### Why are the changes needed? help people understand how to use --CONFIG_DIM ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #26661 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-11-25 20:45:31 +09:00
Kent Yao	de21f28f8a	[SPARK-29986][SQL] casting string to date/timestamp/interval should trim all whitespaces ### What changes were proposed in this pull request? A Java-like string trim method trims all whitespace characters less than or equal to 0x20. Currently, our UTF8String handles only the space character (0x20). This is not suitable for many cases in Spark, such as trimming interval strings, dates, and timestamps, or the PostgreSQL-like cast of string to boolean. ### Why are the changes needed? Improve the whitespace handling in UTF8String, and fix some bugs along the way. ### Does this PR introduce any user-facing change? Yes, a string with a `control character` at either end can now be converted to date/timestamp and interval. ### How was this patch tested? Added unit tests. Closes #26626 from yaooqinn/SPARK-29986. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-25 14:37:04 +08:00
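As a rough illustration of the whitespace rule described in SPARK-29986 (not the actual `UTF8String` code), the sketch below contrasts a trim that strips only 0x20 with a Java-style trim that strips every character at or below 0x20. The class `TrimSketch` and both method names are hypothetical and exist only for this example.

```java
// Illustrative sketch only; TrimSketch and its methods are hypothetical names,
// not part of Spark's UTF8String.
class TrimSketch {
    // Strips only the ASCII space character (0x20) from both ends.
    static String trimSpaceOnly(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) == ' ') start++;
        while (end > start && s.charAt(end - 1) == ' ') end--;
        return s.substring(start, end);
    }

    // Strips every character <= 0x20 from both ends, like java.lang.String.trim().
    static String trimAll(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) <= 0x20) start++;
        while (end > start && s.charAt(end - 1) <= 0x20) end--;
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String raw = "\t 2019-11-25 \n";
        // The tab and newline survive a space-only trim.
        System.out.println("[" + trimSpaceOnly(raw) + "]");
        // The Java-style trim removes them as well.
        System.out.println("[" + trimAll(raw) + "]");
    }
}
```

With the space-only variant, a date string wrapped in a tab or newline would still carry control characters into the parser, which is the situation the commit message describes.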
Kent Yao	5cf475d288	[SPARK-30000][SQL] Trim the string when cast string type to decimals ### What changes were proposed in this pull request? https://bugs.openjdk.java.net/browse/JDK-8170259 https://bugs.openjdk.java.net/browse/JDK-8170563 When we cast string type to decimal type, we rely on java.math.BigDecimal. It can't accept leading and trailing spaces, as you can see in the above links. This behavior is not consistent with other numeric types now. We need to fix this to keep the behavior consistent. ### Why are the changes needed? Make casting string to numeric types consistent. ### Does this PR introduce any user-facing change? Yes, a string with leading or trailing whitespace can now be converted to decimal if the trimmed string is valid. ### How was this patch tested? 1. modify ut #### Benchmark ```scala /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package org.apache.spark.sql.execution.benchmark import org.apache.spark.benchmark.Benchmark /* * Benchmark trim the string when casting string type to Boolean/Numeric types. * To run this benchmark: * {{{ * 1. without sbt: * bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar> * 2. build/sbt "sql/test:runMain <this class>" * 3. 
generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>" * Results will be written to "benchmarks/CastBenchmark-results.txt". * }}} */ object CastBenchmark extends SqlBasedBenchmark { override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { val title = "Cast String to Integral" runBenchmark(title) { withTempPath { dir => val N = 500L << 14 val df = spark.range(N) val types = Seq("decimal") (1 to 5).by(2).foreach { i => df.selectExpr(s"concat(id, '${" " * i}') as str") .write.mode("overwrite").parquet(dir + i.toString) } val benchmark = new Benchmark(title, N, minNumIters = 5, output = output) Seq(true, false).foreach { trim => types.foreach { t => val str = if (trim) "trim(str)" else "str" val expr = s"cast($str as $t) as c_$t" (1 to 5).by(2).foreach { i => benchmark.addCase(expr + s" - with $i spaces") { _ => spark.read.parquet(dir + i.toString).selectExpr(expr).collect() } } } } benchmark.run() } } } } ``` #### string trim vs not trim ```java [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1 [info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz [info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3362 5486 NaN 2.4 410.4 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3251 5655 NaN 2.5 396.8 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3208 5725 NaN 2.6 391.7 1.0X [info] cast(str as decimal) as c_decimal - with 1 spaces 13962 16233 1354 0.6 1704.3 0.2X [info] cast(str as decimal) as c_decimal - with 3 spaces 14273 14444 179 0.6 1742.4 0.2X [info] cast(str as decimal) as c_decimal - with 5 spaces 14318 14535 125 0.6 1747.8 0.2X ``` #### string trim vs this fix ```java [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS 
X 10.15.1 [info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz [info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3265 6299 NaN 2.5 398.6 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3183 6241 693 2.6 388.5 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3167 5923 1151 2.6 386.7 1.0X [info] cast(str as decimal) as c_decimal - with 1 spaces 3161 5838 1126 2.6 385.9 1.0X [info] cast(str as decimal) as c_decimal - with 3 spaces 3046 3457 837 2.7 371.8 1.1X [info] cast(str as decimal) as c_decimal - with 5 spaces 3053 4445 NaN 2.7 372.7 1.1X [info] ``` Closes #26640 from yaooqinn/SPARK-30000. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-25 12:47:07 +08:00
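The JDK behavior that SPARK-30000 works around can be reproduced without Spark. The sketch below is only an illustration under that assumption: `DecimalCastSketch`, `parseRaw`, and `parseTrimmed` are hypothetical names, and this is not Spark's cast implementation.

```java
import java.math.BigDecimal;

// Illustrative sketch only; DecimalCastSketch is a hypothetical name and this is
// not Spark's cast implementation.
class DecimalCastSketch {
    // Parses without trimming; returns null when BigDecimal rejects the input.
    static BigDecimal parseRaw(String s) {
        try {
            return new BigDecimal(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Trims leading and trailing whitespace first, then parses.
    static BigDecimal parseTrimmed(String s) {
        return parseRaw(s.trim());
    }

    public static void main(String[] args) {
        // Surrounding spaces make the BigDecimal constructor throw, so this is null.
        System.out.println(parseRaw(" 1.5 "));
        // After trimming, the same input parses as a decimal.
        System.out.println(parseTrimmed(" 1.5 "));
    }
}
```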
ulysses	a8d907ce94	[SPARK-29937][SQL] Make FileSourceScanExec class fields lazy ### What changes were proposed in this pull request? Since SPARK-28346 (PR [25111](https://github.com/apache/spark/pull/25111)), QueryExecution copies all nodes stage by stage, so almost every node is instantiated twice. We should therefore make all class fields lazy to avoid creating unnecessary objects. ### Why are the changes needed? Avoid creating unnecessary objects. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #26565 from ulysses-you/make-val-lazy. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-24 16:32:09 -08:00
Jungtaek Lim (HeartSaVioR)	0d3d46db21	[SPARK-29999][SS] Handle FileStreamSink metadata correctly for empty partition ### What changes were proposed in this pull request? This patch checks the existence of output file for each task while committing the task, so that it doesn't throw FileNotFoundException while creating SinkFileStatus. The check is newly required for DSv2 implementation of FileStreamSink, as it is changed to create the output file lazily (as an improvement). JSON writer for example: `9ec2a4e58c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonOutputWriter.scala (L49-L60)` ### Why are the changes needed? Without this patch, FileStreamSink throws FileNotFoundException when writing empty partition. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #26639 from HeartSaVioR/SPARK-29999. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-24 15:31:06 -08:00
Takeshi Yamamuro	3f3a18fff1	[SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation ### What changes were proposed in this pull request? This pr proposes a new independent config so that `LogicalRelation` could use `rowCount` to compute data statistics in logical plans even if CBO disabled. In the master, we currently cannot enable `StarSchemaDetection.reorderStarJoins` because we need to turn off CBO to enable it but `StarSchemaDetection` internally references the `rowCount` that is used in LogicalRelation if CBO disabled. ### Why are the changes needed? Plan stats are pretty useful other than CBO, e.g., star-schema detector and dynamic partition pruning. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `DataFrameJoinSuite`. Closes #21668 from maropu/PlanStatsConf. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-24 08:30:24 -08:00
uncleGen	3d740901d6	[SPARK-29973][SS] Make `processedRowsPerSecond` calculated more accurately and meaningfully ### What changes were proposed in this pull request? Give `processingTimeSec` 0.001 when a micro-batch completed under 1ms. ### Why are the changes needed? The `processingTimeSec` of batch may be less than 1 ms. As `processingTimeSec` is calculated in ms, so `processingTimeSec` equals 0L. If there is no data in this batch, the `processedRowsPerSecond` equals `0/0.0d`, i.e. `Double.NaN`. If there are some data in this batch, the `processedRowsPerSecond` equals `N/0.0d`, i.e. `Double.Infinity`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add new UT Closes #26610 from uncleGen/SPARK-29973. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-24 08:08:15 -06:00
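The arithmetic behind SPARK-29973 can be checked in isolation. The sketch below is a simplified stand-in rather than the streaming progress code: `RateSketch` and its methods are hypothetical, and the guard simply floors the duration at 0.001 s as the commit message describes.

```java
// Illustrative sketch only; RateSketch is a hypothetical name, not Spark code.
class RateSketch {
    // Rows per second with no guard: a 0 ms batch divides by 0.0.
    static double naiveRate(long rows, long durationMs) {
        double seconds = durationMs / 1000.0;
        return rows / seconds;
    }

    // Guarded variant: a batch that finished in under 1 ms counts as 0.001 s.
    static double guardedRate(long rows, long durationMs) {
        double seconds = Math.max(durationMs / 1000.0, 0.001);
        return rows / seconds;
    }

    public static void main(String[] args) {
        System.out.println(naiveRate(0, 0));    // 0 / 0.0 yields NaN
        System.out.println(naiveRate(10, 0));   // 10 / 0.0 yields Infinity
        System.out.println(guardedRate(0, 0));  // finite: no rows, floored duration
        System.out.println(guardedRate(10, 0)); // finite: 10 rows over 0.001 s
    }
}
```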
Dongjoon Hyun	6625b69027	[SPARK-29981][BUILD][FOLLOWUP] Change hive.version.short ### What changes were proposed in this pull request? This is a follow-up according to liancheng 's advice. - https://github.com/apache/spark/pull/26619#discussion_r349326090 ### Why are the changes needed? Previously, we chose the full version to be careful. As of today, the `Apache Hive 2.3` branch seems to have become stable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the compile combinations on GitHub Actions. 1. hadoop-2.7/hive-1.2/JDK8 2. hadoop-2.7/hive-2.3/JDK8 3. hadoop-3.2/hive-2.3/JDK8 4. hadoop-3.2/hive-2.3/JDK11 Also, pass the Jenkins with `hadoop-2.7` and `hadoop-3.2` for (1) and (4). (2) and (3) are not ready in Jenkins. Closes #26645 from dongjoon-hyun/SPARK-RENAME-HIVE-DIRECTORY. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-23 12:50:50 -08:00
Liang-Chi Hsieh	6b0e391aa4	[SPARK-29427][SQL] Add API to convert RelationalGroupedDataset to KeyValueGroupedDataset ### What changes were proposed in this pull request? This PR proposes to add `as` API to RelationalGroupedDataset. It creates KeyValueGroupedDataset instance using given grouping expressions, instead of a typed function in groupByKey API. Because it can leverage existing columns, it can use existing data partition, if any, when doing operations like cogroup. ### Why are the changes needed? Currently if users want to do cogroup on DataFrames, there is no good way to do except for KeyValueGroupedDataset. 1. KeyValueGroupedDataset ignores existing data partition if any. That is a problem. 2. groupByKey calls typed function to create additional keys. You can not reuse existing columns, if you just need grouping by them. ```scala // df1 and df2 are certainly partitioned and sorted. val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c") .repartition($"a").sortWithinPartitions("a") val df2 = Seq((1, 2, 4), (2, 3, 5)).toDF("a", "b", "c") .repartition($"a").sortWithinPartitions("a") ``` ```scala // This groupBy.as.cogroup won't unnecessarily repartition the data val df3 = df1.groupBy("a").as[Int] .cogroup(df2.groupBy("a").as[Int]) { case (key, data1, data2) => data1.zip(data2).map { p => p._1.getInt(2) + p._2.getInt(2) } } ``` ``` == Physical Plan == (5) SerializeFromObject [input[0, int, false] AS value#11247] +- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4922/12067092816eec1b6f, a#11209: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [a#11209], [a#11225], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11246: int :- (2) Sort [a#11209 ASC NULLS FIRST], false, 0 : +- Exchange 
hashpartitioning(a#11209, 5), false, [id=#10218] : +- (1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211] : +- (1) LocalTableScan [_1#11202, _2#11203, _3#11204] +- (4) Sort [a#11225 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#11225, 5), false, [id=#10223] +- (3) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227] +- (3) LocalTableScan [_1#11218, _2#11219, _3#11220] ``` ```scala // Current approach creates additional AppendColumns and repartition data again val df4 = df1.groupByKey(r => r.getInt(0)).cogroup(df2.groupByKey(r => r.getInt(0))) { case (key, data1, data2) => data1.zip(data2).map { p => p._1.getInt(2) + p._2.getInt(2) } } ``` ``` == Physical Plan == (7) SerializeFromObject [input[0, int, false] AS value#11257] +- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4933/138102700737171997, value#11252: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [value#11252], [value#11254], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11256: int :- (3) Sort [value#11252 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(value#11252, 5), true, [id=#10302] : +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4930/19529195347ce07f47, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11252] : +- (2) Sort [a#11209 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(a#11209, 5), false, [id=#10297] : +- (1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211] : +- (1) LocalTableScan [_1#11202, _2#11203, _3#11204] +- (6) Sort [value#11254 ASC NULLS FIRST], false, 0 +- Exchange 
hashpartitioning(value#11254, 5), true, [id=#10312] +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4932/15265288491f0e0c1f, createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11254] +- (5) Sort [a#11225 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#11225, 5), false, [id=#10307] +- (4) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227] +- (4) LocalTableScan [_1#11218, _2#11219, _3#11220] ``` ### Does this PR introduce any user-facing change? Yes, this adds a new `as` API to RelationalGroupedDataset. Users can use it to create KeyValueGroupedDataset and do cogroup. ### How was this patch tested? Unit tests. Closes #26509 from viirya/SPARK-29427-2. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-22 10:34:26 -08:00
Wenchen Fan	6e581cf164	[SPARK-29893][SQL][FOLLOWUP] code cleanup for local shuffle reader ### What changes were proposed in this pull request? A few cleanups for https://github.com/apache/spark/pull/26516: 1. move the calculating of partition start indices from the RDD to the rule. We can reuse code from "shrink number of reducers" in the future if we split partitions by size. 2. only check extra shuffles when adding local readers to the probe side. 3. add comments. 4. simplify the config name: `optimizedLocalShuffleReader` -> `localShuffleReader` ### Why are the changes needed? make code more maintainable. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26625 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-11-22 10:26:54 -08:00
Kent Yao	2dd6807e42	[SPARK-28023][SQL] Add trim logic in UTF8String's toInt/toLong to make it consistent with other string-numeric casting ### What changes were proposed in this pull request? Modify `UTF8String.toInt/toLong` to trim spaces on both sides before converting the string to byte/short/int/long. This kind of "cheap" trim can help improve performance when casting strings to integral types. The idea is from https://github.com/apache/spark/pull/24872#issuecomment-556917834 ### Why are the changes needed? Make the behavior consistent. ### Does this PR introduce any user-facing change? Yes, casting a string to an integral type, and binary comparison between strings and integrals, will trim spaces first. Their behavior will be consistent with float and double. ### How was this patch tested? 1. Added unit tests. 2. Benchmark tests; the benchmark is modified based on https://github.com/apache/spark/pull/24872#issuecomment-503827016 ```scala /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package org.apache.spark.sql.execution.benchmark import org.apache.spark.benchmark.Benchmark /* * Benchmark trim the string when casting string type to Boolean/Numeric types. * To run this benchmark: * {{{ * 1. 
without sbt: * bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar> * 2. build/sbt "sql/test:runMain <this class>" * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>" * Results will be written to "benchmarks/CastBenchmark-results.txt". * }}} */ object CastBenchmark extends SqlBasedBenchmark { override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { val title = "Cast String to Integral" runBenchmark(title) { withTempPath { dir => val N = 500L << 14 val df = spark.range(N) val types = Seq("int", "long") (1 to 5).by(2).foreach { i => df.selectExpr(s"concat(id, '${" " * i}') as str") .write.mode("overwrite").parquet(dir + i.toString) } val benchmark = new Benchmark(title, N, minNumIters = 5, output = output) Seq(true, false).foreach { trim => types.foreach { t => val str = if (trim) "trim(str)" else "str" val expr = s"cast($str as $t) as c_$t" (1 to 5).by(2).foreach { i => benchmark.addCase(expr + s" - with $i spaces") { _ => spark.read.parquet(dir + i.toString).selectExpr(expr).collect() } } } } benchmark.run() } } } } ``` #### benchmark result. normal trim v.s. 
trim in toInt/toLong ```java ================================================================================================ Cast String to Integral ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1 Intel(R) Core(TM) i5-5287U CPU 2.90GHz Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast(trim(str) as int) as c_int - with 1 spaces 10220 12994 1337 0.8 1247.5 1.0X cast(trim(str) as int) as c_int - with 3 spaces 4763 8356 357 1.7 581.4 2.1X cast(trim(str) as int) as c_int - with 5 spaces 4791 8042 NaN 1.7 584.9 2.1X cast(trim(str) as long) as c_long - with 1 spaces 4014 6755 NaN 2.0 490.0 2.5X cast(trim(str) as long) as c_long - with 3 spaces 4737 6938 NaN 1.7 578.2 2.2X cast(trim(str) as long) as c_long - with 5 spaces 4478 6919 1404 1.8 546.6 2.3X cast(str as int) as c_int - with 1 spaces 4443 6222 NaN 1.8 542.3 2.3X cast(str as int) as c_int - with 3 spaces 3659 3842 170 2.2 446.7 2.8X cast(str as int) as c_int - with 5 spaces 4372 7996 NaN 1.9 533.7 2.3X cast(str as long) as c_long - with 1 spaces 3866 5838 NaN 2.1 471.9 2.6X cast(str as long) as c_long - with 3 spaces 3793 5449 NaN 2.2 463.0 2.7X cast(str as long) as c_long - with 5 spaces 4947 5961 1198 1.7 603.9 2.1X ``` Closes #26622 from yaooqinn/cheapstringtrim. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-22 19:32:27 +08:00
Wenchen Fan	e2f056f4a8	[SPARK-29975][SQL] introduce --CONFIG_DIM directive ### What changes were proposed in this pull request? Allow the SQL test files to specify different dimensions of config sets during testing. For example, ``` --CONFIG_DIM1 a=1 --CONFIG_DIM1 b=2,c=3 --CONFIG_DIM2 x=1 --CONFIG_DIM2 y=1,z=2 ``` This example defines 2 config dimensions, and each dimension defines 2 config sets. We will run the queries 4 times: 1. a=1, x=1 2. a=1, y=1, z=2 3. b=2, c=3, x=1 4. b=2, c=3, y=1, z=2 ### Why are the changes needed? Currently `SQLQueryTestSuite` takes a long time. This is because we run each test at least 3 times, to check with different codegen modes. This is not necessary for most of the tests, e.g. DESC TABLE. We should only check these codegen modes for certain tests. With the --CONFIG_DIM directive, we can do things like: test different join operators (broadcast or shuffle join) X different codegen modes. After reducing the testing time, we should be able to run the thrift server SQL tests with config settings. ### Does this PR introduce any user-facing change? no ### How was this patch tested? test only Closes #26612 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-22 10:56:28 +09:00
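The four runs listed for the --CONFIG_DIM example are the Cartesian product of the config sets in each dimension. The sketch below shows one way to enumerate such combinations; it is an assumption about the idea, not the `SQLQueryTestSuite` implementation, and `ConfigDimSketch` and `combine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; ConfigDimSketch is a hypothetical name and does not
// mirror how SQLQueryTestSuite parses or applies the directive.
class ConfigDimSketch {
    // Returns every combination that picks one config set from each dimension.
    static List<String> combine(List<List<String>> dims) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> dim : dims) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String configSet : dim) {
                    next.add(prefix.isEmpty() ? configSet : prefix + "," + configSet);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> dims = Arrays.asList(
            Arrays.asList("a=1", "b=2,c=3"),   // --CONFIG_DIM1 config sets
            Arrays.asList("x=1", "y=1,z=2"));  // --CONFIG_DIM2 config sets
        // One line per test run, in the same order as the commit message lists them.
        for (String run : combine(dims)) {
            System.out.println(run);
        }
    }
}
```

Adding a third dimension with two config sets would double the number of runs, which is why the commit suggests reserving extra dimensions for the tests that actually need them.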
Wenchen Fan	6b4b6a87cd	[SPARK-29558][SQL] ResolveTables and ResolveRelations should be order-insensitive ### What changes were proposed in this pull request? Make `ResolveRelations` call `ResolveTables` at the beginning, and make `ResolveTables` call `ResolveTempViews`(newly added) at the beginning, to ensure the relation resolution priority. ### Why are the changes needed? To resolve an `UnresolvedRelation`, the general process is: 1. try to resolve to (global) temp view first. If it's not a temp view, move on 2. if the table name specifies a catalog, lookup the table from the specified catalog. Otherwise, lookup table from the current catalog. 3. when looking up table from session catalog, return a v1 relation if the table provider is v1. Currently, this process is done by 2 rules: `ResolveTables` and `ResolveRelations`. To avoid rule conflicts, we add a lot of checks: 1. `ResolveTables` only resolves `UnresolvedRelation` if it's not a temp view and the resolved table is not v1. 2. `ResolveRelations` only resolves `UnresolvedRelation` if the table name has less than 2 parts. This requires to run `ResolveTables` before `ResolveRelations`, otherwise we may resolve a v2 table to a v1 relation. To clearly guarantee the resolution priority, and avoid massive changes, this PR proposes to call one rule in another rule to ensure the rule execution order. Now the process is simple: 1. first run `ResolveTempViews`, see if we can resolve relation to temp view 2. then run `ResolveTables`, see if we can resolve relation to v2 tables. 3. finally run `ResolveRelations`, see if we can resolve relation to v1 tables. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26214 from cloud-fan/resolve. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Ryan Blue <blue@apache.org>	2019-11-21 09:47:42 -08:00
Ximo Guanter	54c5087a3a	[SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory ### What changes were proposed in this pull request? When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions. However, when someone is implementing WriteBuilder we only pass them the schema, but not the number of partitions. This is an asymmetrical developer experience. This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory and that carries the number of partitions of the data that is going to be written. ### Why are the changes needed? Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example: - it could be used to provision a Kafka topic with a specific number of partitions - it could be used to scale a microservice prior to sending the data to it - it could be used to create a DsV2 that sends the data to another spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions) ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tests passed Closes #26591 from edrevo/temp. Authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-22 00:19:25 +08:00
Takeshi Yamamuro	cdcd43cbf2	[SPARK-29977][SQL] Remove newMutableProjection/newOrdering/newNaturalAscendingOrdering from SparkPlan ### What changes were proposed in this pull request? This is to refactor `SparkPlan` code; it mainly removed `newMutableProjection`/`newOrdering`/`newNaturalAscendingOrdering` from `SparkPlan`. The other modifications are listed below; - Move `BaseOrdering` from `o.a.s.sql.catalyst.expressions.codegen.GenerateOrdering.scala` to `o.a.s.sql.catalyst.expressions.ordering.scala` - `RowOrdering` extends `CodeGeneratorWithInterpretedFallback` for `BaseOrdering` - Remove the unused variables (`subexpressionEliminationEnabled` and `codeGenFallBack`) from `SparkPlan` ### Why are the changes needed? For better code/test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing. Closes #26615 from maropu/RefactorOrdering. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-21 23:51:12 +08:00
angerszhu	6146dc4562	[SPARK-29874][SQL] Optimize Dataset.isEmpty() ### What changes were proposed in this pull request? The original way of checking whether a Dataset is empty, ``` def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } ``` adds two shuffles through `limit()`, `groupBy()` and `count()`, and then collects the data to the driver. This avoids an OOM when collecting data to the driver, but it triggers computation of all partitions and adds extra shuffles. We change it to ``` def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan => plan.executeTake(1).isEmpty } ``` With this PR, we add column pruning to the original LogicalPlan and use the `executeTake()` API, so no extra shuffle is added and only one partition's data is computed in the last stage. This reduces the cost of calling `Dataset.isEmpty()` and avoids memory issues on the driver side. ### Why are the changes needed? Optimize Dataset.isEmpty() ### Does this PR introduce any user-facing change? No ### How was this patch tested? Original UT Closes #26500 from AngersZhuuuu/SPARK-29874. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-21 18:43:21 +08:00
Kent Yao	7a70670345	[SPARK-29961][SQL] Implement builtin function - typeof ### What changes were proposed in this pull request? Add typeof function for Spark to get the underlying type of value. ```sql -- !query 0 select typeof(1) -- !query 0 schema struct<typeof(1):string> -- !query 0 output int -- !query 1 select typeof(1.2) -- !query 1 schema struct<typeof(1.2):string> -- !query 1 output decimal(2,1) -- !query 2 select typeof(array(1, 2)) -- !query 2 schema struct<typeof(array(1, 2)):string> -- !query 2 output array<int> -- !query 3 select typeof(a) from (values (1), (2), (3.1)) t(a) -- !query 3 schema struct<typeof(a):string> -- !query 3 output decimal(11,1) decimal(11,1) decimal(11,1) ``` ##### presto ```sql presto> select typeof(array[1]); _col0 ---------------- array(integer) (1 row) ``` ##### PostgreSQL ```sql postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a); pg_typeof ----------- numeric numeric numeric (3 rows) ``` ##### impala https://issues.apache.org/jira/browse/IMPALA-1597 ### Why are the changes needed? a function which is better we have to help us debug, test, develop ... ### Does this PR introduce any user-facing change? add a new function ### How was this patch tested? add ut and example Closes #26599 from yaooqinn/SPARK-29961. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-21 10:28:32 +09:00
Maxim Gekk	e6b157cf70	[SPARK-29978][SQL][TESTS] Check `json_tuple` does not truncate results ### What changes were proposed in this pull request? I propose to add a test from the commit `a936522113` for 2.4. I extended the test by a few more lengths of requested field to cover more code branches in Jackson Core. In particular, [the optimization](`5eb8973f87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (L473-L476)`) calls Jackson's method `42b8b56684/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java (L742-L746)` where the internal buffer size is 8000. In this way: - 2000 to check 2000+2000+2000 < 8000 - 2800 from the 2.4 commit. It covers the specific case: `42b8b56684/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java (L746)` - 8000-1, 8000, 8000+1 are sizes around the size of the internal buffer - 65535 to test an outstanding large field. ### Why are the changes needed? To be sure that the current implementation and future versions of Spark don't have the bug fixed in 2.4. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `JsonFunctionsSuite`. Closes #26613 from MaxGekk/json_tuple-test. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-21 09:59:31 +09:00
Sean Owen	1febd373ea	[MINOR][TESTS] Replace JVM assert with JUnit Assert in tests ### What changes were proposed in this pull request? Use JUnit assertions in tests uniformly, not JVM assert() statements. ### Why are the changes needed? assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything. ### Does this PR introduce any user-facing change? No. The assertion logic should be identical. ### How was this patch tested? Existing tests. Closes #26581 from srowen/assertToJUnit. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-20 14:04:15 -06:00
Yuanjian Li	23b3c4fafd	[SPARK-29951][SQL] Make the behavior of Postgre dialect independent of ansi mode config ### What changes were proposed in this pull request? Fix the inconsistent behavior of the built-in SQL functions LEFT/RIGHT. ### Why are the changes needed? As noted in the comment https://github.com/apache/spark/pull/26497#discussion_r345708065, the Postgre dialect should not be affected by the ANSI mode config. When rerunning the existing tests, only the LEFT/RIGHT built-in SQL functions broke this assumption. We fix this by following https://www.postgresql.org/docs/12/sql-keywords-appendix.html: `LEFT/RIGHT reserved (can be function or type)` ### Does this PR introduce any user-facing change? Yes, the Postgre dialect will not be affected by the ANSI mode config. ### How was this patch tested? Existing UT. Closes #26584 from xuanyuanking/SPARK-29951. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-21 00:56:48 +08:00
Takeshi Yamamuro	6eeb131941	[SPARK-28885][SQL][FOLLOW-UP] Re-enable the ported PgSQL regression tests of SQLQueryTestSuite ### What changes were proposed in this pull request? SPARK-28885(#26107) has supported the ANSI store assignment rules and stopped running some ported PgSQL regression tests that violate the rules. To re-activate these tests, this pr is to modify them for passing tests with the rules. ### Why are the changes needed? To make the test coverage better. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26492 from maropu/SPARK-28885-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-20 08:32:13 -08:00
Luca Canali	b5df40bd87	[SPARK-29894][SQL][WEBUI] Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab ### What changes were proposed in this pull request? The Web UI SQL Tab provides information on the executed SQL using plan graphs and by reporting SQL execution plans. Both sources provide useful information. Physical execution plans report Codegen Stage Ids. This PR adds Codegen Stage Ids to the plan graphs. ### Why are the changes needed? It is useful to have Codegen Stage Id information also reported in plan graphs, this allows to more easily match physical plans and graphs with metrics when troubleshooting SQL execution. Example snippet to show the proposed change: ![](https://issues.apache.org/jira/secure/attachment/12985837/snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png) Example of the current state: ![](https://issues.apache.org/jira/secure/attachment/12985838/snippet_plan_graph_before_patch.png) Physical plan: ![](https://issues.apache.org/jira/secure/attachment/12985932/Physical_plan_Annotated.png) ### Does this PR introduce any user-facing change? This PR adds Codegen Stage Id information to SQL plan graphs in the Web UI/SQL Tab. ### How was this patch tested? Added a test + manually tested Closes #26519 from LucaCanali/addCodegenStageIdtoWEBUIGraphs. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-20 23:20:33 +08:00
Takeshi Yamamuro	0032d85153	[SPARK-29968][SQL] Remove the Predicate code from SparkPlan ### What changes were proposed in this pull request? This is to refactor Predicate code; it mainly removed `newPredicate` from `SparkPlan`. Modifications are listed below; - Move `Predicate` from `o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sqlcatalyst.expressions.predicates.scala` - To resolve the name conflict, rename `o.a.s.sqlcatalyst.expressions.codegen.Predicate` to `o.a.s.sqlcatalyst.expressions.BasePredicate` - Extend `CodeGeneratorWithInterpretedFallback ` for `BasePredicate` This comes from the cloud-fan suggestion: https://github.com/apache/spark/pull/26420#discussion_r348005497 ### Why are the changes needed? For better code/test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26604 from maropu/RefactorPredicate. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-20 21:13:51 +08:00
Wenchen Fan	9e58b10c8e	[SPARK-29945][SQL] do not handle negative sign specially in the parser ### What changes were proposed in this pull request? Remove the special handling of the negative sign in the parser (interval literal and type constructor) ### Why are the changes needed? The negative sign is an operator (UnaryMinus). We don't need to handle it specially, which is kind of doing constant folding at parser side. ### Does this PR introduce any user-facing change? The error message becomes a little different. Now it reports type mismatch for the `-` operator. ### How was this patch tested? existing tests Closes #26578 from cloud-fan/interval. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-11-20 11:08:04 +09:00
Maxim Gekk	40b8a08b8b	[SPARK-29963][SQL][TESTS] Check formatting timestamps up to microsecond precision by JSON/CSV datasource ### What changes were proposed in this pull request? In the PR, I propose to add tests from the commit `47cb1f359a` for Spark 2.4 that check formatting of timestamp strings for various seconds fractions. ### Why are the changes needed? To make sure that current behavior is the same as in Spark 2.4 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`. Closes #26601 from MaxGekk/format-timestamp-micros-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-20 10:34:25 +09:00
Wenchen Fan	3d2a6f464f	[SPARK-29906][SQL] AQE should not introduce extra shuffle for outermost limit ### What changes were proposed in this pull request? `AdaptiveSparkPlanExec` should forward `executeCollect` and `executeTake` to the underlying physical plan. ### Why are the changes needed? Some physical plans have optimizations in `executeCollect` and `executeTake`. For example, `CollectLimitExec` won't do a shuffle for the outermost limit. ### Does this PR introduce any user-facing change? no ### How was this patch tested? a new test This closes #26560 Closes #26576 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-11-19 10:39:38 -08:00
Jobit Mathew	6fb8b86065	[SPARK-29913][SQL] Improve Exception in postgreCastToBoolean ### What changes were proposed in this pull request? Exception improvement. ### Why are the changes needed? After selecting pgSQL dialect, queries which are failing because of wrong syntax will give long exception stack trace. For example, `explain select cast ("abc" as boolean);` Current output: > ERROR SparkSQLDriver: Failed in [explain select cast ("abc" as boolean)] > java.lang.IllegalArgumentException: invalid input syntax for type boolean: abc > at org.apache.spark.sql.catalyst.expressions.postgreSQL.PostgreCastToBoolean.$anonfun$castToBoolean$2(PostgreCastToBoolean.scala:51) > at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:277) > at org.apache.spark.sql.catalyst.expressions.postgreSQL.PostgreCastToBoolean.$anonfun$castToBoolean$1(PostgreCastToBoolean.scala:44) > at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:773) > at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:460) > at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52) > at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45) > at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286) > at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286) > at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291) > at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) > at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) > at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) > at org.apache.spark.sql.catalyst.plans.QueryPlan. > . > . > . ### Does this PR introduce any user-facing change? Yes. After this PR, output for above query will be: > == Physical Plan == > org.apache.spark.sql.AnalysisException: invalid input syntax for type boolean: abc; > > Time taken: 0.044 seconds, Fetched 1 row(s) > 19/11/15 15:38:57 INFO SparkSQLCLIDriver: Time taken: 0.044 seconds, Fetched 1 row(s) ### How was this patch tested? Updated existing test cases. Closes #26546 from jobitmathew/pgsqlexception. Authored-by: Jobit Mathew <jobit.mathew@huawei.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 21:30:38 +08:00
jiake	a8d98833b8	[SPARK-29893] improve the local shuffle reader performance by changing the reading task number from 1 to multi ### What changes were proposed in this pull request? This PR changes the number of local shuffle reader tasks from 1 to `partitionStartIndices.length`. ### Why are the changes needed? Improve the performance of the local shuffle reader. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs Closes #26516 from JkSelf/improveLocalShuffleReader. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 19:18:08 +08:00
wangguangxin.cn	ffc9753037	[SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long ### What changes were proposed in this pull request? This PR tries to make sure that comparing 8 bytes at a time and comparing byte by byte in RecordBinaryComparator give consistent results, by reversing the bytes of each long on little-endian machines and using Long.compareUnsigned. ### Why are the changes needed? If the architecture supports unaligned access or the offset is 8-byte aligned, `RecordBinaryComparator` compares 8 bytes at a time by reading them as a long. The related code is ``` if (Platform.unaligned() \|\| (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) { while (i <= leftLen - 8) { final long v1 = Platform.getLong(leftObj, leftOff + i); final long v2 = Platform.getLong(rightObj, rightOff + i); if (v1 != v2) { return v1 > v2 ? 1 : -1; } i += 8; } } ``` Otherwise, it compares byte by byte. The related code is ``` while (i < leftLen) { final int v1 = Platform.getByte(leftObj, leftOff + i) & 0xff; final int v2 = Platform.getByte(rightObj, rightOff + i) & 0xff; if (v1 != v2) { return v1 > v2 ? 1 : -1; } i += 1; } ``` However, on a little-endian machine, comparing by long value and comparing byte by byte may give different results. For the same two records, the offsets may vary between the first run and the second run, so either the long comparison or the byte-by-byte comparison may be used, and the results may differ. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add new test cases in RecordBinaryComparatorSuite Closes #26548 from WangGuangxin/binary_comparator. Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 16:10:22 +08:00
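The inconsistency described above can be reproduced in isolation. The following is a minimal sketch, not the actual `RecordBinaryComparator` code: it assumes 8-byte inputs, uses `ByteBuffer` with an explicit little-endian order in place of `Platform.getLong`, and the class and method names are hypothetical. The fixed variant applies the remedy named in the PR description (reverse the bytes, then `Long.compareUnsigned`).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianCompareSketch {
    // Reference ordering: compare unsigned bytes from left to right.
    static int compareBytewise(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int v1 = a[i] & 0xff;
            int v2 = b[i] & 0xff;
            if (v1 != v2) {
                return v1 > v2 ? 1 : -1;
            }
        }
        return 0;
    }

    // Read 8 bytes the way a little-endian machine would (assumes length 8).
    static long readLittleEndian(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // Naive long comparison: signed compare of the little-endian value.
    static int compareNaiveLong(byte[] a, byte[] b) {
        long v1 = readLittleEndian(a);
        long v2 = readLittleEndian(b);
        if (v1 != v2) {
            return v1 > v2 ? 1 : -1;
        }
        return 0;
    }

    // Fixed long comparison: restore byte order, then compare as unsigned.
    static int compareFixedLong(byte[] a, byte[] b) {
        long v1 = Long.reverseBytes(readLittleEndian(a));
        long v2 = Long.reverseBytes(readLittleEndian(b));
        return Integer.signum(Long.compareUnsigned(v1, v2));
    }

    public static void main(String[] args) {
        byte[] a = {1, 0, 0, 0, 0, 0, 0, 2};
        byte[] b = {2, 0, 0, 0, 0, 0, 0, 1};
        System.out.println("bytewise:   " + compareBytewise(a, b));
        System.out.println("naive long: " + compareNaiveLong(a, b));
        System.out.println("fixed long: " + compareFixedLong(a, b));
    }
}
```

For these inputs the naive long comparison disagrees with the byte-wise one because the last byte becomes the most significant one when read little-endian. The unsigned comparison matters separately: a first byte of `0x80` or above would otherwise make the reversed long negative.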
Wenchen Fan	16134d6d0f	[SPARK-29948][SQL] make the default alias consistent between date, timestamp and interval ### What changes were proposed in this pull request? Update `Literal.sql` to make date, timestamp and interval consistent. They should all use the `TYPE 'value'` format. ### Why are the changes needed? Make the default alias consistent. For example, without this patch we will see ``` scala> sql("select interval '1 day', date '2000-10-10'").show +------+-----------------+ \|1 days\|DATE '2000-10-10'\| +------+-----------------+ \|1 days\| 2000-10-10\| +------+-----------------+ ``` ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26579 from cloud-fan/sql. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 15:37:35 +08:00
Terry Kim	3d45779b68	[SPARK-29728][SQL] Datasource V2: Support ALTER TABLE RENAME TO ### What changes were proposed in this pull request? This PR adds `ALTER TABLE a.b.c RENAME TO x.y.x` support for V2 catalogs. ### Why are the changes needed? The current implementation doesn't support this command V2 catalogs. ### Does this PR introduce any user-facing change? Yes, now the renaming table works for v2 catalogs: ``` scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show +---------+---------+ \|namespace\|tableName\| +---------+---------+ \| ns1.ns2\| old\| +---------+---------+ scala> spark.sql("ALTER TABLE testcat.ns1.ns2.old RENAME TO testcat.ns1.ns2.new").show scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show +---------+---------+ \|namespace\|tableName\| +---------+---------+ \| ns1.ns2\| new\| +---------+---------+ ``` ### How was this patch tested? Added unit tests. Closes #26539 from imback82/rename_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 12:03:29 +08:00
shivsood	a834dba120	Revert "[SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils This reverts commit f7e53865 i.e PR #26301 from master Closes #26583 from shivsood/revert_29644_master. Authored-by: shivsood <shivsood@microsoft.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-18 18:44:16 -08:00
HyukjinKwon	8469614c05	[SPARK-25694][SQL][FOLLOW-UP] Move 'spark.sql.defaultUrlStreamHandlerFactory.enabled' into StaticSQLConf.scala ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/26530 and proposes to move the configuration `spark.sql.defaultUrlStreamHandlerFactory.enabled` to `StaticSQLConf.scala` for consistency. ### Why are the changes needed? To put the similar configurations together and for readability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested as described in https://github.com/apache/spark/pull/26530. Closes #26570 from HyukjinKwon/SPARK-25694. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-19 09:08:20 +09:00
Kent Yao	ea010a2bc2	[SPARK-29873][SQL][TEST][FOLLOWUP] set operations should not escape when regen golden file with --SET --import both specified ### What changes were proposed in this pull request? When regenerating golden files, the set operations via `--SET` will not be done, but those with --import should be exceptions because we need the set command. ### Why are the changes needed? fix test tool. ### Does this PR introduce any user-facing change? ### How was this patch tested? add ut, but I'm not sure we need these tests for tests itself. cc maropu cloud-fan Closes #26557 from yaooqinn/SPARK-29873. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 01:32:13 +08:00
fuwhu	c32e228689	[SPARK-29859][SQL] ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands ### What changes were proposed in this pull request? Add AlterNamespaceSetLocationStatement, AlterNamespaceSetLocation, AlterNamespaceSetLocationExec to make ALTER DATABASE (SET LOCATION) look up catalog like v2 commands. And also refine the code of AlterNamespaceSetProperties, AlterNamespaceSetPropertiesExec, DescribeNamespace, DescribeNamespaceExec to use SupportsNamespaces instead of CatalogPlugin for catalog parameter. ### Why are the changes needed? It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users. ### Does this PR introduce any user-facing change? Yes, add "ALTER NAMESPACE ... SET LOCATION" whose function is same as "ALTER DATABASE ... SET LOCATION" and "ALTER SCHEMA ... SET LOCATION". ### How was this patch tested? New unit tests Closes #26562 from fuwhu/SPARK-29859. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 20:40:23 +08:00
Kent Yao	50f6d930da	[SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval ### What changes were proposed in this pull request? We now have two different implementations for converting multi-units interval strings to CalendarInterval values. One is used to convert interval string literals to CalendarInterval. This approach re-delegates the interval string to the Spark parser, which handles the string as a `singleInterval` -> `multiUnitsInterval` and eventually calls `IntervalUtils.fromUnitStrings`. The other is used in `Cast`, which eventually calls `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the other. We should unify these two for better performance and simpler logic. This PR uses the 2nd approach. ### Why are the changes needed? We should unify these two for better performance and simpler logic. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Existing UTs should not fail. Closes #26491 from yaooqinn/SPARK-29870. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 15:50:06 +08:00
Kent Yao	5cebe587c7	[SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type ### What changes were proposed in this pull request? Add 3 interval output types which are named as `SQL_STANDARD`, `ISO_8601`, `MULTI_UNITS`. And we add a new conf `spark.sql.dialect.intervalOutputStyle` for this. The `MULTI_UNITS` style displays the interval values in the former behavior and it is the default. The newly added `SQL_STANDARD`, `ISO_8601` styles can be found in the following table. Style \| conf \| Year-Month Interval \| Day-Time Interval \| Mixed Interval -- \| -- \| -- \| -- \| -- Format With Time Unit Designators \| MULTI_UNITS \| 1 year 2 mons \| 1 days 2 hours 3 minutes 4.123456 seconds \| interval 1 days 2 hours 3 minutes 4.123456 seconds SQL STANDARD \| SQL_STANDARD \| 1-2 \| 3 4:05:06 \| -1-2 3 -4:05:06 ISO8601 Basic Format\| ISO_8601\| P1Y2M\| P3DT4H5M6S\|P-1Y-2M3D-4H-5M-6S ### Why are the changes needed? for ANSI SQL support ### Does this PR introduce any user-facing change? yes，interval out now has 3 output styles ### How was this patch tested? add new unit tests cc cloud-fan maropu MaxGekk HyukjinKwon thanks. Closes #26418 from yaooqinn/SPARK-29783. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 15:42:22 +08:00
gschiavon	73912379d0	[SPARK-29020][SQL] Improving array_sort behaviour ### What changes were proposed in this pull request? I've noticed that there are two functions to sort arrays: sort_array and array_sort. sort_array dates from 1.5.0 and can sort in both ascending and descending order; array_sort dates from 2.4.0 and can only sort in ascending order. I added the option to sort in either ascending or descending order using array_sort. I think it would be good to have unified behaviours and not have to use sort_array when you want descending order. If you are new to Spark, you should be able to sort arrays using the newest Spark functions. ### Why are the changes needed? Basically, to be able to sort an array in descending order using array_sort instead of sort_array from 1.5.0 ### Does this PR introduce any user-facing change? Yes, now you are able to sort the array in descending order. Note that it has the same behaviour with nulls as sort_array ### How was this patch tested? Tests added. This is the link to the [jira](https://issues.apache.org/jira/browse/SPARK-29020) Closes #25728 from Gschiavon/improving-array-sort. Lead-authored-by: gschiavon <german.schiavon@lifullconnect.com> Co-authored-by: Takuya UESHIN <ueshin@databricks.com> Co-authored-by: gschiavon <Gschiavon@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-18 16:07:05 +09:00
Zhou Jiang	ee3bd6d768	[SPARK-25694][SQL] Add a config for `URL.setURLStreamHandlerFactory` ### What changes were proposed in this pull request? Add a property `spark.fsUrlStreamHandlerFactory.enabled` to allow users turn off the default registration of `org.apache.hadoop.fs.FsUrlStreamHandlerFactory` ### Why are the changes needed? This [SPARK-25694](https://issues.apache.org/jira/browse/SPARK-25694) is a long-standing issue. Originally, [[SPARK-12868][SQL] Allow adding jars from hdfs](https://github.com/apache/spark/pull/17342 ) added this for better Hive support. However, this have a side-effect when the users use Apache Spark without `-Phive`. This causes exceptions when the users tries to use another custom factories or 3rd party library (trying to set this). This configuration will unblock those non-hive users. ### Does this PR introduce any user-facing change? Yes. This provides a new user-configurable property. By default, the behavior is unchanged. ### How was this patch tested? Manual testing. BEFORE ``` $ build/sbt package $ bin/spark-shell scala> sql("show tables").show +--------+---------+-----------+ \|database\|tableName\|isTemporary\| +--------+---------+-----------+ +--------+---------+-----------+ scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory()) java.lang.Error: factory already defined at java.net.URL.setURLStreamHandlerFactory(URL.java:1134) ... 47 elided ``` AFTER ``` $ build/sbt package $ bin/spark-shell --conf spark.sql.defaultUrlStreamHandlerFactory.enabled=false scala> sql("show tables").show +--------+---------+-----------+ \|database\|tableName\|isTemporary\| +--------+---------+-----------+ +--------+---------+-----------+ scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory()) ``` Closes #26530 from jiangzho/master. 
Lead-authored-by: Zhou Jiang <zhou_jiang@apple.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: zhou-jiang <zhou_jiang@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-11-18 05:44:00 +00:00
xy_xin	d83cacfcf5	[SPARK-29907][SQL] Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte ### What changes were proposed in this pull request? SPARK-27444 introduced `dmlStatementNoWith` so that any DML that needs CTE support can leverage it. It would be better to move the DELETE/UPDATE/MERGE rules to `dmlStatementNoWith`. ### Why are the changes needed? With this change, we can support syntax like "With t AS (SELECT) DELETE FROM xxx", and likewise for UPDATE/MERGE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New cases added. Closes #26536 from xianyinxin/SPARK-29907. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 11:48:56 +08:00
fuwhu	388a737b98	[SPARK-29858][SQL] ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands ### What changes were proposed in this pull request? Add AlterNamespaceSetPropertiesStatement, AlterNamespaceSetProperties and AlterNamespaceSetPropertiesExec to make ALTER DATABASE (SET DBPROPERTIES) command look up catalog like v2 commands. ### Why are the changes needed? It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users. ### Does this PR introduce any user-facing change? Yes, add "ALTER NAMESPACE ... SET (DBPROPERTIES \| PROPERTIES) ..." whose function is same as "ALTER DATABASE ... SET DBPROPERTIES ..." and "ALTER SCHEMA ... SET DBPROPERTIES ...". ### How was this patch tested? New unit test Closes #26551 from fuwhu/SPARK-29858. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-16 19:50:02 -08:00
Maxim Gekk	e88267cb5a	[SPARK-29928][SQL][TESTS] Check parsing timestamps up to microsecond precision by JSON/CSV datasource ### What changes were proposed in this pull request? In the PR, I propose to add tests from the commit `9c7e8be1dc` for Spark 2.4 that check parsing of timestamp strings for various seconds fractions. ### Why are the changes needed? To make sure that current behavior is the same as in Spark 2.4 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`. Closes #26558 from MaxGekk/parse-timestamp-micros-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-16 18:01:25 -08:00
Yuanjian Li	40ea4a11d7	[SPARK-29807][SQL] Rename "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" ### What changes were proposed in this pull request? Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" ### Why are the changes needed? The relation between "spark.sql.ansi.enabled" and "spark.sql.dialect" is confusing, since the "PostgreSQL" dialect should contain the features of "spark.sql.ansi.enabled". To make things clearer, we can rename the "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled", thus the option "spark.sql.dialect.spark.ansi.enabled" is only for Spark dialect. For the casting and arithmetic operations, runtime exceptions should be thrown if "spark.sql.dialect" is "spark" and "spark.sql.dialect.spark.ansi.enabled" is true or "spark.sql.dialect" is PostgresSQL. ### Does this PR introduce any user-facing change? Yes, the config name changed. ### How was this patch tested? Existing UT. Closes #26444 from xuanyuanking/SPARK-29807. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-16 17:46:39 +08:00
Dongjoon Hyun	f77c10de38	[SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+ ### What changes were proposed in this pull request? This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow. Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](https://github.com/apache/arrow/pull/5078)). > #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true". > This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty. ### Why are the changes needed? After ARROW-3191, Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After https://github.com/apache/spark/pull/26133, JDK11 Jenkins job seem to fail. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/ ```scala Previous exception in task: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473) io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243) io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233) io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245) org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222) ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with JDK11. Closes #26552 from dongjoon-hyun/SPARK-ARROW-JDK11. 
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-15 23:58:15 -08:00
Takeshi Yamamuro	6d6b233791	[SPARK-29343][SQL][FOLLOW-UP] Remove floating-point Sum/Average/CentralMomentAgg from order-insensitive aggregates ### What changes were proposed in this pull request? This pr is to remove floating-point `Sum/Average/CentralMomentAgg` from order-insensitive aggregates in `EliminateSorts`. This pr comes from the gatorsmile suggestion: https://github.com/apache/spark/pull/26011#discussion_r344583899 ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `SubquerySuite`. Closes #26534 from maropu/SPARK-29343-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-15 18:54:02 -08:00
fuwhu	16e7195299	[SPARK-29834][SQL] DESC DATABASE should look up catalog like v2 commands ### What changes were proposed in this pull request? Add DescribeNamespaceStatement, DescribeNamespace and DescribeNamespaceExec to make "DESC DATABASE" look up catalog like v2 commands. ### Why are the changes needed? It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users. ### Does this PR introduce any user-facing change? Yes, add "DESC NAMESPACE" whose function is same as "DESC DATABASE" and "DESC SCHEMA". ### How was this patch tested? New unit test Closes #26513 from fuwhu/SPARK-29834. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-15 18:50:42 -08:00
HyukjinKwon	7720781695	[SPARK-29127][SQL][PYTHON] Add a clue for Python related version information in integrated UDF tests ### What changes were proposed in this pull request? This PR proposes to show Python, pandas and PyArrow versions in integrated UDF tests as a clue so when the test cases fail, it show the related version information. I think we don't really need this kind of version information in the test case name for now since I intend that integrated SQL test cases do not target to test different combinations of Python, Pandas and PyArrow. ### Why are the changes needed? To make debug easier. ### Does this PR introduce any user-facing change? It will change test name to include related Python, pandas and PyArrow versions. ### How was this patch tested? Manually tested: ``` [info] - udf/postgreSQL/udf-case.sql - Scala UDF * FAILED * (8 seconds, 229 milliseconds) [info] udf/postgreSQL/udf-case.sql - Scala UDF ... [info] - udf/postgreSQL/udf-case.sql - Regular Python UDF * FAILED * (6 seconds, 298 milliseconds) [info] udf/postgreSQL/udf-case.sql - Regular Python UDF [info] Python: 3.7 ... [info] - udf/postgreSQL/udf-case.sql - Scalar Pandas UDF * FAILED * (6 seconds, 376 milliseconds) [info] udf/postgreSQL/udf-case.sql - Scalar Pandas UDF [info] Python: 3.7 Pandas: 0.25.3 PyArrow: 0.14.0 ``` Closes #26538 from HyukjinKwon/investigate-flaky-test. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-15 18:37:33 -08:00
Pablo Langa	848bdfa218	[SPARK-29829][SQL] SHOW TABLE EXTENDED should do multi-catalog resolution ### What changes were proposed in this pull request? Add ShowTableStatement and make SHOW TABLE EXTENDED go through the same catalog/table resolution framework of v2 commands. We don’t have these methods in the catalog to implement a V2 command - catalog.getPartition - catalog.getTempViewOrPermanentTableMetadata ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users, e.g. ```sql USE my_catalog DESC t // success and describe the table t from my_catalog SHOW TABLE EXTENDED FROM LIKE 't' // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running SHOW TABLE EXTENDED Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26540 from planga82/feature/SPARK-29481_ShowTableExtended. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-15 14:25:33 -08:00
Takeshi Yamamuro	ee4784bf26	[SPARK-26499][SQL][FOLLOW-UP] Replace `update` with `setByte` for ByteType in JdbcUtils.makeGetter ### What changes were proposed in this pull request? This is a follow-up pr to fix the code coming from #23400; it replaces `update` with `setByte` for ByteType in `JdbcUtils.makeGetter`. ### Why are the changes needed? For better code. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26532 from maropu/SPARK-26499-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-15 08:12:41 -06:00
Yuming Wang	4f10e54ba3	[SPARK-29655][SQL] Read bucketed tables obeys spark.sql.shuffle.partitions ### What changes were proposed in this pull request? In order to avoid frequently changing the value of `spark.sql.adaptive.shuffle.maxNumPostShufflePartitions`, we usually set `spark.sql.adaptive.shuffle.maxNumPostShufflePartitions` much larger than `spark.sql.shuffle.partitions` after enabling adaptive execution, which causes some bucket map join lose efficacy and add more `ShuffleExchange`. How to reproduce: ```scala val bucketedTableName = "bucketed_table" spark.range(10000).write.bucketBy(500, "id").sortBy("id").mode(org.apache.spark.sql.SaveMode.Overwrite).saveAsTable(bucketedTableName) val bucketedTable = spark.table(bucketedTableName) val df = spark.range(8) spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) // Spark 2.4. spark.sql.adaptive.enabled=false // We set spark.sql.shuffle.partitions <= 500 every time based on our data in this case. spark.conf.set("spark.sql.shuffle.partitions", 500) bucketedTable.join(df, "id").explain() // Since 3.0. We enabled adaptive execution and set spark.sql.adaptive.shuffle.maxNumPostShufflePartitions to a larger values to fit more cases. 
spark.conf.set("spark.sql.adaptive.enabled", true) spark.conf.set("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions", 1000) bucketedTable.join(df, "id").explain() ``` ``` scala> bucketedTable.join(df, "id").explain() == Physical Plan == (4) Project [id#5L] +- (4) SortMergeJoin [id#5L], [id#7L], Inner :- (1) Sort [id#5L ASC NULLS FIRST], false, 0 : +- (1) Project [id#5L] : +- (1) Filter isnotnull(id#5L) : +- (1) ColumnarToRow : +- FileScan parquet default.bucketed_table[id#5L] Batched: true, DataFilters: [isnotnull(id#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/apache-spark/spark-3.0.0-SNAPSHOT-bin-3.2.0/spark-warehou..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 500 out of 500 +- (3) Sort [id#7L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#7L, 500), true, [id=#49] +- (2) Range (0, 8, step=1, splits=16) ``` vs ``` scala> bucketedTable.join(df, "id").explain() == Physical Plan == AdaptiveSparkPlan(isFinalPlan=false) +- Project [id#5L] +- SortMergeJoin [id#5L], [id#7L], Inner :- Sort [id#5L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#5L, 1000), true, [id=#93] : +- Project [id#5L] : +- Filter isnotnull(id#5L) : +- FileScan parquet default.bucketed_table[id#5L] Batched: true, DataFilters: [isnotnull(id#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/apache-spark/spark-3.0.0-SNAPSHOT-bin-3.2.0/spark-warehou..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 500 out of 500 +- Sort [id#7L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#7L, 1000), true, [id=#92] +- Range (0, 8, step=1, splits=16) ``` This PR makes read bucketed tables always obeys `spark.sql.shuffle.partitions` even enabling adaptive execution and set `spark.sql.adaptive.shuffle.maxNumPostShufflePartitions` to avoid add more `ShuffleExchange`. ### Why are the changes needed? 
Do not degrade performance after enabling adaptive execution. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #26409 from wangyum/SPARK-29655. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-15 15:49:24 +08:00
Bryan Cutler	65a189c7a1	[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 ### What changes were proposed in this pull request? Upgrade Apache Arrow to version 0.15.1. This includes Java artifacts and increases the minimum required version of PyArrow also. Version 0.12.0 to 0.15.1 includes the following selected fixes/improvements relevant to Spark users: * ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes * ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype * ARROW-5579 - [Java] shade flatbuffer dependency * ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount * ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits * ARROW-5893 - [C++] Remove arrow::Column class from C++ library * ARROW-5970 - [Java] Provide pointer to Arrow buffer * ARROW-6070 - [Java] Avoid creating new schema before IPC sending * ARROW-6279 - [Python] Add Table.slice method or allow slices in \_\_getitem\_\_ * ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files. * ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table * ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime * ARROW-1261 - [Java] Add container type for Map logical type * ARROW-1207 - [C++] Implement Map logical type Changelog can be seen at https://arrow.apache.org/release/0.15.0.html ### Why are the changes needed? Upgrade to get bug fixes, improvements, and maintain compatibility with future versions of PyArrow. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests, manually tested with Python 3.7, 3.8 Closes #26133 from BryanCutler/arrow-upgrade-015-SPARK-29376. 
Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-15 13:27:30 +09:00
Wenchen Fan	bb8b04d4a2	[SPARK-29889][SQL][TEST] unify the interval tests ### What changes were proposed in this pull request? move interval tests to `interval.sql`, and import it to `ansi/interval.sql` ### Why are the changes needed? improve test coverage ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #26515 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-15 10:38:51 +08:00
HyukjinKwon	17321782de	[SPARK-26923][R][SQL][FOLLOW-UP] Show stderr in the exception whenever possible in RRunner ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/23977 I made a mistake related to this line: `3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL112)` Previously, 1. the reader iterator for R worker read some initial data eagerly during RDD materialization. So it read the data before actual execution. For some reasons, in this case, it showed standard error from R worker. 2. After that, when error happens during actual execution, stderr wasn't shown: `3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL260)` After my change `3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL112)`, it now ignores 1. case and only does 2. of previous code path, because 1. does not happen anymore as I avoided to such eager execution (which is consistent with PySpark code path). This PR proposes to do only 1. before/after execution always because It is pretty much possible R worker was failed during actual execution and it's best to show the stderr from R worker whenever possible. ### Why are the changes needed? It currently swallows standard error from R worker which makes debugging harder. ### Does this PR introduce any user-facing change? 
Yes, ```R df <- createDataFrame(list(list(n=1))) collect(dapply(df, function(x) { stop("asdkjasdjkbadskjbsdajbk") x }, structType("a double"))) ``` Before: ``` Error in handleErrors(returnStatus, conn) : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, 192.168.35.193, executor driver): org.apache.spark.SparkException: R worker exited unexpectedly (cranshed) at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:130) at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:118) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337) at org.apache.spark. ``` After: ``` Error in handleErrors(returnStatus, conn) : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 192.168.35.193, executor driver): org.apache.spark.SparkException: R unexpectedly exited. 
R worker produced errors: Error in computeFunc(inputData) : asdkjasdjkbadskjbsdajbk at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144) at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:128) at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegen ``` ### How was this patch tested? Manually tested and unittest was added. Closes #26517 from HyukjinKwon/SPARK-26923-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-15 11:13:36 +09:00
Terry Kim	e46e487b08	[SPARK-29682][SQL] Resolve conflicting attributes in Expand correctly ### What changes were proposed in this pull request? This PR addresses issues where conflicting attributes in `Expand` are not correctly handled. ### Why are the changes needed? ```Scala val numsDF = Seq(1, 2, 3, 4, 5, 6).toDF("nums") val cubeDF = numsDF.cube("nums").agg(max(lit(0)).as("agcol")) cubeDF.join(cubeDF, "nums").show ``` fails with the following exception: ``` org.apache.spark.sql.AnalysisException: Failure when resolving conflicting references in Join: 'Join Inner :- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#35] : +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36] : +- Project [nums#3, nums#3 AS nums#37] : +- Project [value#1 AS nums#3] : +- LocalRelation [value#1] +- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#58] +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36] ^^^^^^^ +- Project [nums#3, nums#3 AS nums#37] +- Project [value#1 AS nums#3] +- LocalRelation [value#1] Conflicting attributes: nums#38 ``` As you can see from the above plan, `nums#38`, the output of `Expand` on the right side of `Join`, should have been handled to produce a new attribute. Since the conflict is not resolved in `Expand`, the failure is happening upstream at `Aggregate`. This PR addresses handling conflicting attributes in `Expand`. ### Does this PR introduce any user-facing change? Yes, the previous example now shows the following output: ``` +----+-----+-----+ \|nums\|agcol\|agcol\| +----+-----+-----+ \| 1\| 0\| 0\| \| 6\| 0\| 0\| \| 4\| 0\| 0\| \| 2\| 0\| 0\| \| 5\| 0\| 0\| \| 3\| 0\| 0\| +----+-----+-----+ ``` ### How was this patch tested? Added new unit test. Closes #26441 from imback82/spark-29682. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-14 14:47:14 +08:00
Takeshi Yamamuro	b5a02d37e6	[SPARK-29873][SQL][TESTS] Support `--import` directive to load queries from another test case in SQLQueryTestSuite ### What changes were proposed in this pull request? This pr is to support `--import` directive to load queries from another test case in SQLQueryTestSuite. This fix comes from the cloud-fan suggestion in https://github.com/apache/spark/pull/26479#discussion_r345086978 ### Why are the changes needed? This functionality might reduce duplicate test code in `SQLQueryTestSuite`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Run `SQLQueryTestSuite`. Closes #26497 from maropu/ImportTests. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-14 14:38:27 +08:00
wuyi	fe1f456b20	[SPARK-29837][SQL] PostgreSQL dialect: cast to boolean ### What changes were proposed in this pull request? Make SparkSQL's `cast to boolean` behavior consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. ### Why are the changes needed? By default, SparkSQL and PostgreSQL have quite different cast behavior between types. We should make SparkSQL's cast behavior consistent with PostgreSQL when `spark.sql.dialect` is configured as PostgreSQL. ### Does this PR introduce any user-facing change? Yes. If users switch to the PostgreSQL dialect now, they will * get an exception if they input an invalid string, e.g. "erut", while they got `null` before; * get an exception if they input `TimestampType`, `DateType`, `LongType`, `ShortType`, `ByteType`, `DecimalType`, `DoubleType`, `FloatType` values, while they got a `true` or `false` result before. And here is the evidence for those unsupported types from PostgreSQL: timestamp: ``` postgres=# select cast(cast('2019-11-11' as timestamp) as boolean); ERROR: cannot cast type timestamp without time zone to boolean ``` date: ``` postgres=# select cast(cast('2019-11-11' as date) as boolean); ERROR: cannot cast type date to boolean ``` bigint: ``` postgres=# select cast(cast('20191111' as bigint) as boolean); ERROR: cannot cast type bigint to boolean ``` smallint: ``` postgres=# select cast(cast(2019 as smallint) as boolean); ERROR: cannot cast type smallint to boolean ``` bytea: ``` postgres=# select cast(cast('2019' as bytea) as boolean); ERROR: cannot cast type bytea to boolean ``` decimal: ``` postgres=# select cast(cast('2019' as decimal) as boolean); ERROR: cannot cast type numeric to boolean ``` float: ``` postgres=# select cast(cast('2019' as float) as boolean); ERROR: cannot cast type double precision to boolean ``` ### How was this patch tested? Added and tested manually. Closes #26463 from Ngone51/dev-postgre-cast2bool. 
Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-14 11:55:01 +08:00
Liang-Chi Hsieh	39596b913b	[SPARK-29649][SQL] Stop task set if FileAlreadyExistsException was thrown when writing to output file ### What changes were proposed in this pull request? We already know task attempts that do not clean up output files in staging directory can cause job failure (SPARK-27194). There were proposals trying to fix it by changing output filename, or deleting existing output files. These proposals are not completely reliable. The difficulty is, as previous failed task attempt wrote the output file, at next task attempt the output file is still under same staging directory, even the output file name is different. If the job is going to fail eventually, there is no point in re-running the task until max attempts are reached. For long-running jobs, re-running the task can waste a lot of time. This patch proposes to let Spark detect such a file-already-exists exception and stop the task set early. ### Why are the changes needed? For now, if FileAlreadyExistsException is thrown during data writing job in SQL, the job will continue re-running task attempts until max failure number is reached. There is no point in re-running tasks, as the task attempts will also fail because they cannot write to the existing file either. We should stop the task set early. ### Does this PR introduce any user-facing change? Yes. If FileAlreadyExistsException is thrown during data writing job in SQL, no more task attempts are re-tried and the task set will be stopped early. ### How was this patch tested? Unit test. Closes #26312 from viirya/stop-taskset-if-outputfile-exists. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-13 18:01:38 -08:00
shivsood	32d44b1d0e	[SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils ### What changes were proposed in this pull request? Corrected ShortType and ByteType mapping to SmallInt and TinyInt, corrected setter methods to set ShortType and ByteType as setShort() and setByte(). Changes in JDBCUtils.scala Fixed unit test cases where applicable and added new E2E test cases to test table read/write using ShortType and ByteType. #### Problems - In master in JDBCUtils.scala line number 547 and 551 have a problem where ShortType and ByteType are set as Integers rather than set as Short and Byte respectively. ``` case ShortType => (stmt: PreparedStatement, row: Row, pos: Int) => stmt.setInt(pos + 1, row.getShort(pos)) The issue was pointed out by maropu case ByteType => (stmt: PreparedStatement, row: Row, pos: Int) => stmt.setInt(pos + 1, row.getByte(pos)) ``` - Also at line JDBCUtils.scala 247 TinyInt is interpreted wrongly as IntegerType in getCatalystType() ``` case java.sql.Types.TINYINT => IntegerType ``` - At line 172 ShortType was wrongly interpreted as IntegerType ``` case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT)) ``` - Throughout the tests, ShortType and ByteType were being interpreted as IntegerTypes. ### Why are the changes needed? A given type should be set using the right type. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Corrected Unit test cases where applicable. Validated in CI/CD Added a test case in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala , MySQLIntegrationSuite.scala to write/read tables from dataframe with cols as shorttype and bytetype. Validated by manual as follows. ``` ./build/mvn install -DskipTests ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 ``` Closes #26301 from shivsood/shorttype_fix_maropu. 
Authored-by: shivsood <shivsood@microsoft.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-13 17:56:13 -08:00
Wesley Hoffman	39b502af17	[SPARK-29778][SQL] pass writer options to saveAsTable in append mode ### What changes were proposed in this pull request? `saveAsTable` had an oversight where write options were not considered in the append save mode. ### Why are the changes needed? Address the bug so that write options can be considered during appends. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test added that looks in the logical plan of `AppendData` for the existing write options. Closes #26474 from SpaceRangerWes/master. Authored-by: Wesley Hoffman <wesleyhoffman109@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-13 14:10:30 -08:00
Burak Yavuz	363af16c72	[SPARK-29568][SS] Stop existing running streams when a new stream is launched ### What changes were proposed in this pull request? This PR adds a SQL Conf: `spark.sql.streaming.stopActiveRunOnRestart`. When this conf is `true` (by default it is), an already running stream will be stopped, if a new copy gets launched on the same checkpoint location. ### Why are the changes needed? In multi-tenant environments where you have multiple SparkSessions, you can accidentally start multiple copies of the same stream (i.e. streams using the same checkpoint location). This will cause all new instantiations of the new stream to fail. However, sometimes you may want to turn off the old stream, as the old stream may have turned into a zombie (you no longer have access to the query handle or SparkSession). It would be nice to have a SQL flag that allows the stopping of the old stream for such zombie cases. ### Does this PR introduce any user-facing change? Yes. Now by default, if you launch a new copy of an already running stream on a multi-tenant cluster, the existing stream will be stopped. ### How was this patch tested? Unit tests in StreamingQueryManagerSuite Closes #26225 from brkyvz/stopStream. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-11-13 08:59:46 -08:00
Wenchen Fan	4dcbdcd265	[SPARK-29863][SQL] Rename EveryAgg/AnyAgg to BoolAnd/BoolOr ### What changes were proposed in this pull request? rename EveryAgg/AnyAgg to BoolAnd/BoolOr ### Why are the changes needed? Under ansi mode, `every`, `any` and `some` are reserved keywords and can't be used as function names. `EveryAgg`/`AnyAgg` has several aliases and I think it's better to not pick reserved keywords as the primary name. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26486 from cloud-fan/naming. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 21:42:42 +08:00
Wenchen Fan	942753a44b	[SPARK-29753][SQL] refine the default catalog config ### What changes were proposed in this pull request? rename the config to address the comment: https://github.com/apache/spark/pull/24594#discussion_r285431212 improve the config description, provide a default value to simplify the code. ### Why are the changes needed? make the config more understandable. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26395 from cloud-fan/config. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 21:27:36 +08:00
xy_xin	d7bdc6aa17	[SPARK-29835][SQL] Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE ### What changes were proposed in this pull request? The current parse and analyze flow for DELETE is: 1, the SQL string is first parsed to `DeleteFromStatement`; 2, the `DeleteFromStatement` is converted to `DeleteFromTable`. However, the SQL string can be parsed to `DeleteFromTable` directly, so `DeleteFromStatement` seems to be redundant. It is the same for UPDATE. This pr removes the unnecessary `DeleteFromStatement` and `UpdateTableStatement`. ### Why are the changes needed? This makes the code for DELETE and UPDATE cleaner and keeps it aligned with MERGE INTO. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests and new tests. Closes #26464 from xianyinxin/SPARK-29835. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 20:53:12 +08:00
Terry Kim	b5a2ed6a37	[SPARK-29851][SQL] V2 catalog: Change default behavior of dropping namespace to cascade ### What changes were proposed in this pull request? Currently, `SupportsNamespaces.dropNamespace` drops a namespace only if it is empty. Thus, to implement a cascading drop, one needs to iterate all objects (tables, views, etc.) within the namespace (including its sub-namespaces recursively) and drop them one by one. This can have a negative impact on the performance when there are a large number of objects. Instead, this PR proposes to change the default behavior of dropping a namespace to cascading such that implementing cascading/non-cascading drop is simpler without performance penalties. ### Why are the changes needed? The new behavior makes implementing cascading/non-cascading drop simple without performance penalties. ### Does this PR introduce any user-facing change? Yes. The default behavior of `SupportsNamespaces.dropNamespace` is now cascading. ### How was this patch tested? Added new unit tests. Closes #26476 from imback82/drop_ns_cascade. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 17:06:27 +08:00
Kent Yao	f926809a1f	[SPARK-29390][SQL] Add the justify_days(), justify_hours() and justify_interval() functions ### What changes were proposed in this pull request? Add 3 interval functions justify_days, justify_hours, justify_interval to support justify interval values ### Why are the changes needed? For feature parity with postgres add three interval functions to justify interval values. justify_days(interval) \| interval \| Adjust interval so 30-day time periods are represented as months \| justify_days(interval '35 days') \| 1 mon 5 days -- \| -- \| -- \| -- \| -- justify_hours(interval) \| interval \| Adjust interval so 24-hour time periods are represented as days \| justify_hours(interval '27 hours') \| 1 day 03:00:00 justify_interval(interval) \| interval \| Adjust interval using justify_days and justify_hours, with additional sign adjustments \| justify_interval(interval '1 mon -1 hour') \| 29 days 23:00:00 ### Does this PR introduce any user-facing change? yes. new interval functions are added ### How was this patch tested? add ut Closes #26465 from yaooqinn/SPARK-29390. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-11-13 15:04:39 +09:00
HyukjinKwon	80fbc382a6	Revert "[SPARK-29462] The data type of "array()" should be array<null>" This reverts commit `0dcd739534`.	2019-11-13 13:12:20 +09:00
angerszhu	eb79af8dae	[SPARK-29145][SQL][FOLLOW-UP] Move tests from `SubquerySuite` to `subquery/in-subquery/in-joins.sql` ### What changes were proposed in this pull request? Follow comment of https://github.com/apache/spark/pull/25854#discussion_r342383272 ### Why are the changes needed? NO ### Does this PR introduce any user-facing change? NO ### How was this patch tested? ADD TEST CASE Closes #26406 from AngersZhuuuu/SPARK-29145-FOLLOWUP. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-12 17:34:03 -08:00
Ankitraj	45e212e161	[SPARK-29570][WEBUI] Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns ### What changes were proposed in this pull request? All tooltip messages will be displayed in the centre. ### Why are the changes needed? Sometimes tooltips hide the column data, and the tooltip display position is inconsistent in the UI. ### Does this PR introduce any user-facing change? yes. ![Screenshot 2019-10-26 at 3 08 51 AM](https://user-images.githubusercontent.com/8948111/67606124-04dd0d80-f79e-11e9-865a-b7e9bffc9890.png) ### How was this patch tested? Manual test. Closes #26263 from 07ARB/SPARK-29570. Lead-authored-by: Ankitraj <8948111+07ARB@users.noreply.github.com> Co-authored-by: 07ARB <ankitrajboudh@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-12 18:49:54 -06:00
Wenchen Fan	030e5d987e	[SPARK-29789][SQL] should not parse the bucket column name when creating v2 tables ### What changes were proposed in this pull request? When creating v2 expressions, we have public java APIs, as well as internal scala APIs. All of these APIs take a string column name and parse it to `NamedReference`. This is convenient for end-users, but not for internal development. For example, the query plan already contains the parsed partition/bucket column names, and it's tricky if we need to quote the names before creating v2 expressions. This PR proposes to change the internal scala APIs to take `NamedReference` directly, with a new method to create `NamedReference` with the exact name parts. The public java APIs are not changed. ### Why are the changes needed? fix a bug, and make it easier to create v2 expressions correctly in the future. ### Does this PR introduce any user-facing change? yes, now v2 CREATE TABLE works as expected. ### How was this patch tested? a new test Closes #26425 from cloud-fan/extract. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Ryan Blue <blue@apache.org>	2019-11-12 12:25:45 -08:00
Wenchen Fan	414cade011	[SPARK-29850][SQL] sort-merge-join an empty table should not memory leak ### What changes were proposed in this pull request? When whole stage codegen `HashAggregateExec`, create the hash map when we begin to process inputs. ### Why are the changes needed? Sort-merge join completes directly if the left side table is empty. If there is an aggregate in the right side, the aggregate will not be triggered at all, but its hash map is created during codegen and can't be released. ### Does this PR introduce any user-facing change? No ### How was this patch tested? a new test Closes #26471 from cloud-fan/memory. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-13 01:00:30 +08:00
Kent Yao	d99398e9f5	[SPARK-29855][SQL] typed literals with negative sign with proper result or exception ### What changes were proposed in this pull request? ```sql -- !query 83 select -integer '7' -- !query 83 schema struct<7:int> -- !query 83 output 7 -- !query 86 select -date '1999-01-01' -- !query 86 schema struct<DATE '1999-01-01':date> -- !query 86 output 1999-01-01 -- !query 87 select -timestamp '1999-01-01' -- !query 87 schema struct<TIMESTAMP('1999-01-01 00:00:00'):timestamp> -- !query 87 output 1999-01-01 00:00:00 ``` the integer should be -7 and the date and timestamp results are confusing which should throw exceptions ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? NO ### How was this patch tested? ADD UTs Closes #26479 from yaooqinn/SPARK-29855. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-12 23:53:07 +09:00
Pablo Langa	37e387a22d	[SPARK-29519][SQL] SHOW TBLPROPERTIES should do multi-catalog resolution ### What changes were proposed in this pull request? Add ShowTablePropertiesStatement and make SHOW TBLPROPERTIES go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. USE my_catalog DESC t // success and describe the table t from my_catalog SHOW TBLPROPERTIES t // report table not found as there is no table t in the session catalog ### Does this PR introduce any user-facing change? yes. When running SHOW TBLPROPERTIES Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26176 from planga82/feature/SPARK-29519_SHOW_TBLPROPERTIES_datasourceV2. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-12 13:31:28 +08:00
Jungtaek Lim (HeartSaVioR)	c941362cb9	[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows ### What changes were proposed in this pull request? This patch fixes the edge case of streaming left/right outer join described below: Suppose query is provided as `select * from A join B on A.id = B.id AND (A.ts <= B.ts AND B.ts <= A.ts + interval 5 seconds)` and there're two rows for L1 (from A) and R1 (from B) which ensures L1.id = R1.id and L1.ts = R1.ts. (we can simply imagine it from self-join) Then Spark processes L1 and R1 as below: - row L1 and row R1 are joined at batch 1 - row R1 is evicted at batch 2 due to join and watermark condition, whereas row L1 is not evicted - row L1 is evicted at batch 3 due to join and watermark condition When determining outer rows to match with null, Spark applies some assumption commented in codebase, as below: ``` Checking whether the current row matches a key in the right side state, and that key has any value which satisfies the filter function when joined. If it doesn't, we know we can join with null, since there was never (including this batch) a match within the watermark period. If it does, there must have been a match at some point, so we know we can't join with null. ``` But as explained the edge-case earlier, the assumption is not correct. As we don't have any good assumption to optimize which doesn't have edge-case, we have to track whether such row is matched with others before, and match with null row only when the row is not matched. To track the matching of row, the patch adds a new state to streaming join state manager, and mark whether the row is matched to others or not. We leverage the information when dealing with eviction of rows which would be candidates to match with null rows. 
This approach introduces new state format which is not compatible with old state format - queries with old state format will be still running but they will still have the issue and be required to discard checkpoint and rerun to take this patch in effect. ### Why are the changes needed? This patch fixes a correctness issue. ### Does this PR introduce any user-facing change? No for compatibility viewpoint, but we'll encourage end users to discard the old checkpoint and rerun the query if they run stream-stream outer join query with old checkpoint, which might be "yes" for the question. ### How was this patch tested? Added UT which fails on current Spark and passes with this patch. Also passed existing streaming join UTs. Closes #26108 from HeartSaVioR/SPARK-26154-shorten-alternative. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-11-11 15:47:17 -08:00
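The matched-flag idea described in SPARK-26154 can be sketched as follows. This is only an illustrative model, not Spark's state manager: `StoredRow`, `mark_matched`, and `evict_left_outer` are hypothetical names, and the state store is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StoredRow:
    """A buffered left-side row plus a flag recording whether it ever joined."""
    row: dict
    matched: bool = False


def mark_matched(state: Dict[str, StoredRow], key: str) -> None:
    # Called when a left row is joined with a right row in some batch.
    state[key].matched = True


def evict_left_outer(
    state: Dict[str, StoredRow], evictable_keys: List[str]
) -> List[Tuple[dict, Optional[dict]]]:
    """Evict rows and emit (row, None) only for rows that never matched."""
    output = []
    for key in evictable_keys:
        stored = state.pop(key)
        if not stored.matched:
            # No match was ever seen, so the outer join pads with null.
            output.append((stored.row, None))
    return output


# Batch 1: L1 joins R1, so L1 is flagged. L2 never finds a partner.
left_state = {"L1": StoredRow({"id": 1, "ts": 10}), "L2": StoredRow({"id": 2, "ts": 10})}
mark_matched(left_state, "L1")
# Batch 3: both rows are evicted; only L2 yields a null-padded row.
evicted = evict_left_outer(left_state, ["L1", "L2"])
```

Keeping the flag next to the row means the eviction decision no longer depends on whether the matching right-side row is still present in state.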
Marcelo Vanzin	9753a8e330	[SPARK-29766][SQL] Do metrics aggregation asynchronously in SQL listener This unblocks the event handling thread, which should help avoid dropped events when large queries are running. Existing unit tests should already cover this code. Closes #26405 from vanzin/SPARK-29766. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-11 14:20:34 -08:00
Takeshi Yamamuro	cceb2d6f11	[SPARK-29825][SQL][TESTS] Add join-related configs in `inner-join.sql` and `postgreSQL/join.sql` ### What changes were proposed in this pull request? For better test coverage, this pr is to add join-related configs in `inner-join.sql` and `postgreSQL/join.sql`. These join related configs were just copied from ones in the other join-related tests in `SQLQueryTestSuite` (e.g., https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/natural-join.sql#L2-L4). ### Why are the changes needed? Better test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26459 from maropu/AddJoinConds. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-11 10:21:33 -08:00
Kent Yao	d06a9cc4bd	[SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values ### What changes were proposed in this pull request? With the latest string to literal optimization https://github.com/apache/spark/pull/26256, some interval strings can not be cast when there are some spaces between signs and unit values. After state `PARSE_SIGN`, it directly goes to `PARSE_UNIT_VALUE` when takes a space character as the end. So when there are some white spaces come before the real unit value, it fails to parse, we should add a new state like `TRIM_VALUE` to trim all these spaces. How to re-produce, which aim the revisions since https://github.com/apache/spark/pull/26256 is merged ```sql select cast(v as interval) from values ('+ 1 second') t(v); select cast(v as interval) from values ('- 1 second') t(v); ``` ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? 1. ut 2. new benchmark test Closes #26449 from yaooqinn/SPARK-29605. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-11 21:53:33 +08:00
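The parsing fix in SPARK-29822 can be pictured with a toy state machine. This is a sketch under assumptions, not the code in `IntervalUtils`: it parses only a signed integer, and the state name `TRIM_VALUE` is borrowed from the commit's suggestion rather than confirmed from the source.

```python
def parse_signed_value(text: str) -> int:
    """Parse strings such as '+ 1' or '-   12', skipping spaces after the sign."""
    state = "PARSE_SIGN"
    sign = 1
    digits = ""
    for ch in text:
        if state == "PARSE_SIGN":
            if ch == "+":
                state = "TRIM_VALUE"
            elif ch == "-":
                sign = -1
                state = "TRIM_VALUE"
            elif ch.isdigit():
                digits = ch
                state = "PARSE_UNIT_VALUE"
            else:
                raise ValueError(f"unexpected character {ch!r} while parsing sign")
        elif state == "TRIM_VALUE":
            # The extra state: consume whitespace between the sign and the value.
            if ch.isspace():
                continue
            if ch.isdigit():
                digits = ch
                state = "PARSE_UNIT_VALUE"
            else:
                raise ValueError(f"unexpected character {ch!r} before value")
        else:  # PARSE_UNIT_VALUE
            if ch.isdigit():
                digits += ch
            else:
                raise ValueError(f"unexpected character {ch!r} in value")
    if not digits:
        raise ValueError("no value found")
    return sign * int(digits)
```

Without the intermediate state, the first space after the sign would be handed to the value parser and rejected, which mirrors the failure described in the commit.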
lajin	4de7131cff	[SPARK-29421][SQL] Supporting Create Table Like Using Provider ### What changes were proposed in this pull request? Hive support STORED AS new file format syntax: ```sql CREATE TABLE tbl(a int) STORED AS TEXTFILE; CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; ``` We add a similar syntax for Spark. Here we separate to two features: 1. specify a different table provider in CREATE TABLE LIKE 2. Hive compatibility In this PR, we address the first one: - [ ] Using `USING provider` to specify a different table provider in CREATE TABLE LIKE. - [ ] Using `STORED AS file_format` in CREATE TABLE LIKE to address Hive compatibility. ### Why are the changes needed? Use CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on the definition of table tb2. The most user case is to create tb1 with the same schema of tb2. But an inconvenient case here is this command also copies the FileFormat from tb2, it cannot change the input/output format and serde. Add the ability of changing file format is useful for some scenarios like upgrading a table from a low performance file format to a high performance one (parquet, orc). ### Does this PR introduce any user-facing change? Add a new syntax based on current CTL: ```sql CREATE TABLE tbl2 LIKE tbl [USING parquet]; ``` ### How was this patch tested? Modify some exist UTs. Closes #26097 from LantaoJin/SPARK-29421. Authored-by: lajin <lajin@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-11 15:25:56 +08:00
Maxim Gekk	18440151b0	[SPARK-29393][SQL] Add `make_interval` function ### What changes were proposed in this pull request? In the PR, I propose new expression `MakeInterval` and register it as the function `make_interval`. The function accepts the following parameters: - `years` - the number of years in the interval, positive or negative. The parameter is multiplied by 12, and added to interval's `months`. - `months` - the number of months in the interval, positive or negative. - `weeks` - the number of weeks in the interval, positive or negative. The parameter is multiplied by 7, and added to interval's `days`. - `hours`, `mins` - the number of hours and minutes. The parameters can be negative or positive. They are converted to microseconds and added to interval's `microseconds`. - `seconds` - the number of seconds with the fractional part in microseconds precision. It is converted to microseconds, and added to total interval's `microseconds` as `hours` and `minutes`. For example: ```sql spark-sql> select make_interval(2019, 11, 1, 1, 12, 30, 01.001001); 2019 years 11 months 8 days 12 hours 30 minutes 1.001001 seconds ``` ### Why are the changes needed? - To improve user experience with Spark SQL, and allow users to make `INTERVAL` columns from other columns containing `years`, `months` ... `seconds`. Currently, users can make an `INTERVAL` column from other columns only by constructing a `STRING` column and casting it to `INTERVAL`. Have a look at the `IntervalBenchmark` as an example. - To maintain feature parity with PostgreSQL which provides such function: ```sql # SELECT make_interval(2019, 11); make_interval -------------------- 2019 years 11 mons ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By new tests for the `MakeInterval` expression to `IntervalExpressionsSuite` - By tests in `interval.sql` Closes #26446 from MaxGekk/make_interval. 
Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-10 14:34:52 -08:00
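The field arithmetic described for `make_interval` in SPARK-29393 can be checked with a small model. This is a sketch, not the `MakeInterval` expression itself: the argument order (years, months, weeks, days, hours, mins, secs) is inferred from the seven-argument example and its output, and the `Interval` tuple is a hypothetical stand-in for the months/days/microseconds representation.

```python
from typing import NamedTuple

MONTHS_PER_YEAR = 12
DAYS_PER_WEEK = 7
MICROS_PER_SECOND = 1_000_000
MICROS_PER_MINUTE = 60 * MICROS_PER_SECOND
MICROS_PER_HOUR = 60 * MICROS_PER_MINUTE


class Interval(NamedTuple):
    months: int
    days: int
    microseconds: int


def make_interval(years=0, months=0, weeks=0, days=0, hours=0, mins=0, secs=0.0):
    """Fold the seven inputs into months, days, and microseconds."""
    total_months = years * MONTHS_PER_YEAR + months
    total_days = weeks * DAYS_PER_WEEK + days
    total_micros = (
        hours * MICROS_PER_HOUR
        + mins * MICROS_PER_MINUTE
        + round(secs * MICROS_PER_SECOND)  # seconds carry microsecond precision
    )
    return Interval(total_months, total_days, total_micros)


# The example from the commit message: 1 week + 1 day becomes 8 days.
example = make_interval(2019, 11, 1, 1, 12, 30, 1.001001)
```

Here `example.days` is 8, matching the "8 days" in the commit's sample output, and `example.months` is 2019 * 12 + 11.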
Maxim Gekk	d4de01f567	[SPARK-29408][SQL] Support `-` before `interval` in interval literals ### What changes were proposed in this pull request? - `SqlBase.g4` is modified to support a negative sign `-` in the interval type constructor from a string and in interval literals - Negate interval in `AstBuilder` if a sign presents. - Interval related SQL statements are moved from `inputs/datetime.sql` to new file `inputs/interval.sql` For example: ```sql spark-sql> select -interval '-1 month 1 day -1 second'; 1 months -1 days 1 seconds spark-sql> select -interval -1 month 1 day -1 second; 1 months -1 days 1 seconds ``` ### Why are the changes needed? For feature parity with PostgreSQL which supports that: ```sql # select -interval '-1 month 1 day -1 second'; ?column? ------------------------- 1 mon -1 days +00:00:01 (1 row) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added tests to `ExpressionParserSuite` - by `interval.sql` Closes #26438 from MaxGekk/negative-interval. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-10 10:10:04 -08:00
Huaxin Gao	57b954e825	[SPARK-29730][SQL] ALTER VIEW QUERY should look up catalog/table like v2 commands Add AlterViewAsStatement and make ALTER VIEW ... QUERY go through the same catalog/table resolution framework of v2 commands. It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC v // success and describe the view v from my_catalog ALTER VIEW v SELECT 1 // report view not found as there is no view v in the session catalog ``` Yes. When running ALTER VIEW ... QUERY, Spark fails the command if the current catalog is set to a v2 catalog, or the view name specified a v2 catalog. unit tests Closes #26453 from huaxingao/spark-29730. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-09 17:06:09 -08:00
xy_xin	7cfd589868	[SPARK-28893][SQL] Support MERGE INTO in the parser and add the corresponding logical plan ### What changes were proposed in this pull request? This PR supports MERGE INTO in the parser and adds the corresponding logical plan. The SQL syntax looks like: ``` MERGE INTO [ds_catalog.][multi_part_namespaces.]target_table [AS target_alias] USING [ds_catalog.][multi_part_namespaces.]source_table | subquery [AS source_alias] ON <merge_condition> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ] [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ] [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ] ``` where ``` <matched_action> = DELETE | UPDATE SET * | UPDATE SET column1 = value1 [, column2 = value2 ...] <not_matched_action> = INSERT * | INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...]) ``` ### Why are the changes needed? This is the initial work to introduce `MERGE INTO` support for the built-in data source, and the design work for `MERGE INTO` support in DSV2. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New test cases. Closes #26167 from xianyinxin/SPARK-28893. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-09 11:45:24 +08:00
Liang-Chi Hsieh	70987d8144	[SPARK-29680][SQL][FOLLOWUP] Replace qualifiedName with multipartIdentifier ### What changes were proposed in this pull request? Replace qualifiedName with multipartIdentifier in parser rules of DDL commands. ### Why are the changes needed? There are identifiers in some DDL rules we use `qualifiedName`. We should use `multipartIdentifier` because it can capture wrong identifiers such as `test-table`, `test-col`. ### Does this PR introduce any user-facing change? Yes. Wrong identifiers such as test-table, will be captured now after this change. ### How was this patch tested? Unit tests. Closes #26419 from viirya/SPARK-29680-followup2. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-11-08 14:18:06 -08:00
Kent Yao	e026412d9c	[SPARK-29679][SQL] Make interval type comparable and orderable ### What changes were proposed in this pull request? interval type support >, >=, <, <=, =, <=>, order by, min,max.. ### Why are the changes needed? Part of SPARK-27764 Feature Parity between PostgreSQL and Spark ### Does this PR introduce any user-facing change? yes, we now support compare intervals ### How was this patch tested? add ut Closes #26337 from yaooqinn/SPARK-29679. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 22:45:11 +08:00
Kent Yao	e7f7990bc3	[SPARK-29688][SQL] Support average for interval type values ### What changes were proposed in this pull request? avg aggregate support interval type values ### Why are the changes needed? Part of SPARK-27764 Feature Parity between PostgreSQL and Spark ### Does this PR introduce any user-facing change? yes, we can do avg on intervals ### How was this patch tested? add ut Closes #26347 from yaooqinn/SPARK-29688. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 21:55:07 +08:00
ulysses	7759f7179c	[SPARK-29772][TESTS][SQL] Add withNamespace in SQLTestUtils ### What changes were proposed in this pull request? V2 catalog support namespace, we should add `withNamespace` like `withDatabase`. ### Why are the changes needed? Make test easy. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add UT. Closes #26411 from ulysses-you/Add-test-with-namespace. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-08 11:53:44 +08:00
Kent Yao	9562b26914	[SPARK-29757][SQL] Move calendar interval constants together ### What changes were proposed in this pull request? ```java public static final int YEARS_PER_DECADE = 10; public static final int YEARS_PER_CENTURY = 100; public static final int YEARS_PER_MILLENNIUM = 1000; public static final byte MONTHS_PER_QUARTER = 3; public static final int MONTHS_PER_YEAR = 12; public static final byte DAYS_PER_WEEK = 7; public static final long DAYS_PER_MONTH = 30L; public static final long HOURS_PER_DAY = 24L; public static final long MINUTES_PER_HOUR = 60L; public static final long SECONDS_PER_MINUTE = 60L; public static final long SECONDS_PER_HOUR = MINUTES_PER_HOUR * SECONDS_PER_MINUTE; public static final long SECONDS_PER_DAY = HOURS_PER_DAY * SECONDS_PER_HOUR; public static final long MILLIS_PER_SECOND = 1000L; public static final long MILLIS_PER_MINUTE = SECONDS_PER_MINUTE * MILLIS_PER_SECOND; public static final long MILLIS_PER_HOUR = MINUTES_PER_HOUR * MILLIS_PER_MINUTE; public static final long MILLIS_PER_DAY = HOURS_PER_DAY * MILLIS_PER_HOUR; public static final long MICROS_PER_MILLIS = 1000L; public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLIS; public static final long MICROS_PER_MINUTE = SECONDS_PER_MINUTE * MICROS_PER_SECOND; public static final long MICROS_PER_HOUR = MINUTES_PER_HOUR * MICROS_PER_MINUTE; public static final long MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR; public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY; /* 365.25 days per year assumes leap year every four years */ public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100; public static final long NANOS_PER_MICROS = 1000L; public static final long NANOS_PER_MILLIS = MICROS_PER_MILLIS * NANOS_PER_MICROS; public static final long NANOS_PER_SECOND = MILLIS_PER_SECOND * NANOS_PER_MILLIS; ``` The above parameters are defined in IntervalUtils, DateTimeUtils, and CalendarInterval, some of them are redundant, 
some of them are cross-referenced. ### Why are the changes needed? To simplify code, enhance consistency and reduce risks ### Does this PR introduce any user-facing change? no ### How was this patch tested? modified uts Closes #26399 from yaooqinn/SPARK-29757. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-07 19:48:19 +08:00
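The derived constants listed in SPARK-29757 can be sanity-checked with plain integer arithmetic. The sketch below mirrors the Java names in Python and reads the `MICROS_PER_YEAR` expression as `36525 * MICROS_PER_DAY / 100` (365.25 days), which is an interpretation of the listing rather than a copy of the Spark source.

```python
# Base units, as given in the commit message.
DAYS_PER_MONTH = 30
HOURS_PER_DAY = 24
MINUTES_PER_HOUR = 60
SECONDS_PER_MINUTE = 60
MILLIS_PER_SECOND = 1000
MICROS_PER_MILLIS = 1000
NANOS_PER_MICROS = 1000

# Derived units, each built from the previous one.
MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLIS
MICROS_PER_MINUTE = SECONDS_PER_MINUTE * MICROS_PER_SECOND
MICROS_PER_HOUR = MINUTES_PER_HOUR * MICROS_PER_MINUTE
MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR
MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY

# 365.25 days per year, kept in integers by scaling with 36525 / 100.
MICROS_PER_YEAR = (36525 * MICROS_PER_DAY) // 100

NANOS_PER_MILLIS = MICROS_PER_MILLIS * NANOS_PER_MICROS
NANOS_PER_SECOND = MILLIS_PER_SECOND * NANOS_PER_MILLIS
```

Multiplying before dividing keeps the year constant exact; `MICROS_PER_DAY` is divisible by 100, so no precision is lost in the integer division.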
Wenchen Fan	9b61f90987	[SPARK-29761][SQL] do not output leading 'interval' in CalendarInterval.toString ### What changes were proposed in this pull request? remove the leading "interval" in `CalendarInterval.toString`. ### Why are the changes needed? Although it's allowed to have "interval" prefix when casting string to interval, it's not recommended. This is also consistent with pgsql: ``` cloud0fan=# select interval '1' day; interval ---------- 1 day (1 row) ``` ### Does this PR introduce any user-facing change? yes, when displaying a DataFrame with an interval type column, the result is different. ### How was this patch tested? updated tests. Closes #26401 from cloud-fan/interval. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-07 15:44:50 +08:00
Maxim Gekk	29dc59ac29	[SPARK-29605][SQL] Optimize string to interval casting ### What changes were proposed in this pull request? In the PR, I propose new function `stringToInterval()` in `IntervalUtils` for converting `UTF8String` to `CalendarInterval`. The function is used in casting a `STRING` column to an `INTERVAL` column. ### Why are the changes needed? The proposed implementation is ~10 times faster. For example, parsing 9 interval units on JDK 8: Before: ``` 9 units w/ interval 14004 14125 116 0.1 14003.6 0.0X 9 units w/o interval 13785 14056 290 0.1 13784.9 0.0X ``` After: ``` 9 units w/ interval 1343 1344 1 0.7 1343.0 0.3X 9 units w/o interval 1345 1349 8 0.7 1344.6 0.3X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By new tests for `stringToInterval` in `IntervalUtilsSuite` - By existing tests Closes #26256 from MaxGekk/string-to-interval. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-07 12:39:52 +08:00
Kent Yao	3437862975	[SPARK-29387][SQL][FOLLOWUP] Fix issues of the multiply and divide for intervals ### What changes were proposed in this pull request? Handle the inconsistency in dividing by zero between literals and columns. Fix the null literal issue too. ### Why are the changes needed? BUG FIX ### 1 Handle the inconsistency in dividing by zero between literals and columns ```sql -- !query 24 select k, v, cast(k as interval) / v, cast(k as interval) * v from VALUES ('1 seconds', 1), ('2 seconds', 0), ('3 seconds', null), (null, null), (null, 0) t(k, v) -- !query 24 schema struct<k:string,v:int,divide_interval(CAST(k AS INTERVAL), CAST(v AS DOUBLE)):interval,multiply_interval(CAST(k AS INTERVAL), CAST(v AS DOUBLE)):interval> -- !query 24 output 1 seconds 1 interval 1 seconds interval 1 seconds 2 seconds 0 interval 0 microseconds interval 0 microseconds 3 seconds NULL NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL ``` ```sql -- !query 21 select interval '1 year 2 month' / 0 -- !query 21 schema struct<divide_interval(interval 1 years 2 months, CAST(0 AS DOUBLE)):interval> -- !query 21 output NULL ``` In the first case, `interval '2 seconds' / 0` produces `interval 0 microseconds`; in the second case, the result is `null`. ### 2 null literal issues ```sql -- !query 20 select interval '1 year 2 month' / null -- !query 20 schema struct<> -- !query 20 output org.apache.spark.sql.AnalysisException cannot resolve '(interval 1 years 2 months / NULL)' due to data type mismatch: differing types in '(interval 1 years 2 months / NULL)' (interval and null).; line 1 pos 7 -- !query 22 select interval '4 months 2 weeks 6 days' * null -- !query 22 schema struct<> -- !query 22 output org.apache.spark.sql.AnalysisException cannot resolve '(interval 4 months 20 days * NULL)' due to data type mismatch: differing types in '(interval 4 months 20 days * NULL)' (interval and null).; line 1 pos 7 -- !query 23 select null * interval '4 months 2 weeks 6 days' -- !query 23 schema struct<> -- !query 23 output 
org.apache.spark.sql.AnalysisException cannot resolve '(NULL * interval 4 months 20 days)' due to data type mismatch: differing types in '(NULL * interval 4 months 20 days)' (null and interval).; line 1 pos 7 ``` dividing or multiplying null literals, error occurs; where in column is fine as the first case ### Does this PR introduce any user-facing change? NO, maybe yes, but it is just a follow-up ### How was this patch tested? add uts cc cloud-fan MaxGekk maropu Closes #26410 from yaooqinn/SPARK-29387. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-07 12:19:03 +08:00
Wenchen Fan	1f3863c856	[SPARK-29759][SQL] LocalShuffleReaderExec.outputPartitioning should use the corrected attributes ### What changes were proposed in this pull request? Update `LocalShuffleReaderExec.outputPartitioning` to use attributes from `ReusedQueryStage`. This also removes the override `doCanonicalize` in local/coalesced shuffle reader, as these 2 operators change the output partitioning. It's not safe to strip them in the canonicalized query plan. ### Why are the changes needed? We will have an invalid output partitioning if we don't fix it. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26400 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-11-06 14:33:52 -08:00
Jungtaek Lim (HeartSaVioR)	782992c7ed	[SPARK-29642][SS] Change the element type of underlying array to UnsafeRow for ContinuousRecordEndpoint ### What changes were proposed in this pull request? This patch fixes the bug that `ContinuousMemoryStream[String]` throws a ClassCastException when casting String to UTF8String. This is because ContinuousMemoryStream and ContinuousRecordEndpoint use the original input as is for the underlying data structure of Row, and encoding is missing here. To force encoding, this patch changes the element type of underlying array to UnsafeRow instead of Any for ContinuousRecordEndpoint - ContinuousMemoryStream and TextSocketContinuousStream are modified to reflect the change. ### Why are the changes needed? Above section describes the bug. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add new UT to check for availability on couple of types. Closes #26300 from HeartSaVioR/SPARK-29642. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-11-06 10:37:00 -08:00
Wenchen Fan	411015300e	[SPARK-29752][SQL][TEST] make AdaptiveQueryExecSuite more robust ### What changes were proposed in this pull request? instead of checking the exact number of local shuffle readers, we should check whether the number of shuffles is equal to the number of local readers. ### Why are the changes needed? AQE is known to have randomness. We may pick different build side for broadcast join depending on which query stage finishes first. The decision to build side may add/remove shuffles downstream, so it's flaky to check the exact number of local shuffle readers. ### Does this PR introduce any user-facing change? no ### How was this patch tested? test only PR. Closes #26394 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-11-06 10:27:39 -08:00
Aman Omer	0dcd739534	[SPARK-29462] The data type of "array()" should be array<null> ### What changes were proposed in this pull request? During creation of an array, if `CreateArray` does not get any children from which to determine the array's data type, it creates an array of null type. ### Why are the changes needed? When an empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested manually Closes #26324 from amanomer/29462. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-06 18:39:46 +09:00
Liang-Chi Hsieh	6233958ab6	[SPARK-29680][SQL] Remove ALTER TABLE CHANGE COLUMN syntax ### What changes were proposed in this pull request? This patch removes v1 ALTER TABLE CHANGE COLUMN syntax. ### Why are the changes needed? Since in v2 we have ALTER TABLE CHANGE COLUMN and ALTER TABLE RENAME COLUMN, this old syntax is not necessary now and can be confusing. The v2 ALTER TABLE CHANGE COLUMN should fallback to v1 AlterTableChangeColumnCommand (#26354). ### Does this PR introduce any user-facing change? Yes, the old v1 ALTER TABLE CHANGE COLUMN syntax is removed. ### How was this patch tested? Unit tests. Closes #26338 from viirya/SPARK-29680. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-06 10:42:44 +08:00
Takeshi Yamamuro	20b9d8259b	[SPARK-29714][SQL][TESTS] Port insert.sql ### What changes were proposed in this pull request? This PR ports insert.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/insert.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/insert.out ### Why are the changes needed? To check behaviour differences between Spark and PostgreSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results Closes #26360 from maropu/InsertTest. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-05 16:44:54 -08:00
Maxim Gekk	4c53ac1822	[SPARK-29387][SQL] Support `*` and `/` operators for intervals ### What changes were proposed in this pull request? Added new expressions `MultiplyInterval` and `DivideInterval` to multiply/divide an interval by a numeric. Updated `TypeCoercion.DateTimeOperations` to turn the `Multiply`/`Divide` expressions of `CalendarIntervalType` and `NumericType` to `MultiplyInterval`/`DivideInterval`. To support new operations, added new methods `multiply()` and `divide()` to `CalendarInterval`. ### Why are the changes needed? - To maintain feature parity with PostgreSQL which supports multiplication and division of intervals by doubles: ```sql # select interval '1 hour' / double precision '1.5'; ?column? ---------- 00:40:00 ``` - To conform to the SQL standard which defines those operations: `numeric * interval`, `interval * numeric` and `interval / numeric`. See [4.5.3 Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt). - Improve Spark SQL UX and allow users to adjust interval columns. For example: ```sql spark-sql> select (timestamp'now' - timestamp'yesterday') * 1.3; interval 2 days 10 hours 39 minutes 38 seconds 568 milliseconds 900 microseconds ``` ### Does this PR introduce any user-facing change? Yes, previously the following query fails with the error: ```sql spark-sql> select interval 1 hour 30 minutes * 1.5; Error in query: cannot resolve '(interval 1 hours 30 minutes * 1.5BD)' due to data type mismatch: differing types in '(interval 1 hours 30 minutes * 1.5BD)' (interval and decimal(2,1)).; line 1 pos 7; ``` After: ```sql spark-sql> select interval 1 hour 30 minutes * 1.5; interval 2 hours 15 minutes ``` ### How was this patch tested? 
- Added tests for the `multiply()` and `divide()` methods to `CalendarIntervalSuite.java` - New test suite `IntervalExpressionsSuite` - by tests for `Multiply` -> `MultiplyInterval` and `Divide` -> `DivideInterval` in `TypeCoercionSuite` - updated `datetime.sql` Closes #26132 from MaxGekk/interval-mul-div. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-06 00:37:43 +08:00
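The two examples quoted in SPARK-29387 (1 hour 30 minutes times 1.5, and 1 hour divided by 1.5) can be reproduced with a deliberately simplified model. This sketch handles only the sub-day part of an interval as a microsecond count; it is not the `CalendarInterval.multiply()`/`divide()` implementation, and the helper names are hypothetical.

```python
MICROS_PER_SECOND = 1_000_000
MICROS_PER_MINUTE = 60 * MICROS_PER_SECOND
MICROS_PER_HOUR = 60 * MICROS_PER_MINUTE


def time_interval(hours=0, minutes=0, seconds=0):
    """Build a microsecond count from whole hours, minutes, and seconds."""
    return hours * MICROS_PER_HOUR + minutes * MICROS_PER_MINUTE + seconds * MICROS_PER_SECOND


def multiply(micros, factor):
    # Scale the microsecond count and round back to an integer.
    return round(micros * factor)


def divide(micros, divisor):
    # Division by zero is left to raise here; Spark's own handling differs.
    return round(micros / divisor)


def to_hms(micros):
    """Split a non-negative microsecond count into (hours, minutes, seconds)."""
    hours, rest = divmod(micros, MICROS_PER_HOUR)
    minutes, rest = divmod(rest, MICROS_PER_MINUTE)
    return hours, minutes, rest // MICROS_PER_SECOND


product = to_hms(multiply(time_interval(hours=1, minutes=30), 1.5))
quotient = to_hms(divide(time_interval(hours=1), 1.5))
```

`product` corresponds to the "2 hours 15 minutes" result in the commit message, and `quotient` to the PostgreSQL output `00:40:00`. Month and day fields would need their own scaling rules, which this sketch leaves out.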
Takeshi Yamamuro	41be5125a1	[SPARK-29648][SQL][TESTS] Port limit.sql ### What changes were proposed in this pull request? This PR ports limit.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/limit.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/limit.out ### Why are the changes needed? To check behaviour differences between Spark and PostgreSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results Closes #26311 from maropu/SPARK-29648. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-04 22:12:27 -08:00
Huaxin Gao	02eecfec99	[SPARK-29695][SQL] ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add AlterTableSerDePropertiesStatement and make ALTER TABLE ... SET SERDE/SERDEPROPERTIES go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog ALTER TABLE t SET SERDE 'org.apache.class' // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running ALTER TABLE ... SET SERDE/SERDEPROPERTIES, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26374 from huaxingao/spark_29695. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-04 21:42:39 -08:00
Terry Kim	66619b84d8	[SPARK-29630][SQL] Disallow creating a permanent view that references a temporary view in an expression ### What changes were proposed in this pull request? Disallow creating a permanent view that references a temporary view in expressions. ### Why are the changes needed? Creating a permanent view that references a temporary view is currently disallowed. For example, ```SQL # The following throws org.apache.spark.sql.AnalysisException # Not allowed to create a permanent view `per_view` by referencing a temporary view `tmp`; CREATE VIEW per_view AS SELECT t1.a, t2.b FROM base_table t1, (SELECT * FROM tmp) t2" ``` However, the following is allowed. ```SQL CREATE VIEW per_view AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM tmp); ``` This PR fixes the bug where temporary views used inside expressions are not checked. ### Does this PR introduce any user-facing change? Yes. Now the following SQL query throws an exception as expected: ```SQL # The following throws org.apache.spark.sql.AnalysisException # Not allowed to create a permanent view `per_view` by referencing a temporary view `tmp`; CREATE VIEW per_view AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM tmp); ``` ### How was this patch tested? Added new unit tests. Closes #26361 from imback82/spark-29630. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-05 13:19:46 +08:00
Takeshi Yamamuro	942a057934	[SPARK-29696][SQL][TESTS] Port groupingsets.sql ### What changes were proposed in this pull request? This PR ports groupingsets.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/groupingsets.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/groupingsets.out ### Why are the changes needed? To check behaviour differences between Spark and PostgreSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results Closes #26352 from maropu/GgroupingSets. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-04 19:06:28 -08:00


6657 commits