ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yuanjian Li	a6b91d2bf7	[SPARK-30556][SQL][FOLLOWUP] Reset the status changed in SQLExecution.withThreadLocalCaptured ### What changes were proposed in this pull request? Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured. ### Why are the changes needed? For code safety. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #27516 from xuanyuanking/SPARK-30556-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: herman <herman@databricks.com>	2020-02-10 22:16:25 +01:00
Liang-Chi Hsieh	acfdb46a60	[SPARK-27946][SQL][FOLLOW-UP] Change doc and error message for SHOW CREATE TABLE ### What changes were proposed in this pull request? This is a follow-up for #24938 to tweak error message and migration doc. ### Why are the changes needed? Making user know workaround if SHOW CREATE TABLE doesn't work for some Hive tables. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #27505 from viirya/SPARK-27946-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2020-02-10 10:45:00 -08:00
jiake	5a240603fd	[SPARK-30719][SQL] Add unit test to verify the log warning print when intentionally skip AQE ### What changes were proposed in this pull request? This is a follow up in [#27452](https://github.com/apache/spark/pull/27452). Add a unit test to verify whether the log warning is print when intentionally skip AQE. ### Why are the changes needed? Add unit test ### Does this PR introduce any user-facing change? No ### How was this patch tested? adding unit test Closes #27515 from JkSelf/aqeLoggingWarningTest. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-10 21:48:00 +08:00
Kent Yao	58b9ca1e6f	[SPARK-30592][SQL][FOLLOWUP] Add some round-trip test cases ### What changes were proposed in this pull request? Add round-trip tests for CSV and JSON functions as https://github.com/apache/spark/pull/27317#discussion_r376745135 asked. ### Why are the changes needed? improve test coverage ### Does this PR introduce any user-facing change? no ### How was this patch tested? add uts Closes #27510 from yaooqinn/SPARK-30592-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-10 16:23:44 +09:00
Liang-Chi Hsieh	9f8172e96a	Revert "[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project This reverts commit `a0e63b61e7`. ### What changes were proposed in this pull request? This reverts the patch at #26978 based on gatorsmile's suggestion. ### Why are the changes needed? Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27504 from viirya/revert-SPARK-29721. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-09 19:45:16 -08:00
Gengliang Wang	b877aac146	[SPARK-30684 ][WEBUI][FollowUp] A new approach for SPARK-30684 ### What changes were proposed in this pull request? Simplify the changes for adding metrics description for WholeStageCodegen in https://github.com/apache/spark/pull/27405 ### Why are the changes needed? In https://github.com/apache/spark/pull/27405, the UI changes can be made without using the function `adjustPositionOfOperationName` to adjust the position of operation name and mark as an operation-name class. I suggest we make simpler changes so that it would be easier for future development. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual test with the queries provided in https://github.com/apache/spark/pull/27405 ``` sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output") sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output") ``` ![image](https://user-images.githubusercontent.com/1097932/74073629-e3f09f00-49bf-11ea-90dc-1edb5ca29e5e.png) Closes #27490 from gengliangwang/wholeCodegenUI. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-02-09 14:18:51 -08:00
Nicholas Chammas	339c0f9a62	[SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options ### What changes were proposed in this pull request? This PR adds a doc builder for Spark SQL's configuration options. Here's what the new Spark SQL config docs look like ([configuration.html.zip](https://github.com/apache/spark/files/4172109/configuration.html.zip)): ![Screen Shot 2020-02-07 at 12 13 23 PM](https://user-images.githubusercontent.com/1039369/74050007-425b5480-49a3-11ea-818c-42700c54d1fb.png) Compare this to the [current docs](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql): ![Screen Shot 2020-02-04 at 4 55 10 PM](https://user-images.githubusercontent.com/1039369/73790828-24a5a980-476f-11ea-998c-12cd613883e8.png) ### Why are the changes needed? There is no visibility into the various Spark SQL configs on [the config docs page](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql). ### Does this PR introduce any user-facing change? No, apart from new documentation. ### How was this patch tested? I tested this manually by building the docs and reviewing them in my browser. Closes #27459 from nchammas/SPARK-30510-spark-sql-options. Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-09 19:20:47 +09:00
Yuanjian Li	3db3e39f11	[SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause ### What changes were proposed in this pull request? This is a follow-up for #25029, in this PR we throw an AnalysisException when name conflict is detected in nested WITH clause. In this way, the config `spark.sql.legacy.ctePrecedence.enabled` should be set explicitly for the expected behavior. ### Why are the changes needed? The original change might risky to end-users, it changes behavior silently. ### Does this PR introduce any user-facing change? Yes, change the config `spark.sql.legacy.ctePrecedence.enabled` as optional. ### How was this patch tested? New UT. Closes #27454 from xuanyuanking/SPARK-28228-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-08 14:10:28 -08:00
Terry Kim	a7451f44d2	[SPARK-30614][SQL] The native ALTER COLUMN syntax should change one property at a time ### What changes were proposed in this pull request? The current ALTER COLUMN syntax allows to change multiple properties at a time: ``` ALTER TABLE table=multipartIdentifier (ALTER \| CHANGE) COLUMN? column=multipartIdentifier (TYPE dataType)? (COMMENT comment=STRING)? colPosition? ``` The SQL standard (section 11.12) only allows changing one property at a time. This is also true on other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan) This PR proposes to change ALTER COLUMN to follow SQL standard, thus allows altering only one column property at a time. Note that ALTER COLUMN syntax being changed here is newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior. ### Why are the changes needed? To follow SQL standard (and other recent SQL systems) behavior. ### Does this PR introduce any user-facing change? Yes, now the user can update the column properties only one at a time. For example, ``` ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment' ``` should be broken into ``` ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment' ``` ### How was this patch tested? Updated existing tests. Closes #27444 from imback82/alter_column_one_at_a_time. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-08 02:47:44 +08:00
Maxim Gekk	a3e77773cf	[SPARK-30752][SQL] Fix `to_utc_timestamp` on daylight saving day ### What changes were proposed in this pull request? - Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API - Change types of `convertTz` parameters from `TimeZone` to `ZoneId`. This allows to avoid unnecessary conversions `TimeZone` -> `ZoneId` and performance regressions as a consequence. ### Why are the changes needed? - Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example: ```scala scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show +-------------------+ \| local UTC\| +-------------------+ \|2019-11-03 03:00:00\| +-------------------+ ``` but the result must be 2019-11-03 04:00:00: <img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png"> - Simplifies the code, and make it more maintainable - Switches `convertTz` on Proleptic Gregorian calendar used by Java 8 time classes by default. That makes the function consistent to other date-time functions. ### Does this PR introduce any user-facing change? Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`. ### How was this patch tested? - By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. - Added `convert time zones on a daylight saving day` to DateFunctionsSuite Closes #27474 from MaxGekk/port-convertTz-on-Java8-api. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-08 02:32:07 +08:00
Wenchen Fan	5a4c70b4e2	[SPARK-27986][SQL][FOLLOWUP] window aggregate function with filter predicate is not supported ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26656. We don't support window aggregate function with filter predicate yet and we should fail explicitly. Observable metrics has the same issue. This PR fixes it as well. ### Why are the changes needed? If we simply ignore filter predicate when we don't support it, the result is wrong. ### Does this PR introduce any user-facing change? yea, fix the query result. ### How was this patch tested? new tests Closes #27476 from cloud-fan/filter. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-06 13:33:39 -08:00
Wenchen Fan	8ce58627eb	[SPARK-30719][SQL] do not log warning if AQE is intentionally skipped and add a config to force apply ### What changes were proposed in this pull request? Update `InsertAdaptiveSparkPlan` to not log warning if AQE is skipped intentionally. This PR also add a config to not skip AQE. ### Why are the changes needed? It's not a warning at all if we intentionally skip AQE. ### Does this PR introduce any user-facing change? no ### How was this patch tested? run `AdaptiveQueryExecSuite` locally and verify that there is no warning logs. Closes #27452 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-06 09:16:14 -08:00
yi.wu	368ee62a5d	[SPARK-27297][DOC][FOLLOW-UP] Improve documentation for various Scala functions ### What changes were proposed in this pull request? Add examples and parameter description for these Scala functions: * transform * exists * forall * aggregate * zip_with * transform_keys * transform_values * map_filter * map_zip_with ### Why are the changes needed? Better documentation for UX. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27449 from Ngone51/doc-funcs. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 20:34:29 +08:00
yi.wu	3f5b23340e	[SPARK-30744][SQL] Optimize AnalyzePartitionCommand by calculating location sizes in parallel ### What changes were proposed in this pull request? Use `CommandUtils.calculateTotalLocationSize` for `AnalyzePartitionCommand` in order to calculate location sizes in parallel. ### Why are the changes needed? For better performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27471 from Ngone51/dev_calculate_in_parallel. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 20:20:44 +08:00
beliefer	c8ef1dee90	[SPARK-29108][SQL][TESTS][FOLLOWUP] Comment out no use test case and add 'insert into' statement of window.sql (Part 2) ### What changes were proposed in this pull request? When I running the `window_part2.sql` tests find it lack insert sql. Therefore, the output is empty. I checked the postgresql and reference https://github.com/postgres/postgres/blob/master/src/test/regress/sql/window.sql Although `window_part1.sql` and `window_part3.sql` exists the insert sql, I think should also add it into `window_part2.sql`. Because only one case reference the table `empsalary` and it throws `AnalysisException`. ``` -- !query select last(salary) over(order by salary range between 1000 preceding and 1000 following), lag(salary) over(order by salary range between 1000 preceding and 1000 following), salary from empsalary -- !query schema struct<> -- !query output org.apache.spark.sql.AnalysisException Window Frame specifiedwindowframe(RangeFrame, -1000, 1000) must match the required frame specifiedwindowframe(RowFrame, -1, -1); ``` So we should do four work: 1. comment out the only one case and create a new ticket. 2. Add `INSERT INTO empsalary`. Note: window_part4.sql not use the table `empsalary`. ### Why are the changes needed? Supplementary test data. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New test case Closes #27439 from beliefer/add-insert-to-window. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-06 15:24:26 +09:00
Terry Kim	c27a616450	[SPARK-30612][SQL] Resolve qualified column name with v2 tables ### What changes were proposed in this pull request? This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables. This PR would allow qualified column names in query as following: ```SQL SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT tbl.foo FROM testcat.ns1.ns2.tbl ``` ### Why are the changes needed? This is a bug because you cannot qualify column names in queries. ### Does this PR introduce any user-facing change? Yes, now users can qualify column names for v2 tables. ### How was this patch tested? Added new tests. Closes #27391 from imback82/qualified_col. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 13:54:17 +08:00
Wenchen Fan	3b26f807a0	[SPARK-30721][SQL][TESTS] Fix DataFrameAggregateSuite when enabling AQE ### What changes were proposed in this pull request? update `DataFrameAggregateSuite` to make it pass with AQE ### Why are the changes needed? We don't need to turn off AQE in `DataFrameAggregateSuite` ### Does this PR introduce any user-facing change? no ### How was this patch tested? run `DataFrameAggregateSuite` locally with AQE on. Closes #27451 from cloud-fan/aqe-test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-05 12:36:51 -08:00
Yuanjian Li	4938905a1c	[SPARK-29864][SQL][FOLLOWUP] Reference the config for the old behavior in error message ### What changes were proposed in this pull request? Follow up work for SPARK-29864, reference the config `spark.sql.legacy.fromDayTimeString.enabled` in error message. ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27464 from xuanyuanking/SPARK-29864-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-05 11:19:42 -08:00
turbofei	6d507b4a31	[SPARK-26218][SQL][FOLLOW UP] Fix the corner case when casting float to Integer ### What changes were proposed in this pull request? When spark.sql.ansi.enabled is true, for the statement: ``` select cast(cast(2147483648 as Float) as Integer) //result is 2147483647 ``` Its result is 2147483647 and does not throw `ArithmeticException`. The root cause is that, the below code does not work for some corner cases. `94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)` For example: ![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png) In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly. ### Why are the changes needed? Result corrupt. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added Unit test. Closes #27151 from turboFei/SPARK-26218-follow-up-int-overflow. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-05 21:24:02 +08:00
Maxim Gekk	459e757ed4	[SPARK-30668][SQL] Support `SimpleDateFormat` patterns in parsing timestamps/dates strings ### What changes were proposed in this pull request? In the PR, I propose to partially revert the commit `51a6ba0181`, and provide a legacy parser based on `FastDateFormat` which is compatible to `SimpleDateFormat`. To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`. ### Why are the changes needed? To allow users to restore old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is `DateTimeFormatter`'s patterns are not fully compatible to `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668 ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - Added new test to `DateFunctionsSuite` - Restored additional test cases in `JsonInferSchemaSuite`. Closes #27441 from MaxGekk/support-simpledateformat. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-05 18:48:45 +08:00
Liang-Chi Hsieh	7631275f97	[SPARK-25040][SQL][FOLLOWUP] Add legacy config for allowing empty strings for certain types in json parser ### What changes were proposed in this pull request? This is a follow-up for #22787. In #22787 we disallowed empty strings for json parser except for string and binary types. This follow-up adds a legacy config for restoring previous behavior of allowing empty string. ### Why are the changes needed? Adding a legacy config to make migration easy for Spark users. ### Does this PR introduce any user-facing change? Yes. If set this legacy config to true, the users can restore previous behavior prior to Spark 3.0.0. ### How was this patch tested? Unit test. Closes #27456 from viirya/SPARK-25040-followup. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-04 17:22:23 -08:00
maryannxue	6097b343ba	[SPARK-30717][SQL] AQE subquery map should cache `SubqueryExec` instead of `ExecSubqueryExpression` ### What changes were proposed in this pull request? This PR is to fix a potential bug in AQE where an `ExecSubqueryExpression` could be mistakenly replaced with another `ExecSubqueryExpression` with the same `ListQuery` but a different `child` expression. This is because a ListQuery's id can only identify the ListQuery itself, not the parent expression `InSubquery`, but right now the `subqueryMap` in `InsertAdaptiveSparkPlan` uses the `ListQuery`'s id as key and the corresponding `InSubqueryExec` for the `ListQuery`'s parent expression as value. So the fix uses the corresponding `SubqueryExec` for the `ListQuery` itself as the map's value. ### Why are the changes needed? This logical bug could potentially cause a wrong query plan, which could throw an exception related to unresolved columns. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Passed existing UTs. Closes #27446 from maryannxue/spark-30717. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-04 12:31:44 +08:00
Yuanjian Li	a4912cee61	[SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf ### What changes were proposed in this pull request? Put the configs below needed by Structured Streaming UI into StaticSQLConf: - spark.sql.streaming.ui.enabled - spark.sql.streaming.ui.retainedProgressUpdates - spark.sql.streaming.ui.retainedQueries ### Why are the changes needed? Make all SS UI configs consistent with other similar configs in usage and naming. ### Does this PR introduce any user-facing change? Yes, add new static config `spark.sql.streaming.ui.retainedProgressUpdates`. ### How was this patch tested? Existing UT. Closes #27425 from xuanyuanking/SPARK-29543-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-02-02 23:37:13 -08:00
Burak Yavuz	2eccfd8a73	[SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView ### What changes were proposed in this pull request? Adds NoSuchDatabaseException and NoSuchNamespaceException to the `isView` method for SessionCatalog. ### Why are the changes needed? This method prevents specialized resolutions from kicking in within Analysis when using V2 Catalogs if the identifier is a specialized identifier. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to DataSourceV2SessionCatalogSuite Closes #27423 from brkyvz/isViewF. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-03 14:08:59 +08:00
Liang-Chi Hsieh	8eecc20b11	[SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" ## What changes were proposed in this pull request? This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table. For original `SHOW CREATE TABLE`, it now shows Spark DDL always. If given a Hive table, it tries to generate Spark DDL. For Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If can't find a mapping there, throws an analysis exception on unsupported serde configuration. It is arguably that some Hive fileformat + row serde might be mapped to Spark data source, e.g., CSV. It is not included in this PR. To be conservative, it may not be supported. For Hive serde properties, for now this doesn't save it to Spark DDL because it may not useful to keep Hive serde properties in Spark table. ## How was this patch tested? Added test. Closes #24938 from viirya/SPARK-27946. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-31 19:55:25 -08:00
yi.wu	82b4f753a0	[SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource ### What changes were proposed in this pull request? This PR adds `SparkSession.executeCommand` API for external datasource to execute a random command like ``` val df = spark.executeCommand("xxxCommand", "xxxSource", "xxxOptions") ``` Note that the command doesn't execute in Spark, but inside an external execution engine depending on data source. And it will be eagerly executed after `executeCommand` called and the returned `DataFrame` will contain the output of the command(if any). ### Why are the changes needed? This can be useful when user wants to execute some commands out of Spark. For example, executing custom DDL/DML command for JDBC, creating index for ElasticSearch, creating cores for Solr and so on(as HyukjinKwon suggested). Previously, user needs to use an option to achieve the goal, e.g. `spark.read.format("xxxSource").option("command", "xxxCommand").load()`, which is kind of cumbersome. With this change, it can be more convenient for user to achieve the same goal. ### Does this PR introduce any user-facing change? Yes, new API from `SparkSession` and a new interface `ExternalCommandRunnableProvider`. ### How was this patch tested? Added a new test suite. Closes #27199 from Ngone51/dev-executeCommand. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-31 15:05:26 -08:00
Maxim Gekk	2d4b5eaee4	[SPARK-30676][CORE][TESTS] Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double ### What changes were proposed in this pull request? - Replace `new Integer(0)` by a serializable instance in RDD.scala - Use `.valueOf()` instead of constructors of `java.lang.Integer` and `java.lang.Double` because constructors has been deprecated, see https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html ### Why are the changes needed? This fixes the following warnings: 1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. 5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By RDDSuite, MutableProjectionSuite, UDFSuite and HiveUserDefinedTypeSuite Closes #27399 from MaxGekk/eliminate-warning-part4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-31 15:03:16 -06:00
Kousuke Saruta	18bc4e55ef	[SPARK-30684][WEBUI] Show the descripton of metrics for WholeStageCodegen in DAG viz ### What changes were proposed in this pull request? Added description for metrics shown in the WholeStageCodegen-node in DAG viz. This is before the change is applied. ![before-changed](https://user-images.githubusercontent.com/4736016/73469870-5cf16480-43ca-11ea-9a13-714083508a3b.png) And following is after change. ![after-fixing-layout](https://user-images.githubusercontent.com/4736016/73469364-983f6380-43c9-11ea-8b7e-ddab030d0270.png) For this change, I also modify the layout of DAG viz. Actually, I noticed it's not enough to just added the description. Following is without changing the layout. ![layout-is-broken](https://user-images.githubusercontent.com/4736016/73470178-cffadb00-43ca-11ea-86d7-aed109b105e6.png) ### Why are the changes needed? Users can't understand what those metrics mean. ### Does this PR introduce any user-facing change? Yes. The layout is a little bit changed. ### How was this patch tested? I confirm the result of DAG viz with following 3 operations. `sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show` `sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output")` `sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output")` Closes #27405 from sarutak/sql-dag-metrics. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-31 11:58:52 -08:00
Wenchen Fan	33546d637d	Revert "[SPARK-30036][SQL] Fix: REPARTITION hint does not work with order by" This reverts commit `a2de20c0e6`.	2020-02-01 03:02:52 +08:00
Jungtaek Lim (HeartSaVioR)	5e0faf9a3d	[SPARK-29779][SPARK-30479][CORE][SQL][FOLLOWUP] Reflect review comments on post-hoc review ### What changes were proposed in this pull request? This PR reflects review comments on post-hoc review among PRs for SPARK-29779 (#27085), SPARK-30479 (#27164). The list of review comments this PR addresses are below: * https://github.com/apache/spark/pull/27085#discussion_r373304218 * https://github.com/apache/spark/pull/27164#discussion_r373300793 * https://github.com/apache/spark/pull/27164#discussion_r373301193 * https://github.com/apache/spark/pull/27164#discussion_r373301351 I also applied review comments to the CORE module (BasicEventFilterBuilder.scala) as well, as the review comments for SQL/core module (SQLEventFilterBuilder.scala) can be applied there as well. ### Why are the changes needed? There're post-hoc reviews on PRs for such issues, like links in above section. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27414 from HeartSaVioR/SPARK-28869-SPARK-29779-SPARK-30479-FOLLOWUP-posthoc-reviews. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-31 10:17:07 -08:00
Tathagata Das	481e5211d2	[SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits This PR solves two bugs related to streaming limits Bug 1 (SPARK-30658): Limit before a streaming aggregate (i.e. `df.limit(5).groupBy().count()`) in complete mode was not being planned as a stateful streaming limit. The planner rule planned a logical limit with a stateful streaming limit plan only if the query is in append mode. As a result, instead of allowing max 5 rows across batches, the planned streaming query was allowing 5 rows in every batch thus producing incorrect results. Solution: Change the planner rule to plan the logical limit with a streaming limit plan even when the query is in complete mode if the logical limit has no stateful operator before it. Bug 2 (SPARK-30657): `LocalLimitExec` does not consume the iterator of the child plan. So if there is a limit after a stateful operator like streaming dedup in append mode (e.g. `df.dropDuplicates().limit(5)`), the state changes of streaming duplicate may not be committed (most stateful ops commit state changes only after the generated iterator is fully consumed). Solution: Change the planner rule to always use a new `StreamingLocalLimitExec` which always fully consumes the iterator. This is the safest thing to do. However, this will introduce a performance regression as consuming the iterator is extra work. To minimize this performance impact, add an additional post-planner optimization rule to replace `StreamingLocalLimitExec` with `LocalLimitExec` when there is no stateful operator before the limit that could be affected by it. No Updated incorrect unit tests and added new ones Closes #27373 from tdas/SPARK-30657. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-01-31 09:27:34 -08:00
yi.wu	5ccbb38a71	[SPARK-29938][SQL][FOLLOW-UP] Improve AlterTableAddPartitionCommand All credit to Ngone51, Closes #27293. ### What changes were proposed in this pull request? This PR improves `AlterTableAddPartitionCommand` by: 1. adds an internal config for partitions batch size to avoid hard code 2. reuse `InMemoryFileIndex.bulkListLeafFiles` to perform parallel file listing to improve code reuse ### Why are the changes needed? Improve code quality. ### Does this PR introduce any user-facing change? Yes. We renamed `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `spark.sql.parallelFileListingInCommands.enabled` as a side effect of this change. ### How was this patch tested? Pass Jenkins. Closes #27413 from xuanyuanking/SPARK-29938. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-01 01:03:00 +08:00
Burak Yavuz	290a528bff	[SPARK-30615][SQL] Introduce Analyzer rule for V2 AlterTable column change resolution ### What changes were proposed in this pull request? Adds an Analyzer rule to normalize the column names used in V2 AlterTable table changes. We need to handle all ColumnChange operations. We add an extra match statement for future proofing new changes that may be added. This prevents downstream consumers (e.g. catalogs) to deal about case sensitivity or check that columns exist, etc. We also fix the behavior for ALTER TABLE CHANGE COLUMN (Hive style syntax) for adding comments to complex data types. Currently, the data type needs to be provided as part of the Hive style syntax. This assumes that the data type as changed when it may have not and the user only wants to add a comment, which fails in CheckAnalysis. ### Why are the changes needed? Currently we do not handle case sensitivity correctly for ALTER TABLE ALTER COLUMN operations. ### Does this PR introduce any user-facing change? No, fixes a bug. ### How was this patch tested? Introduced v2CommandsCaseSensitivitySuite and added a test around HiveStyle Change columns to PlanResolutionSuite Closes #27350 from brkyvz/normalizeAlter. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-31 16:41:10 +08:00
herman	a5c7090ffa	[SPARK-30671][SQL] emptyDataFrame should use a LocalRelation ### What changes were proposed in this pull request? This PR makes `SparkSession.emptyDataFrame` use an empty local relation instead of an empty RDD. This allows to optimizer to recognize this as an empty relation, and creates the opportunity to do some more aggressive optimizations. ### Why are the changes needed? It allows us to optimize empty dataframes better. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a test case to `DataFrameSuite`. Closes #27400 from hvanhovell/SPARK-30671. Authored-by: herman <herman@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-31 16:14:07 +09:00
Burak Yavuz	1cd19ad92d	[SPARK-30669][SS] Introduce AdmissionControl APIs for StructuredStreaming ### What changes were proposed in this pull request? We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. A ReadLimit defines how much data should be read in the next micro-batch. `SupportsAdmissionControl` specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch or impose its own limit if the Trigger is Trigger.Once() for example. We then use this interface in FileStreamSource, KafkaSource, and KafkaMicroBatchStream. ### Why are the changes needed? Sources currently have no information around execution semantics such as whether the stream is being executed in Trigger.Once() mode. This interface will pass this information into the sources as part of planning. With a trigger like Trigger.Once(), the semantics are to process all the data available to the datasource in a single micro-batch. However, this semantic can be broken when data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate limit the amount of data read for that micro-batch without this interface. ### Does this PR introduce any user-facing change? DataSource developers can extend this interface for their streaming sources to add admission control into their system and correctly support Trigger.Once(). ### How was this patch tested? Existing tests, as this API is mostly internal Closes #27380 from brkyvz/rateLimit. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-01-30 22:02:48 -08:00
sandeep katta	5f3ec6250f	[SPARK-30362][CORE] Update InputMetrics in DataSourceRDD ### What changes were proposed in this pull request? Incase of DS v2 InputMetrics are not updated Before Fix ![inputMetrics](https://user-images.githubusercontent.com/35216143/71501010-c216df00-288d-11ea-8522-fdd50b13eae1.png) After Fix we can see that `Input Size / Records` is updated in the UI ![image](https://user-images.githubusercontent.com/35216143/71501000-b88d7700-288d-11ea-92fe-a727b2b79908.png) ### Why are the changes needed? InputMetrics like bytesread and recordread should be updated ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT and also verified manually Closes #27021 from sandeep-katta/dsv2inputmetrics. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-31 14:01:32 +08:00
Wenchen Fan	9f42be25eb	[SPARK-29665][SQL] refine the TableProvider interface ### What changes were proposed in this pull request? Instead of having several overloads of `getTable` method in `TableProvider`, it's better to have 2 methods explicitly: `inferSchema` and `inferPartitioning`. With a single `getTable` method that takes everything: schema, partitioning and properties. This PR also adds a `supportsExternalMetadata` method in `TableProvider`, to indicate if the source support external table metadata. If this flag is false: 1. spark.read.schema... is disallowed and fails 2. when we support creating v2 tables in session catalog, spark only keeps table properties in the catalog. ### Why are the changes needed? API improvement. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26868 from cloud-fan/provider2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-31 13:37:43 +08:00
Jungtaek Lim (HeartSaVioR)	cbb714f67e	[SPARK-29438][SS] Use partition ID of StateStoreAwareZipPartitionsRDD for determining partition ID of state store in stream-stream join ### What changes were proposed in this pull request? Credit to uncleGen for discovering the problem and providing simple reproducer as UT. New UT in this patch is borrowed from #26156 and I'm retaining a commit from #26156 (except unnecessary part on this path) to properly give a credit. This patch fixes the issue that partition ID could be mis-assigned when the query contains UNION and stream-stream join is placed on the right side. We assume the range of partition IDs as `(0 ~ number of shuffle partitions - 1)` for stateful operators, but when we use stream-stream join on the right side of UNION, the range of partition ID of task goes to `(number of partitions in left side, number of partitions in left side + number of shuffle partitions - 1)`, which `number of partitions in left side` can be changed in some cases (new UT points out the one of the cases). The root reason of bug is that stream-stream join picks the partition ID from TaskContext, which wouldn't be same as partition ID from source if union is being used. Hopefully we can pick the right partition ID from source in StateStoreAwareZipPartitionsRDD - this patch leverages that partition ID. ### Why are the changes needed? This patch will fix the broken of assumption of partition range on stateful operator, as well as fix the issue reported in JIRA issue SPARK-29438. ### Does this PR introduce any user-facing change? Yes, if their query is using UNION and stream-stream join is placed on the right side. They may encounter the problem to read state from checkpoint and may need to discard checkpoint to continue. ### How was this patch tested? Added UT which fails on current master branch, and passes with this patch. Closes #26162 from HeartSaVioR/SPARK-29438. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2020-01-30 20:21:43 -08:00
Wenchen Fan	e5f572af06	[SPARK-30680][SQL] ResolvedNamespace does not require a namespace catalog ### What changes were proposed in this pull request? Update `ResolvedNamespace` to accept catalog as `CatalogPlugin` not `SupportsNamespaces`. This is extracted from https://github.com/apache/spark/pull/27345 ### Why are the changes needed? not all commands that need to resolve namespaces require a namespace catalog. For example, `SHOW TABLE` is implemented by `TableCatalog.listTables`, and is nothing to do with `SupportsNamespace`. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27403 from cloud-fan/ns. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-30 10:34:59 -08:00
zero323	b1f81f0072	[MINOR][SQL][DOCS] Fix typos in scaladoc strings of higher order functions ### What changes were proposed in this pull request? Fix following typos: - tranformation -> transformation - the boolean -> the Boolean - signle -> single ### Why are the changes needed? ### Does this PR introduce any user-facing change? No ### How was this patch tested? Scala linter. Closes #27382 from zero323/functions-typos. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-29 18:42:18 -06:00
uncleGen	7173786153	[SPARK-29543][SS][UI] Structured Streaming Web UI ### What changes were proposed in this pull request? This PR adds two pages to Web UI for Structured Streaming: - "/streamingquery": Streaming Query Page, providing some aggregate information for running/completed streaming queries. - "/streamingquery/statistics": Streaming Query Statistics Page, providing detailed information for streaming query, including `Input Rate`, `Process Rate`, `Input Rows`, `Batch Duration` and `Operation Duration` ![Screen Shot 2020-01-29 at 1 38 00 PM](https://user-images.githubusercontent.com/1000778/73399837-cd01cc80-429c-11ea-9d4b-1d200a41b8d5.png) ![Screen Shot 2020-01-29 at 1 39 16 PM](https://user-images.githubusercontent.com/1000778/73399838-cd01cc80-429c-11ea-8185-4e56db6866bd.png) ### Why are the changes needed? It helps users to better monitor Structured Streaming query. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - new added and existing UTs - manual test Closes #26201 from uncleGen/SPARK-29543. Lead-authored-by: uncleGen <hustyugm@gmail.com> Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Genmao Yu <hustyugm@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-01-29 13:43:51 -08:00
Takeshi Yamamuro	ec1fb6b4e1	[SPARK-30234][SQL][FOLLOWUP] Add `.enabled` in the suffix of the ADD FILE legacy option ### What changes were proposed in this pull request? This pr intends to rename `spark.sql.legacy.addDirectory.recursive` into `spark.sql.legacy.addDirectory.recursive.enabled`. ### Why are the changes needed? For consistent option names. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27372 from maropu/SPARK-30234-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-29 12:23:59 +09:00
Maxim Gekk	8aebc80e0e	[SPARK-30625][SQL] Support `escape` as third parameter of the `like` function ### What changes were proposed in this pull request? In the PR, I propose to transform the `Like` expression to `TernaryExpression`, and add third parameter `escape`. So, the `like` function will have feature parity with `LIKE ... ESCAPE` syntax supported by `187f3c1773`. ### Why are the changes needed? The `like` functions can be called with 2 or 3 parameters, and functionally equivalent to `LIKE` and `LIKE ... ESCAPE` SQL expressions. ### Does this PR introduce any user-facing change? Yes, before `like` fails with the exception: ```sql spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7 ``` After: ```sql spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); true ``` ### How was this patch tested? - Add new example for the `like` function which is checked by `SQLQuerySuite` - Run `RegexpExpressionsSuite` and `ExpressionParserSuite`. Closes #27355 from MaxGekk/like-3-args. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-27 11:19:32 -08:00
Jungtaek Lim (HeartSaVioR)	0436b3d3f8	[SPARK-30653][INFRA][SQL] EOL character enforcement for java/scala/xml/py/R files ### What changes were proposed in this pull request? This patch converts CR/LF into LF in 3 source files, which most files are only using LF. This patch also add rules to enforce EOL as LF for all java, scala, xml, py, R files. ### Why are the changes needed? The majority of source code files are using LF and only three files are CR/LF. While using IDE would let us don't bother with the difference, it still has a chance to make unnecessary diff if the file is modified with the editor which doesn't handle it automatically. ### Does this PR introduce any user-facing change? No ### How was this patch tested? ``` grep -IUrl --color "^M" . \| grep "\.java\\|\.scala\\|\.xml\\|\.py\\|\.R" \| grep -v "/target/" \| grep -v "/build/" \| grep -v "/dist/" \| grep -v "dependency-reduced-pom.xml" \| grep -v ".pyc" ``` (Please note you'll need to type CTRL+V -> CTRL+M in bash shell to get `^M` because it's representing CR/LF, not a combination of `^` and `M`.) Before the patch, the result is: ``` ./sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala ./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala ``` and after the patch, the result is None. And git shows WARNING message if EOL of any of source files in given types are modified to CR/LF, like below: ``` warning: CRLF will be replaced by LF in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala. The file will have its original line endings in your working directory. ``` Closes #27365 from HeartSaVioR/MINOR-remove-CRLF-in-source-codes. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-27 10:20:51 -08:00
Yuchen Huo	d0800fc8e2	[SPARK-30314] Add identifier and catalog information to DataSourceV2Relation ### What changes were proposed in this pull request? Add identifier and catalog information in DataSourceV2Relation so it would be possible to do richer checks in checkAnalysis step. ### Why are the changes needed? In data source v2, table implementations are all customized so we may not be able to get the resolved identifier from tables them selves. Therefore we encode the table and catalog information in DSV2Relation so no external changes are needed to make sure this information is available. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests in the following suites: CatalogManagerSuite.scala CatalogV2UtilSuite.scala SupportsCatalogOptionsSuite.scala PlanResolutionSuite.scala Closes #26957 from yuchenhuo/SPARK-30314. Authored-by: Yuchen Huo <yuchen.huo@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-01-26 12:59:24 -08:00
Xiao Li	48f647882a	[SPARK-30644][SQL][TEST] Remove query index from the golden files of SQLQueryTestSuite ### What changes were proposed in this pull request? This PR is to remove query index from the golden files of SQLQueryTestSuite ### Why are the changes needed? Because the SQLQueryTestSuite's golden files have the query index for each query, removal of any query statement [except the last one] will generate many unneeded difference. This will make code review harder. The number of changed lines is misleading. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27361 from gatorsmile/removeIndexNum. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-25 23:17:36 -08:00
Xiao Li	d69ed9afdf	Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" This reverts commit `1d20d13149`. Closes #27351 from gatorsmile/revertSPARK25496. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-25 21:34:12 -08:00
Liang-Chi Hsieh	a0e63b61e7	[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project ### What changes were proposed in this pull request? This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it. ### Why are the changes needed? In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate, if no Project on top of it. We should prune it too. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #26978 from viirya/SPARK-29721. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-24 22:17:28 -08:00
Gengliang Wang	ed44926117	[SPARK-30627][SQL] Disable all the V2 file sources by default ### What changes were proposed in this pull request? Disable all the V2 file sources in Spark 3.0 by default. ### Why are the changes needed? There are still some missing parts in the file source V2 framework: 1. It doesn't support reporting file scan metrics such as "numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. This requires another patch in the data source V2 framework. Tracked by [SPARK-30362](https://issues.apache.org/jira/browse/SPARK-30362) 2. It doesn't support partition pruning with subqueries(including dynamic partition pruning) for now. Tracked by [SPARK-30628](https://issues.apache.org/jira/browse/SPARK-30628) As we are going to code freeze on Jan 31st, this PR proposes to disable all the V2 file sources in Spark 3.0 by default. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #27348 from gengliangwang/disableFileSourceV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-23 21:42:43 -08:00
Xiao Li	ddf83159a8	[SPARK-28962][SQL][FOLLOW-UP] Add the parameter description for the Scala function API filter ### What changes were proposed in this pull request? This PR is a follow-up PR https://github.com/apache/spark/pull/25666 for adding the description and example for the Scala function API `filter`. ### Why are the changes needed? It is hard to tell which parameter is the index column. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27336 from gatorsmile/spark28962. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-23 16:23:16 -08:00

1 2 3 4 5 ...

6465 commits