ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
maryannxue	8fabbab299	[SPARK-29350] Fix BroadcastExchange reuse in Dynamic Partition Pruning ### What changes were proposed in this pull request? Dynamic partition pruning filters are added as an in-subquery containing a `BroadcastExchangeExec` in case of a broadcast hash join. This PR makes the `ReuseExchange` rule visit in-subquery nodes, to ensure the new `BroadcastExchangeExec` added by dynamic partition pruning can be reused. ### Why are the changes needed? The initial dynamic partition pruning PR did not enable this reuse, which means a broadcast exchange would be executed twice, in the main query and in the DPP filter. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added broadcast exchange reuse check in `DynamicPartitionPruningSuite` Closes #26015 from maryannxue/exchange-reuse. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-03 16:11:32 -07:00
Nik Vanderhoof	6f687691ef	[SPARK-28962][SPARK-27297][SQL] Add overload for filter with index to functions object ### What changes were proposed in this pull request? Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`. ### Why are the changes needed? See: SPARK-28962 and SPARK-27297. Specifically, ueshin pointed out the discrepancy here: https://github.com/apache/spark/pull/24232#issuecomment-533288653 ### Does this PR introduce any user-facing change? ### How was this patch tested? Updated these test suites: `test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite` and `org.apache.spark.sql.DataFrameFunctionsSuite` Closes #26007 from nvander1/add_index_overload_for_filter. Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-03 11:12:14 -07:00
Dongjoon Hyun	4e0e4e51c4	[MINOR][TESTS] Rename JSONBenchmark to JsonBenchmark ### What changes were proposed in this pull request? This PR renames `object JSONBenchmark` to `object JsonBenchmark` and the benchmark result file `JSONBenchmark-results.txt` to `JsonBenchmark-results.txt`. ### Why are the changes needed? Since the file name doesn't match `object JSONBenchmark`, it causes confusion when we run the benchmark. In addition, this makes automation difficult. ``` $ find . -name JsonBenchmark.scala ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala ``` ``` $ build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JsonBenchmark" [info] Running org.apache.spark.sql.execution.datasources.json.JsonBenchmark [error] Error: Could not find or load main class org.apache.spark.sql.execution.datasources.json.JsonBenchmark ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is just renaming. Closes #26008 from dongjoon-hyun/SPARK-RENAME-JSON. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 09:02:06 -07:00
Dongjoon Hyun	854a0f752e	[SPARK-29320][TESTS] Compare `sql/core` module in JDK8/11 (Part 1) ### What changes were proposed in this pull request? This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the result. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered as a rough comparison. A. EXPECTED CASES(JDK11 is faster in general) - [x] BloomFilterBenchmark (JDK11 is faster except one case) - [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC) - [x] CSVBenchmark (JDK11 is faster except five cases) - [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`) - [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type) - [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases) - [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS) - [x] HashedRelationMetricsBenchmark (JDK11 is faster) - [x] JSONBenchmark (JDK11 is much faster except eight cases) - [x] JoinBenchmark (JDK11 is faster except five cases) - [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases) - [x] PrimitiveArrayBenchmark (JDK11 is faster) - [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case) - [x] UDFBenchmark (N/A, values are too small) - [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case) - [x] WideTableBenchmark (JDK11 is faster except two cases) B. 
CASES WE NEED TO INVESTIGATE MORE LATER - [x] AggregateBenchmark (JDK11 is slower in general) - [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`) - [x] DataSourceReadBenchmark (JDK11 is slower in general) - [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`) - [x] MakeDateTimeBenchmark (JDK11 is slower except two cases) - [x] MiscBenchmark (JDK11 is slower except ten cases) - [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower) - [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases) - [x] RangeBenchmark (JDK11 is slower except one case) `FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because they take a long time. ### Why are the changes needed? According to the result, there are some differences between JDK8/JDK11. This will be a baseline for the future improvement and comparison. Also, as a reproducible environment, the following environment is used. - Instance: `r3.xlarge` - OS: `CentOS Linux release 7.5.1804 (Core)` - JDK: - `OpenJDK Runtime Environment (build 1.8.0_222-b10)` - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only PR. We need to run the benchmarks. Closes #26003 from dongjoon-hyun/SPARK-29320. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 08:58:25 -07:00
Sean Owen	7aca0dd658	[SPARK-29296][BUILD][CORE] Remove use of .par to make 2.13 support easier; add scala-2.13 profile to enable pulling in par collections library separately, for the future ### What changes were proposed in this pull request? Scala 2.13 removes the parallel collections classes to a separate library, so first, this establishes a `scala-2.13` profile to bring it back, for future use. However the library enables use of `.par` implicit conversions via a new class that is not in 2.12, which makes cross-building hard. This implements a suggested workaround from https://github.com/scala/scala-parallel-collections/issues/22 to avoid `.par` entirely. ### Why are the changes needed? To compile for 2.13 and later to work with 2.13. ### Does this PR introduce any user-facing change? Should not, no. ### How was this patch tested? Existing tests. Closes #25980 from srowen/SPARK-29296. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-03 08:56:08 -05:00
s71955	ee66890f30	[SPARK-28084][SQL] Resolving the partition column name based on the resolver in sql load command ### What changes were proposed in this pull request? The LOAD DATA command resolves the partition column name in a case-sensitive manner, whereas in the insert command the partition column name is resolved using the SQLConf resolver, which resolves names based on the `spark.sql.caseSensitive` property. The same logic can be applied for resolving the partition column names in the LOAD command. ### Why are the changes needed? It's to handle the partition column name correctly according to the configuration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT and manual testing. Closes #24903 from sujith71955/master_paritionColName. Lead-authored-by: s71955 <sujithchacko.2010@gmail.com> Co-authored-by: sujith71955 <sujithchacko.2010@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 01:11:48 -07:00
HyukjinKwon	40485f4656	[SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan ### What changes were proposed in this pull request? This PR proposes to avoid the abstract classes introduced at https://github.com/apache/spark/pull/24965 and instead use a trait and an object. - `abstract class BaseArrowPythonRunner` -> `trait PythonArrowOutput` to allow mix-in Before: ``` BasePythonRunner ├── BaseArrowPythonRunner │ ├── ArrowPythonRunner │ └── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` After: ``` └── BasePythonRunner ├── ArrowPythonRunner ├── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` - `abstract class BasePandasGroupExec ` -> `object PandasGroupUtils` to decouple Before: ``` └── BasePandasGroupExec ├── FlatMapGroupsInPandasExec └── FlatMapCoGroupsInPandasExec ``` After: ``` ├── FlatMapGroupsInPandasExec └── FlatMapCoGroupsInPandasExec ``` ### Why are the changes needed? The problem is that the R code path is being matched with the Python side: Python: ``` └── BasePythonRunner ├── ArrowPythonRunner ├── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` R: ``` └── BaseRRunner ├── ArrowRRunner └── RRunner ``` I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. The internal implementation is also intentionally similar. The `BasePandasGroupExec` case is similar as well. R (with Arrow optimization, in particular) has some duplicated code with Pandas UDFs. `FlatMapGroupsInRWithArrowExec` <> `FlatMapGroupsInPandasExec` `MapPartitionsInRWithArrowExec` <> `ArrowEvalPythonExec` In order to prepare for deduplication here as well, it might be better to avoid changing the hierarchy on the Python side alone. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Locally tested existing tests. Jenkins tests should verify this too. Closes #25989 from HyukjinKwon/SPARK-29317. 
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-03 16:42:37 +09:00
Henry D	51d6ba7490	[SPARK-28962][SQL] Provide index argument to filter lambda functions ### What changes were proposed in this pull request? Lambda functions to array `filter` can now take as input the index as well as the element. This behavior matches array `transform`. ### Why are the changes needed? See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays. ### Does this PR introduce any user-facing change? Previously filter lambdas had to look like `filter(arr, el -> whatever)` Now, lambdas can take an index argument as well `filter(array, (el, idx) -> whatever)` ### How was this patch tested? I added unit tests to `HigherOrderFunctionsSuite`. Closes #25666 from henrydavidge/filter-idx. Authored-by: Henry D <henrydavidge@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-02 13:03:06 -07:00
Nik Vanderhoof	730a17823f	[SPARK-27297][SQL] Add higher order functions to scala API ## What changes were proposed in this pull request? There is currently no existing Scala API equivalent for the higher order functions introduced in Spark 2.4.0. * transform * aggregate * filter * exists * forall * zip_with * map_zip_with * map_filter * transform_values * transform_keys Equivalent column based functions should be added to the Scala API for org.apache.spark.sql.functions with the following signatures: ```scala def transform(column: Column, f: Column => Column): Column = ??? def transform(column: Column, f: (Column, Column) => Column): Column = ??? def exists(column: Column, f: Column => Column): Column = ??? def filter(column: Column, f: Column => Column): Column = ??? def aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column, finish: Column => Column): Column = ??? def aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column): Column = ??? def zip_with( left: Column, right: Column, f: (Column, Column) => Column): Column = ??? def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ??? def transform_values(expr: Column, f: (Column, Column) => Column): Column = ??? def map_filter(expr: Column, f: (Column, Column) => Column): Column = ??? def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column = ??? ``` ## How was this patch tested? I've mimicked the existing tests for the higher order functions in `org.apache.spark.sql.DataFrameFunctionsSuite` that use `expr` to test the higher order functions. 
As an example of an existing test: ```scala test("map_zip_with function - map of primitive types") { val df = Seq( (Map(8 -> 6L, 3 -> 5L, 6 -> 2L), Map[Integer, Integer]((6, 4), (8, 2), (3, 2))), (Map(10 -> 6L, 8 -> 3L), Map[Integer, Integer]((8, 4), (4, null))), (Map.empty[Int, Long], Map[Integer, Integer]((5, 1))), (Map(5 -> 1L), null) ).toDF("m1", "m2") checkAnswer(df.selectExpr("map_zip_with(m1, m2, (k, v1, v2) -> k == v1 + v2)"), Seq( Row(Map(8 -> true, 3 -> false, 6 -> true)), Row(Map(10 -> null, 8 -> false, 4 -> null)), Row(Map(5 -> null)), Row(null))) } ``` I've added this test that performs the same logic, but with the new column based API I've added. ```scala checkAnswer(df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2)), Seq( Row(Map(8 -> true, 3 -> false, 6 -> true)), Row(Map(10 -> null, 8 -> false, 4 -> null)), Row(Map(5 -> null)), Row(null))) ``` Closes #24232 from nvander1/feature/add_higher_order_functions_to_scala_api. Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Co-authored-by: Nik <nikolasrvanderhoof@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-02 12:53:39 -07:00
Terry Kim	f2ead4d0b5	[SPARK-28970][SQL] Implement USE CATALOG/NAMESPACE for Data Source V2 ### What changes were proposed in this pull request? This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#) It also exposes `currentCatalog` in `CatalogManager`. Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog). ### Why are the changes needed? There is currently no mechanism to change the current catalog/namespace through SQL commands. ### Does this PR introduce any user-facing change? Yes, you can perform the following: ```scala // Sets the current catalog to 'testcat' spark.sql("USE CATALOG testcat") // Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'. spark.sql("USE ns1.ns2 IN testcat") // Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace. spark.sql("SHOW NAMESPACES") ``` ### How was this patch tested? Added new unit tests. Closes #25771 from imback82/use_namespace. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-02 21:55:21 +08:00
Maxim Gekk	3b1674cb1f	[SPARK-29313][SQL] Fix failure on writing to `noop` in benchmarks ### What changes were proposed in this pull request? In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks: - JsonBenchmark - CSVBenchmark - UDFBenchmark - MakeDateTimeBenchmark - ExtractBenchmark - DateTimeBenchmark - NestedSchemaPruningBenchmark ### Why are the changes needed? Otherwise writing to `noop` fails with: ``` [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.; [error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284) ``` most likely due to https://github.com/apache/spark/pull/25876 ### Does this PR introduce any user-facing change? No ### How was this patch tested? I generated results of `ExtractBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark" ``` Closes #25988 from MaxGekk/noop-overwrite-mode. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-01 21:04:56 -07:00
Maxim Gekk	e13880128d	[SPARK-29311][SQL] Return seconds with fraction from `date_part()` and `extract` ### What changes were proposed in this pull request? Added new expression `SecondWithFraction` which produces the `seconds` part of timestamps/dates with fractional part containing microseconds. This expression is used only in the `DatePart` expression. As the result, `date_part()` and `extract` return seconds and microseconds as the fractional part of the seconds part when `field` is `SECOND` (or synonyms). ### Why are the changes needed? The `date_part()` and `extract` were added to maintain feature parity with PostgreSQL which has different behavior for the `SECOND` value of the `field` parameter. The fix is needed to behave in the same way. Here is PostgreSQL's output: ```sql # SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001'); date_part ----------- 1.000001 (1 row) ``` ### Does this PR introduce any user-facing change? Yes, type of `date_part('SECOND', ...)` is changed from `INT` to `DECIMAL(8, 6)`. Before: ```sql spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001'); 1 ``` After: ```sql spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001'); 1.000001 ``` ### How was this patch tested? - Added new tests to `DateExpressionSuite` for the `SecondWithFraction` expression - Regenerated results of `date_part.sql`, `extract.sql` and `timestamp.sql` - Updated results of `ExtractBenchmark` Closes #25986 from MaxGekk/extract-seconds-from-timestamp. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-02 11:16:31 +09:00
angerszhu	0cf2f48dfe	[SPARK-29022][SQL] Fix SparkSQLCLI can not add jars by AddJarCommand ### What changes were proposed in this pull request? For the issue mentioned in [SPARK-29022](https://issues.apache.org/jira/browse/SPARK-29022): the Spark SQL CLI can't use a class from a jar added by SQL `ADD JAR` as a serde class. Creating a table whose `serde` class is contained in a jar added by SQL `ADD JAR` succeeds, because `HiveClientImpl.createTable` is called under the `withHiveState` method, which adds `clientLoader.classLoader` to `HiveClientImpl.state.getConf.classLoader`. Jars added by SQL `ADD JAR` are added to 1. `sparkSession.sharedState.jarClassLoader`. 2. `HiveClientLoader.clientLoader.classLoader`. In the current spark-sql mode, `HiveClientImpl.state` uses the CliSessionState created when SparkSQLCliDriver is initialized. When we select data from the table, the `serde` class is checked: the method `HiveTableScanExec#addColumnMetadataToConf()` is called and instantiates the table desc serde class. ``` val deserializer = tableDesc.getDeserializerClass.getConstructor().newInstance() deserializer.initialize(hiveConf, tableDesc.getProperties) ``` `getDeserializer` uses the classLoader of CliSessionState's hiveConf in `Spark SQL CLI` mode. But when we call `ADD JAR` in Spark, the jar is not added to the classLoader of CliSessionState's conf, so a `ClassNotFound` error happens. So we reset the classLoader of CliSessionState's conf to `sharedState.jarClassLoader` when `sharedState.jarClassLoader` has added the jars passed by `HIVEAUXJARS`. Then, when we use `ADD JAR` to add a jar, the jar path is added to the classLoader of CliSessionState's conf. ### Why are the changes needed? Fix bug ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT Closes #25729 from AngersZhuuuu/SPARK-29015. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-01 10:09:29 -05:00
Dongjoon Hyun	bd031c2173	[SPARK-29307][BUILD][TESTS] Remove scalatest deprecation warnings ### What changes were proposed in this pull request? This PR aims to remove `scalatest` deprecation warnings with the following changes. - `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar` - `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser` - `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers` - `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks` ### Why are the changes needed? According to the Jenkins logs, there are 118 warnings about this. ``` grep "is deprecated" ~/consoleText | grep scalatest | wc -l 118 ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? After Jenkins passes, we need to check the Jenkins log. Closes #25982 from dongjoon-hyun/SPARK-29307. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 21:00:11 -07:00
Jeff Evans	d841b33ba3	[SPARK-25153][SQL] Improve error messages for columns with dots/periods ### What changes were proposed in this pull request? Check schema fields to see if they contain the exact column name, and add to the error message in `Dataset#resolve`. Add a test for the extra error message piece. Adds an additional check in `Dataset#resolve`, in the else clause (i.e. column not resolved), that appends a suffix to the error message for the `AnalysisException` if that column name is literally found in the schema fields, to suggest to the user that it might need to be quoted via backticks. ### Why are the changes needed? Forgetting to quote such column names is a common occurrence for new Spark users. ### Does this PR introduce any user-facing change? No (other than the extra suffix on the error message). ### How was this patch tested? `test` was run for `core` in `sbt`, and passed. Closes #25807 from jeff303/SPARK-25153. Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2019-09-30 18:34:44 -07:00
Dongjoon Hyun	a0b3d7a323	[SPARK-29300][TESTS] Compare `catalyst` and `avro` module benchmark in JDK8/11 ### What changes were proposed in this pull request? This PR regenerates the benchmark results in the `catalyst` and `avro` modules in order to compare the JDK8/JDK11 results. ### Why are the changes needed? This PR aims to verify that there is no regression on JDK11. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only update. We need to run the benchmark manually. Closes #25972 from dongjoon-hyun/SPARK-29300. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 17:59:43 -07:00
Sean Owen	e1ea806b30	[SPARK-29291][CORE][SQL][STREAMING][MLLIB] Change procedure-like declaration to function + Unit for 2.13 ### What changes were proposed in this pull request? Scala 2.13 emits a deprecation warning for procedure-like declarations: ``` def foo() { ... ``` This is equivalent to the following, so should be changed to avoid a warning: ``` def foo(): Unit = { ... ``` ### Why are the changes needed? It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files. Unfortunately, that makes this quite a big change. ### Does this PR introduce any user-facing change? No behavior change at all. ### How was this patch tested? Existing tests. Closes #25968 from srowen/SPARK-29291. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 10:03:23 -07:00
Chris Martin	76791b89f5	[SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon. Specifically: - Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated. - Added pydoc, including example, for the use case whereby the user supplies a cogrouping function including a key. - Added the boilerplate for doctests to cogroup.py. Note that cogroup.py only contains the apply() function which has doctests disabled as per the other Pandas Udfs. - Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package. - Some minor formatting tweaks. This was tested by running the appropriate unit tests. I'm unsure as to how to check that my change will cause the documentation to be generated correctly, but if someone can describe how I can do this I'd be happy to check. Closes #25939 from d80tb7/SPARK-27463-fixes. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-30 22:25:35 +09:00
Jungtaek Lim (HeartSaVioR)	39eb79ac4b	[SPARK-28074][SS] Log warn message on possible correctness issue for multiple stateful operations in single query ## What changes were proposed in this pull request? Please refer to [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) for the rationale of this patch. This patch adds the functionality to detect the possible correctness issue on multiple stateful operations in a single streaming query and logs a warning message to inform end users. This patch also documents some notes on the caveats of using multiple stateful operations in a single query, and provides one known alternative. ## How was this patch tested? Added new UTs in UnsupportedOperationsSuite to test various combinations of stateful operators in a streaming query. Closes #24890 from HeartSaVioR/SPARK-28074. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-30 08:18:23 -05:00
Liang-Chi Hsieh	dd92e15301	[SPARK-29186][SQL] AliasIdentifier should be converted to Json in prettyJson ### What changes were proposed in this pull request? This patch adds AliasIdentifier to the list of classes that should be converted to Json in TreeNode.shouldConvertToJson. ### Why are the changes needed? When asking for the prettyJson of an analyzed query plan that contains SubqueryAlias, the name field of SubqueryAlias is "null", like: ``` [ { "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias", "num-children" : 1, "name" : null, "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.Project", ... ``` It seems the alias name was in the Json before SPARK-19602. It is fixed by this patch: ``` [ { "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias", "num-children" : 1, "name" : { "product-class" : "org.apache.spark.sql.catalyst.AliasIdentifier", "identifier" : "t1" }, "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.Project", ... ``` ### Does this PR introduce any user-facing change? Yes. This patch changes the null value of the name field of SubqueryAlias in the Json string to the alias identifier. ### How was this patch tested? Added unit test. Closes #25959 from viirya/SPARK-29186. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-09-29 20:00:13 -07:00
Unknown	3ea9d6825b	[SPARK-29019][WEBUI] Improve tooltip JDBC/ODBC Server tab ### What changes were proposed in this pull request? Some of the columns of the JDBC/ODBC server tab in the Web UI are hard to understand. We have documented it at SPARK-28373 but I think it is better to have some tooltips in the SQL statistics table to explain the columns ![image](https://user-images.githubusercontent.com/12819544/64489775-38e48980-d257-11e9-868a-5f5f6a0f1e46.png) The columns with new tooltips are finish time, close time, execution time and duration ![image](https://user-images.githubusercontent.com/12819544/64489858-1141f100-d258-11e9-9e4e-fae3299da465.png) Improvements in UIUtils can be used in other tables in WebUI to add tooltips ### Why are the changes needed? It is interesting to improve the understanding of the WebUI ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests are added and manual test. Closes #25723 from planga82/feature/SPARK-29019_tooltipjdbcServer. Lead-authored-by: Unknown <soypab@gmail.com> Co-authored-by: Pablo <soypab@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-29 18:34:24 -05:00
angerszhu	1d4b2f010b	[SPARK-29247][SQL] Redact sensitive information in when construct HiveClientHive.state ### What changes were proposed in this pull request? HiveClientImpl may log sensitive information, e.g. url, secret and token: ```scala logDebug( s""" |Applying Hadoop/Hive/Spark and extra properties to Hive Conf: |$k=${if (k.toLowerCase(Locale.ROOT).contains("password")) "xxx" else v} """.stripMargin) ``` So redact it. Use SQLConf.get.redactOptions. I added a new overload to handle redacting one key-value pair at a time. ### Why are the changes needed? Redact sensitive information when constructing HiveClientImpl ### Does this PR introduce any user-facing change? No ### How was this patch tested? MT: run the command ` /sbin/start-thriftserver.sh`. In the log we can get ``` 19/09/28 08:27:02 main DEBUG HiveClientImpl: Applying Hadoop/Hive/Spark and extra properties to Hive Conf: hive.druid.metadata.password=*********(redacted) ``` Closes #25954 from AngersZhuuuu/SPARK-29247. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-29 14:30:32 -07:00
Yuming Wang	31700116d2	[SPARK-28476][SQL] Support ALTER DATABASE SET LOCATION ### What changes were proposed in this pull request? Support the syntax of `ALTER (DATABASE|SCHEMA) database_name SET LOCATION` path. Please note that only the Hive 3.x metastore supports this syntax. Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL https://issues.apache.org/jira/browse/HIVE-8472 ### Why are the changes needed? Support more syntax. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #25883 from wangyum/SPARK-28476. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-29 11:31:49 -07:00
TomokoKomiyama	67d5b9b157	[SPARK-29172][SQL] Fix some exception issue of explain commands ### What changes were proposed in this pull request? Added try exception ### Why are the changes needed? The behavior of run commands during exception handling differs depending on the explain command. I think it should be unified. [ >spark.sql("explain cost select * from hoge").show(false) ] ![cost](https://user-images.githubusercontent.com/55128575/65225389-09a80500-db00-11e9-9246-0f1a3a881595.png) [ >spark.sql("explain extended select * from hoge").show(false) ] ![extemded](https://user-images.githubusercontent.com/55128575/65225430-188eb780-db00-11e9-99bf-ff550b2ffd12.png) ### Does this PR introduce any user-facing change? No ### How was this patch tested? tested manually Closes #25848 from TomokoKomiyama/fix-explain. Authored-by: TomokoKomiyama <btkomiyamatm@oss.nttdata.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-29 10:41:57 -05:00
Yuming Wang	8167714cab	[SPARK-27831][FOLLOW-UP][SQL][TEST] Should not use maven to add Hive test jars ### What changes were proposed in this pull request? This PR moves Hive test jars(`hive-contrib-.jar` and `hive-hcatalog-core-.jar`) from maven dependency to local file. ### Why are the changes needed? `--jars` can't be tested since `hive-contrib-.jar` and `hive-hcatalog-core-.jar` are already in classpath. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? manual test Closes #25690 from wangyum/SPARK-27831-revert. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-28 16:55:49 -07:00
Maxim Gekk	2409320d8f	[SPARK-29237][SQL][FOLLOWUP] Ignore `SET` commands in expression examples while checking the _FUNC_ pattern ### What changes were proposed in this pull request? The `SET` commands do not contain the `_FUNC_` pattern a priori. In the PR, I propose to filter out such commands in the `using _FUNC_ instead of function names in examples` test. ### Why are the changes needed? After the merge of https://github.com/apache/spark/pull/25942, examples will require particular settings. Currently, the whole expression example has to be ignored, which is excessive. It makes sense to ignore only `SET` commands in expression examples. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the `using _FUNC_ instead of function names in examples` test. Closes #25958 from MaxGekk/dont-check-_FUNC_-in-set. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-29 08:51:47 +09:00
Jungtaek Lim (HeartSaVioR)	94946e4836	[SPARK-29281][SQL] Correct example of Like/RLike to test the original intention correctly ### What changes were proposed in this pull request? This patch fixes the examples of Like/RLike to test their original intention correctly. The example doesn't consider the default value of spark.sql.parser.escapedStringLiterals: it's false by default. Please take a look at the current example of Like: `d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)` If spark.sql.parser.escapedStringLiterals=false, then it should fail as there's `\U` in pattern (spark.sql.parser.escapedStringLiterals=false by default) but it doesn't fail. ``` The escape character is '\'. If an escape character precedes a special symbol or another escape character, the following character is matched literally. It is invalid to escape any other character. ``` For the query ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%'; ``` SQL parser removes single `\` (not sure that is intended) so the expressions of Like are constructed as follows (I've printed out the left and right expressions for Like/RLike): > LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%` which no longer carry the original intention (see left). The query below tests the original intention: ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%'; ``` > LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%` Note that `\\\\` is needed in pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent normal character of `\`. 
Same for RLIKE: ``` SET spark.sql.parser.escapedStringLiterals=true; SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.'; ``` > RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.` which is OK, but ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.'; ``` > RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.` which no longer has the original intention. The query below tests the original intention: ``` SET spark.sql.parser.escapedStringLiterals=true; SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.'; ``` > RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.` ### Why are the changes needed? Because the example doesn't test the original intention. Spark is now running automated tests from these examples, so this is now not only a documentation issue but also a test issue. ### Does this PR introduce any user-facing change? No, as it only corrects documentation. ### How was this patch tested? Added debug log (like above) and ran queries from `spark-sql`. Closes #25957 from HeartSaVioR/SPARK-29281. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-29 03:05:49 +09:00
Maxim Gekk	ece4213176	[SPARK-21914][FOLLOWUP][TEST-HADOOP3.2][TEST-JAVA11] Clone SparkSession per each function example ### What changes were proposed in this pull request? In the PR, I propose to clone the Spark session for each expression example. Examples can modify SQL settings, and can influence each other if they run in the same Spark session in parallel. ### Why are the changes needed? This should fix test failures like [this](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/478/testReport/junit/org.apache.spark.sql/SQLQuerySuite/check_outputs_of_expression_examples/) in the check of the `Like` example: ``` org.apache.spark.sql.AnalysisException: the pattern '\%SystemDrive\%\Users%' is invalid, the escape character is not allowed to precede 'U'; at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:48) at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:57) at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:108) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `check outputs of expression examples` in `org.apache.spark.sql.SQLQuerySuite` Closes #25956 from MaxGekk/fix-expr-examples-checks. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-29 02:57:55 +09:00
Jungtaek Lim (HeartSaVioR)	d72f39897b	[SPARK-27254][SS] Cleanup complete but invalid output files in ManifestFileCommitProtocol if job is aborted ## What changes were proposed in this pull request? SPARK-27210 enables ManifestFileCommitProtocol to clean up incomplete output files in task level if task is aborted. This patch extends the area of cleaning up, proposes ManifestFileCommitProtocol to clean up complete but invalid output files in job level if job aborts. Please note that this works as 'best-effort', not kind of guarantee, as we have in HadoopMapReduceCommitProtocol. ## How was this patch tested? Added UT. Closes #24186 from HeartSaVioR/SPARK-27254. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-09-27 12:35:26 -07:00
angerszhu	cc852d4eec	[SPARK-29015][SQL][TEST-HADOOP3.2] Reset class loader after initializing SessionState for built-in Hive 2.3 ### What changes were proposed in this pull request? Hive 2.3 will set a new UDFClassLoader to hiveConf.classLoader when initializing SessionState since HIVE-11878, and 1. ADDJarCommand will add jars to clientLoader.classLoader. 2. --jar passed jar will be added to clientLoader.classLoader 3. jar passed by hive conf `hive.aux.jars` [SPARK-28954](https://github.com/apache/spark/pull/25653) [SPARK-28840](https://github.com/apache/spark/pull/25542) will be added to clientLoader.classLoader too For these reasons we cannot load the jars added by ADDJarCommand, because the class loader has changed. We reset it to clientLoader.ClassLoader here. ### Why are the changes needed? Support for JDK 11. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? UT ``` export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt -Phive-thriftserver -Phadoop-3.2 hive/test-only HiveSparkSubmitSuite -- -z "SPARK-8368: includes jars passed in through --jars" hive-thriftserver/test-only HiveThriftBinaryServerSuite -- -z "test add jar" ``` Closes #25775 from AngersZhuuuu/SPARK-29015-STS-JDK11. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-27 10:23:56 -05:00
Maxim Gekk	4dd0066d40	[SPARK-21914][SQL][TESTS] Check results of expression examples ### What changes were proposed in this pull request? New test compares outputs of expression examples in comments with results of `hiveResultString()`. Also I fixed existing examples where actual and expected outputs are different. ### Why are the changes needed? This prevents mistakes in expression examples, and fixes existing mistakes in comments. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add new test to `SQLQuerySuite`. Closes #25942 from MaxGekk/run-expr-examples. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-27 21:30:37 +09:00
Wang Shuo	bd28e8e179	[SPARK-29213][SQL] Generate extra IsNotNull predicate in FilterExec ### What changes were proposed in this pull request? Currently the behavior of getting output and generating null checks in `FilterExec` is different. Thus some nullable attributes could be treated as not nullable by mistake. In `FilterExec.output`, an attribute is marked as nullable or not by finding its `exprId` in notNullAttributes: ``` a.nullable && notNullAttributes.contains(a.exprId) ``` But in `FilterExec.doConsume`, whether a `nullCheck` is generated for a predicate is decided by whether there is a semantically equal not-null predicate: ``` val nullChecks = c.references.map { r => val idx = notNullPreds.indexWhere { n => n.asInstanceOf[IsNotNull].child.semanticEquals(r)} if (idx != -1 && !generatedIsNotNullChecks(idx)) { generatedIsNotNullChecks(idx) = true // Use the child's output. The nullability is what the child produced. genPredicate(notNullPreds(idx), input, child.output) } else { "" } }.mkString("\n").trim ``` An NPE occurs when running the SQL below: ``` sql("create table table1(x string)") sql("create table table2(x bigint)") sql("create table table3(x string)") sql("insert into table2 select null as x") sql( """ |select t1.x |from ( | select x from table1) t1 |left join ( | select x from ( | select x from table2 | union all | select substr(x,5) x from table3 | ) a | where length(x)>0 |) t3 |on t1.x=t3.x """.stripMargin).collect() ``` NPE Exception: ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:40) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:135) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:449) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:452) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` The generated code: ``` == Subtree 4 / 5 == (2) Project [cast(x#7L as string) AS x#9] +- (2) Filter ((length(cast(x#7L as string)) > 0) AND isnotnull(cast(x#7L as string))) +- Scan hive default.table2 [x#7L], HiveTableRelation `default`.`table2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#7L] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage2(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=2 /* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator inputadapter_input_0; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; /* 011 */ /* 012 */ public GeneratedIteratorForCodegenStage2(Object[] references) { /* 013 */ this.references = references; /* 014 */ } /* 015 */ /* 016 */ public void init(int index, scala.collection.Iterator[] inputs) { 
/* 017 */ partitionIndex = index; /* 018 */ this.inputs = inputs; /* 019 */ inputadapter_input_0 = inputs[0]; /* 020 */ filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 021 */ filter_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32); /* 022 */ /* 023 */ } /* 024 */ /* 025 */ protected void processNext() throws java.io.IOException { /* 026 */ while ( inputadapter_input_0.hasNext()) { /* 027 */ InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next(); /* 028 */ /* 029 */ do { /* 030 */ boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0); /* 031 */ long inputadapter_value_0 = inputadapter_isNull_0 ? /* 032 */ -1L : (inputadapter_row_0.getLong(0)); /* 033 */ /* 034 */ boolean filter_isNull_2 = inputadapter_isNull_0; /* 035 */ UTF8String filter_value_2 = null; /* 036 */ if (!inputadapter_isNull_0) { /* 037 */ filter_value_2 = UTF8String.fromString(String.valueOf(inputadapter_value_0)); /* 038 */ } /* 039 */ int filter_value_1 = -1; /* 040 */ filter_value_1 = (filter_value_2).numChars(); /* 041 */ /* 042 */ boolean filter_value_0 = false; /* 043 */ filter_value_0 = filter_value_1 > 0; /* 044 */ if (!filter_value_0) continue; /* 045 */ /* 046 */ boolean filter_isNull_6 = inputadapter_isNull_0; /* 047 */ UTF8String filter_value_6 = null; /* 048 */ if (!inputadapter_isNull_0) { /* 049 */ filter_value_6 = UTF8String.fromString(String.valueOf(inputadapter_value_0)); /* 050 */ } /* 051 */ if (!(!filter_isNull_6)) continue; /* 052 */ /* 053 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 054 */ /* 055 */ boolean project_isNull_0 = false; /* 056 */ UTF8String project_value_0 = null; /* 057 */ if (!false) { /* 058 */ project_value_0 = UTF8String.fromString(String.valueOf(inputadapter_value_0)); /* 059 */ } /* 060 */ filter_mutableStateArray_0[1].reset(); /* 061 */ /* 062 */ filter_mutableStateArray_0[1].zeroOutNullBytes(); /* 063 */ /* 064 */ if (project_isNull_0) { /* 065 */ 
filter_mutableStateArray_0[1].setNullAt(0); /* 066 */ } else { /* 067 */ filter_mutableStateArray_0[1].write(0, project_value_0); /* 068 */ } /* 069 */ append((filter_mutableStateArray_0[1].getRow())); /* 070 */ /* 071 */ } while(false); /* 072 */ if (shouldStop()) return; /* 073 */ } /* 074 */ } /* 075 */ /* 076 */ } ``` This PR proposes to use semantic comparison both in `FilterExec.output` and `FilterExec.doConsume` for nullable attribute. With this PR, the generated code snippet is below: ``` == Subtree 2 / 5 == (3) Project [substring(x#8, 5, 2147483647) AS x#5] +- (3) Filter ((length(substring(x#8, 5, 2147483647)) > 0) AND isnotnull(substring(x#8, 5, 2147483647))) +- Scan hive default.table3 [x#8], HiveTableRelation `default`.`table3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#8] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage3(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=3 /* 006 */ final class GeneratedIteratorForCodegenStage3 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator inputadapter_input_0; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; /* 011 */ /* 012 */ public GeneratedIteratorForCodegenStage3(Object[] references) { /* 013 */ this.references = references; /* 014 */ } /* 015 */ /* 016 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 017 */ partitionIndex = index; /* 018 */ this.inputs = inputs; /* 019 */ inputadapter_input_0 = inputs[0]; /* 020 */ filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32); /* 021 */ filter_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32); /* 022 */ /* 023 */ } /* 024 */ 
/* 025 */ protected void processNext() throws java.io.IOException { /* 026 */ while ( inputadapter_input_0.hasNext()) { /* 027 */ InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next(); /* 028 */ /* 029 */ do { /* 030 */ boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0); /* 031 */ UTF8String inputadapter_value_0 = inputadapter_isNull_0 ? /* 032 */ null : (inputadapter_row_0.getUTF8String(0)); /* 033 */ /* 034 */ boolean filter_isNull_0 = true; /* 035 */ boolean filter_value_0 = false; /* 036 */ boolean filter_isNull_2 = true; /* 037 */ UTF8String filter_value_2 = null; /* 038 */ /* 039 */ if (!inputadapter_isNull_0) { /* 040 */ filter_isNull_2 = false; // resultCode could change nullability. /* 041 */ filter_value_2 = inputadapter_value_0.substringSQL(5, 2147483647); /* 042 */ /* 043 */ } /* 044 */ boolean filter_isNull_1 = filter_isNull_2; /* 045 */ int filter_value_1 = -1; /* 046 */ /* 047 */ if (!filter_isNull_2) { /* 048 */ filter_value_1 = (filter_value_2).numChars(); /* 049 */ } /* 050 */ if (!filter_isNull_1) { /* 051 */ filter_isNull_0 = false; // resultCode could change nullability. /* 052 */ filter_value_0 = filter_value_1 > 0; /* 053 */ /* 054 */ } /* 055 */ if (filter_isNull_0 || !filter_value_0) continue; /* 056 */ boolean filter_isNull_8 = true; /* 057 */ UTF8String filter_value_8 = null; /* 058 */ /* 059 */ if (!inputadapter_isNull_0) { /* 060 */ filter_isNull_8 = false; // resultCode could change nullability. /* 061 */ filter_value_8 = inputadapter_value_0.substringSQL(5, 2147483647); /* 062 */ /* 063 */ } /* 064 */ if (!(!filter_isNull_8)) continue; /* 065 */ /* 066 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 067 */ /* 068 */ boolean project_isNull_0 = true; /* 069 */ UTF8String project_value_0 = null; /* 070 */ /* 071 */ if (!inputadapter_isNull_0) { /* 072 */ project_isNull_0 = false; // resultCode could change nullability. 
/* 073 */ project_value_0 = inputadapter_value_0.substringSQL(5, 2147483647); /* 074 */ /* 075 */ } /* 076 */ filter_mutableStateArray_0[1].reset(); /* 077 */ /* 078 */ filter_mutableStateArray_0[1].zeroOutNullBytes(); /* 079 */ /* 080 */ if (project_isNull_0) { /* 081 */ filter_mutableStateArray_0[1].setNullAt(0); /* 082 */ } else { /* 083 */ filter_mutableStateArray_0[1].write(0, project_value_0); /* 084 */ } /* 085 */ append((filter_mutableStateArray_0[1].getRow())); /* 086 */ /* 087 */ } while(false); /* 088 */ if (shouldStop()) return; /* 089 */ } /* 090 */ } /* 091 */ /* 092 */ } ``` ### Why are the changes needed? Fix NPE bug in FilterExec. ### Does this PR introduce any user-facing change? no ### How was this patch tested? new UT Closes #25902 from wangshuo128/filter-codegen-npe. Authored-by: Wang Shuo <wangshuo128@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-27 15:14:17 +08:00
Yuanjian Li	ada3ad34c6	[SPARK-29175][SQL] Make additional remote maven repository in IsolatedClientLoader configurable ### What changes were proposed in this pull request? Added a new config "spark.sql.additionalRemoteRepositories", a comma-delimited string config of the optional additional remote maven mirror. ### Why are the changes needed? We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars, end-users can set this config if the default maven central repo is unreachable. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #25849 from xuanyuanking/SPARK-29175. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-26 20:57:44 -07:00
uncleGen	570525f886	[SPARK-27715][SQL][UI] SQL query details in UI does not show in correct format ## What changes were proposed in this pull request? before pr: ![image](https://user-images.githubusercontent.com/7402327/57752168-bb7e9180-771a-11e9-8757-63236ecab753.png) after pr: ![image](https://user-images.githubusercontent.com/7402327/57752175-c802ea00-771a-11e9-96fd-aef1890b7985.png) ## How was this patch tested? manual test Closes #24609 from uncleGen/SPARK-27715. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-26 22:52:22 -05:00
Rahij Ramsharan	9f3c82163a	[SPARK-29259][SQL] call fs.exists only when necessary ### What changes were proposed in this pull request? Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand. ### Why are the changes needed? When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive and hence this PR makes that call only when necessary. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests should cover it since this doesn't change the behavior. Closes #25928 from rahij/rr/exists-upstream. Authored-by: Rahij Ramsharan <rramsharan@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-26 15:46:31 -07:00
Gengliang Wang	a1213d5f96	[SPARK-28997][SQL] Add `spark.sql.dialect` ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL. In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect. After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will: 1. perform integral division with the / operator if both sides are integral types; 2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. ### Why are the changes needed? Unify the external database dialect with one configuration, instead of small flags. ### Does this PR introduce any user-facing change? A new configuration `spark.sql.dialect` for choosing a database dialect. ### How was this patch tested? Existing tests. Closes #25697 from gengliangwang/dialect. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 21:00:27 +08:00
Gengliang Wang	66c9dc316a	[SPARK-29255][SQL][TESTS] Rename package pgSQL to postgreSQL ### What changes were proposed in this pull request? Rename the package pgSQL to postgreSQL ### Why are the changes needed? To address the comment in https://github.com/apache/spark/pull/25697#discussion_r328431070 . The official full name seems more reasonable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #25936 from gengliangwang/renamePGSQL. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-26 05:36:15 -07:00
Burak Yavuz	c8159c7941	[SPARK-29197][SQL] Remove saveModeForDSV2 from DataFrameWriter ### What changes were proposed in this pull request? It is very confusing that the default save mode is different between the internal implementation of a Data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table, and therefore create one. That will come in a future PR. ### Why are the changes needed? Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark. ### Does this PR introduce any user-facing change? It changes the default save mode for V2 Tables in the DataFrameWriter APIs ### How was this patch tested? Existing tests Closes #25876 from brkyvz/removeSM. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 15:20:04 +08:00
Liang-Chi Hsieh	b8b59d6fa3	[SPARK-29239][SPARK-29221][SQL] Subquery should not cause NPE when eliminating subexpression ### What changes were proposed in this pull request? This patch proposes to skip PlanExpression when doing subexpression elimination on executors. ### Why are the changes needed? Subexpression elimination can cause an NPE when applied to an execution subquery expression such as ScalarSubquery on executors. This is because PlanExpression wraps a query plan. Comparing query plans on executors when eliminating subexpressions can cause unexpected errors, such as an NPE when accessing transient fields. The NPE looks like: ``` [info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds) [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID 3447, 10.0.0.196, executor driver): java.lang.NullPointerException [info] at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551) [info] at 
org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548) [info] at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36) [info] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425) [info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102) [info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added unit test. Closes #25925 from viirya/SPARK-29239. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 13:55:01 +08:00
Ryan Blue	6a4235aee7	[SPARK-29249][SQL] V2 writer: Don't allow tableProperty for existing tables ### What changes were proposed in this pull request? Don't allow calling append, overwrite, or overwritePartitions after tableProperty is used in DataFrameWriterV2 because table properties are not set as part of operations on existing tables. Only tables that are created or replaced can set table properties. ### Why are the changes needed? The properties are discarded otherwise, so this avoids confusing behavior. ### Does this PR introduce any user-facing change? Yes, but to a new API, DataFrameWriterV2. ### How was this patch tested? Removed test cases that used this method and the append, etc. methods because they no longer compile. Closes #25931 from rdblue/fix-dfw-v2-table-properties. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 12:41:34 +08:00
Maxim Gekk	21db2f86f7	[SPARK-29237][SQL] Prevent real function names in expression example template ### What changes were proposed in this pull request? In the PR, I propose to replace function names in some expression examples by `_FUNC_`, and add a test to check that `_FUNC_` is always present in all examples. ### Why are the changes needed? Binding of a function name to an expression is performed in `FunctionRegistry`, which is the single source of truth. Expression examples should avoid using function names directly because this can make the examples invalid in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a new test to `SQLQuerySuite` which analyses expression examples and checks for the presence of `_FUNC_`. Closes #25924 from MaxGekk/fix-func-examples. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-25 15:16:00 -07:00
Wenchen Fan	a36a7235db	[SPARK-29215][SQL] current namespace should be tracked in SessionCatalog if the current catalog is session catalog ### What changes were proposed in this pull request? when the current catalog is session catalog, get/set the current namespace from/to the `SessionCatalog`. ### Why are the changes needed? It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`. Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog. Thus, we can't use `CatalogManager` to track the current namespace of session catalog because it changes when the current catalog is changed. To keep single source of truth, we should only track the current namespace of session catalog in `SessionCatalog`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Newly added and updated test cases. Closes #25903 from cloud-fan/current. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-09-25 17:01:36 +08:00
WeichenXu	d8b0914c2e	[SPARK-28957][SQL] Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar" ### What changes were proposed in this pull request? Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar" ### Why are the changes needed? Providing spark side config entry for hive configurations. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT. Closes #25661 from WeichenXu123/add_hive_conf. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-25 15:54:44 +08:00
Yuanjian Li	b3e9be470c	[SPARK-29229][SQL] Change the additional remote repository in IsolatedClientLoader to google mirror ### What changes were proposed in this pull request? Change the remote repo used in IsolatedClientLoader from datanucleus to google mirror. ### Why are the changes needed? We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars. The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is [no longer maintained](http://www.datanucleus.org:15080/downloads/maven2/README.txt). This will cause download failures and make hive test cases flaky while the Jenkins host is blocked by the maven central repo. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #25915 from xuanyuanking/SPARK-29229. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-25 00:49:50 +08:00
Xiao Li	7c02c143aa	[SPARK-28292][SQL] Enable Injection of User-defined Hint ### What changes were proposed in this pull request? Move the rule `RemoveAllHints` after the batch `Resolution`. ### Why are the changes needed? User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test case Closes #25746 from gatorsmile/moveRemoveAllHints. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-24 18:04:17 +08:00
sheepstop	81de9d3c29	[SPARK-28678][DOC] Specify that array indices start at 1 for function slice in R Scala Python ### What changes were proposed in this pull request? Added "array indices start at 1" to the documentation of the function slice in the R, Scala, and Python components to make its usage clear. ### Why are the changes needed? slice throws an exception if the start value is 0, whereas array indices start at 0 in most other contexts. ### Does this PR introduce any user-facing change? Yes, more info provided to user. ### How was this patch tested? No tests added, only doc change. Closes #25704 from sheepstop/master. Authored-by: sheepstop <yangting617@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-24 18:57:54 +09:00
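To illustrate the 1-based convention that the commit above documents, here is a toy Python model. The helper `slice_one_based` is hypothetical; it only covers positive start positions and does not attempt to reproduce the rest of Spark's `slice` behavior.

```python
def slice_one_based(values, start, length):
    """Return `length` elements of `values` beginning at 1-based position `start`.

    Toy model only: start positions below 1 are rejected here, which
    mirrors the point of the doc change (a start of 0 is an error).
    """
    if start < 1:
        raise ValueError("this sketch only models positive, 1-based start positions")
    if length < 0:
        raise ValueError("length must be non-negative")
    begin = start - 1  # translate to Python's 0-based indexing
    return values[begin:begin + length]


first_two_after_head = slice_one_based([1, 2, 3, 4], 2, 2)
```

A caller used to 0-based indexing would expect `start=0` to select the first element; in this model, as in the documented function, it raises instead.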
Yuming Wang	b8b67ae92d	[SPARK-28527][SQL][TEST] Enable ThriftServerQueryTestSuite ### What changes were proposed in this pull request? This PR enables `ThriftServerQueryTestSuite` and fixes the previously flaky test by: 1. Starting the thriftserver in `beforeAll()`. 2. Disabling `spark.sql.hive.thriftServer.async`. ### Why are the changes needed? Improve test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? ```shell build/sbt "hive-thriftserver/test-only *.ThriftServerQueryTestSuite " -Phive-thriftserver build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite test -Phive-thriftserver ``` Closes #25868 from wangyum/SPARK-28527-enable. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-24 00:44:33 -07:00
windpiger	da7e5c4ffb	[SPARK-19917][SQL] qualified partition path stored in catalog ## What changes were proposed in this pull request? The partition path should be qualified before it is stored in the catalog. There are several cases: 1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' should be qualified: file:/path/x Hive 2.0.0 does not support a location without a scheme here. ``` FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. {0} is not absolute or has no scheme information. Please specify a complete absolute uri with scheme information. ``` 2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' should be qualified: file:/tablelocation/x Hive 2.0.0 does not support a relative location here. 3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x' should be qualified: file:/path/x, the same as Hive 2.0.0 4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x' should be qualified: file:/tablelocation/x, the same as Hive 2.0.0 Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde tables has the expected qualified path. We should make the other cases consistent with it. Another change is for alter table location. ## How was this patch tested? Added / modified existing test cases. Closes #17254 from windpiger/qualifiedPartitionPath. Authored-by: windpiger <songjun@outlook.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-24 14:48:47 +08:00
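The four cases listed in the commit above can be sketched as a small path-qualification helper. This is an illustrative model under stated assumptions: `qualify` is a hypothetical name, the default scheme is taken to be `file`, and real Spark delegates this work to the Hadoop filesystem API rather than to string handling.

```python
import posixpath
from urllib.parse import urlparse


def qualify(location, table_location, default_scheme="file"):
    """Return `location` with a scheme, resolving relative paths against the table.

    Assumptions of this sketch: paths use "/" separators and the
    default filesystem scheme is "file".
    """
    if urlparse(location).scheme:
        return location  # already qualified, e.g. "hdfs://nn/path/x"
    if not location.startswith("/"):
        # Relative locations are resolved against the table location.
        location = posixpath.join(table_location, location)
    return f"{default_scheme}:{location}"


absolute = qualify("/path/x", "/tablelocation")
relative = qualify("x", "/tablelocation")
```

Both SET LOCATION and ADD PARTITION would go through the same helper in this model, which is the consistency the commit asks for.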
Yuming Wang	0c40b94ae5	[SPARK-29203][SQL][TESTS] Reduce shuffle partitions in SQLQueryTestSuite ### What changes were proposed in this pull request? This PR reduces shuffle partitions from 200 to 4 in `SQLQueryTestSuite` to reduce testing time. ### Why are the changes needed? Reduce testing time. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested locally: Before: ``` ... [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds) [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds) [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 763 milliseconds) ... Run completed in 1 hour, 22 minutes. ``` After: ``` ... [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds) [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds) [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 360 milliseconds) ... Run completed in 47 minutes. ``` Closes #25891 from wangyum/SPARK-29203. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-23 08:38:40 -07:00
angerszhu	d22768a6be	[SPARK-29036][SQL] SparkThriftServer cancel job after execute() thread interrupted ### What changes were proposed in this pull request? Discussed in https://github.com/apache/spark/pull/25611 If cancel() and close() are called very quickly after the query is started, they may both call cleanup() before the Spark jobs are started, so sqlContext.sparkContext.cancelJobGroup(statementId) does nothing. The execute thread can then start the jobs and only afterwards get interrupted and exit. In that case no one cancels these jobs, and they keep running even though the execution has exited. So when execute() is interrupted by `cancel()` and enters the catch block, we should call cancelJobGroup again to make sure the jobs are canceled. ### Why are the changes needed? ### Does this PR introduce any user-facing change? NO ### How was this patch tested? MT Closes #25743 from AngersZhuuuu/SPARK-29036. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-23 05:47:25 -07:00
xy_xin	655356e825	[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan ### What changes were proposed in this pull request? This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement: ``` UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate? ``` ### Why are the changes needed? With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New test cases added. Closes #25626 from xianyinxin/SPARK-28892. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-23 19:25:56 +08:00
Takeshi Yamamuro	7a2ea58e78	[SPARK-29084][SQL][TESTS] Check method bytecode size in BenchmarkQueryTest ### What changes were proposed in this pull request? This pr proposes to check method bytecode size in `BenchmarkQueryTest`. This metric is critical for performance numbers. ### Why are the changes needed? For performance checks ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25788 from maropu/CheckMethodSize. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-22 14:47:42 -07:00
Yuming Wang	51d3509428	[SPARK-28599][SQL] Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage ### What changes were proposed in this pull request? This PR add support sorting `Execution Time` and `Duration` columns for `ThriftServerSessionPage`. ### Why are the changes needed? Previously, it's not sorted correctly. ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? Manually do the following and test sorting on those columns in the Spark Thrift Server Session Page. ``` $ sbin/start-thriftserver.sh $ bin/beeline -u jdbc:hive2://localhost:10000 0: jdbc:hive2://localhost:10000> create table t(a int); +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.521 seconds) 0: jdbc:hive2://localhost:10000> select * from t; +----+--+ \| a \| +----+--+ +----+--+ No rows selected (0.772 seconds) 0: jdbc:hive2://localhost:10000> show databases; +---------------+--+ \| databaseName \| +---------------+--+ \| default \| +---------------+--+ 1 row selected (0.249 seconds) ``` Sorted by `Execution Time` column: ![image](https://user-images.githubusercontent.com/5399861/65387476-53038900-dd7a-11e9-885c-fca80287f550.png) Sorted by `Duration` column: ![image](https://user-images.githubusercontent.com/5399861/65387481-6e6e9400-dd7a-11e9-9318-f917247efaa8.png) Closes #25892 from wangyum/SPARK-28599. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-22 14:12:06 -07:00
Dongjoon Hyun	76bc9db749	[SPARK-29191][TESTS][SQL] Add tag ExtendedSQLTest for SQLQueryTestSuite ### What changes were proposed in this pull request? This PR aims to add tag `ExtendedSQLTest` for `SQLQueryTestSuite`. This doesn't affect our Jenkins test coverage. Instead, this tag gives us an ability to parallelize them by splitting this test suite and the other suites. ### Why are the changes needed? `SQLQueryTestSuite` takes 45 mins alone because it has many SQL scripts to run. <img width="906" alt="time" src="https://user-images.githubusercontent.com/9700541/65353553-4af0f100-dba2-11e9-9f2f-386742d28f92.png"> ### Does this PR introduce any user-facing change? No. ### How was this patch tested? ``` build/sbt "sql/test-only *.SQLQueryTestSuite" -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest ... [info] SQLQueryTestSuite: [info] ScalaTest [info] Run completed in 3 seconds, 147 milliseconds. [info] Total number of tests run: 0 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 [info] No tests were executed. [info] Passed: Total 0, Failed 0, Errors 0, Passed 0 [success] Total time: 22 s, completed Sep 20, 2019 12:23:13 PM ``` Closes #25872 from dongjoon-hyun/SPARK-29191. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-22 13:53:21 -07:00
angerszhu	fe4bee8fd8	[SPARK-29162][SQL] Simplify NOT(IsNull(x)) and NOT(IsNotNull(x)) ### What changes were proposed in this pull request? Rewrite ``` NOT isnull(x) -> isnotnull(x) NOT isnotnull(x) -> isnull(x) ``` ### Why are the changes needed? Make LogicalPlan more readable and useful for query canonicalization. Make same condition equal when judge query canonicalization equal ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Newly added UTs. Closes #25878 from AngersZhuuuu/SPARK-29162. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-22 11:17:47 -07:00
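The rewrite in the commit above can be modeled on a toy expression tree. In this sketch expressions are nested tuples such as `("not", ("isnull", ("col", "x")))`; that representation and the function name `simplify` are assumptions made for illustration and are unrelated to Catalyst's actual classes.

```python
def simplify(expr):
    """Rewrite NOT(isnull(x)) to isnotnull(x) and NOT(isnotnull(x)) to isnull(x).

    Expressions are tuples: ("col", name) for a leaf, or (op, child)
    for the unary operators "not", "isnull", and "isnotnull".
    """
    op = expr[0]
    if op == "col":
        return expr
    child = simplify(expr[1])  # simplify bottom-up so nested NOTs collapse too
    if op == "not":
        if child[0] == "isnull":
            return ("isnotnull", child[1])
        if child[0] == "isnotnull":
            return ("isnull", child[1])
    return (op, child)


x = ("col", "x")
rewritten = simplify(("not", ("isnull", x)))
```

Because two differently written predicates end up as the same tree, an equality check on simplified plans treats them as identical, which is the canonicalization benefit the commit mentions.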
Maxim Gekk	051e691029	[SPARK-28141][SQL] Support special date values ### What changes were proposed in this pull request? Supported special string values for `DATE` type. They are simply notational shorthands that will be converted to ordinary date values when read. The following string values are supported: - `epoch [zoneId]` - `1970-01-01` - `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`. - `yesterday [zoneId]` - the current date -1 - `tomorrow [zoneId]` - the current date + 1 - `now` - the date of running the current query. It has the same notion as `today`. For example: ```sql spark-sql> SELECT date 'tomorrow' - date 'yesterday'; 2 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html) ### Does this PR introduce any user-facing change? Previously, the parser fails on the special values with the error: ```sql spark-sql> select date 'today'; Error in query: Cannot parse the DATE value: today(line 1, pos 7) ``` After the changes, the special values are converted to appropriate dates: ```sql spark-sql> select date 'today'; 2019-09-06 ``` ### How was this patch tested? - Added tests to `DateFormatterSuite` to check parsing special values from regular strings. - Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String` - Uncommented tests in `date.sql` Closes #25708 from MaxGekk/datetime-special-values. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-22 17:31:33 +09:00
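As a rough illustration of how the special values listed in the commit above map to ordinary dates, here is a Python sketch. The function `parse_special_date` is hypothetical, takes the current date as an explicit argument, and ignores the optional `zoneId` suffix; it is not Spark's parser.

```python
from datetime import date, timedelta


def parse_special_date(text, today):
    """Map a special date string to a concrete date, or return None.

    `today` is passed in explicitly so the result is deterministic;
    time-zone handling is deliberately left out of this sketch.
    """
    special = {
        "epoch": date(1970, 1, 1),
        "today": today,
        "now": today,  # same notion as "today" per the commit message
        "yesterday": today - timedelta(days=1),
        "tomorrow": today + timedelta(days=1),
    }
    return special.get(text.strip().lower())


reference_day = date(2019, 9, 6)
gap = parse_special_date("tomorrow", reference_day) - parse_special_date("yesterday", reference_day)
```

The difference between `tomorrow` and `yesterday` is two days, matching the `SELECT date 'tomorrow' - date 'yesterday'` example in the commit message.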
Maxim Gekk	89bad267d4	[SPARK-29200][SQL] Optimize `extract`/`date_part` for epoch ### What changes were proposed in this pull request? Refactoring of the `DateTimeUtils.getEpoch()` function by avoiding decimal operations that are pretty expensive, and converting the final result to the decimal type at the end. ### Why are the changes needed? The changes improve performance of the `getEpoch()` method by roughly 20 times. Before: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 256 277 33 39.0 25.6 1.0X EPOCH of timestamp 23455 23550 131 0.4 2345.5 0.0X ``` After: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 255 294 34 39.2 25.5 1.0X EPOCH of timestamp 1049 1054 9 9.5 104.9 0.2X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By the existing tests in `DateExpressionsSuite`. Closes #25881 from MaxGekk/optimize-extract-epoch. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-22 16:59:59 +09:00
Maxim Gekk	3be5741029	[SPARK-29190][SQL] Optimize `extract`/`date_part` for the milliseconds `field` ### What changes were proposed in this pull request? Changed the `DateTimeUtils.getMilliseconds()` by avoiding the decimal division, and replacing it by setting scale and precision while converting microseconds to the decimal type. ### Why are the changes needed? This improves performance of `extract` and `date_part()` by more than 50 times: Before: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 397 428 45 25.2 39.7 1.0X MILLISECONDS of timestamp 36723 36761 63 0.3 3672.3 0.0X ``` After: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 278 284 6 36.0 27.8 1.0X MILLISECONDS of timestamp 592 606 13 16.9 59.2 0.5X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suite - `DateExpressionsSuite` Closes #25871 from MaxGekk/optimize-epoch-millis. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-21 21:11:31 -07:00
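The idea in the commit above, replacing a decimal division with a change of scale, can be shown with Python's `decimal` module. This is only an analogy under stated assumptions: the variable names are illustrative, and Spark's `Decimal` type has its own API that is not reproduced here.

```python
from decimal import Decimal

# Microseconds within the current minute (an illustrative value).
micros = 1_234_567

# Approach 1: build a decimal, then divide by 1000 (a full decimal division).
divided = Decimal(micros) / Decimal(1000)

# Approach 2: build the decimal from the integer and shift its exponent by -3,
# which reinterprets the same digits with a scale of 3 and needs no division.
scaled = Decimal(micros).scaleb(-3)
```

Both approaches give the same millisecond value with a fractional part; the second only adjusts the exponent of an integer coefficient.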
aman_omer	93ac4e1b2d	[SPARK-29053][WEBUI] Sort does not work on some columns ### What changes were proposed in this pull request? Setting custom sort key for duration and execution time column. ### Why are the changes needed? Sorting on duration and execution time columns consider time as a string after converting into readable form which is the reason for wrong sort results as mentioned in [SPARK-29053](https://issues.apache.org/jira/browse/SPARK-29053). ### Does this PR introduce any user-facing change? No ### How was this patch tested? Test manually. Screenshots are attached. After patch: Duration ![Duration](https://user-images.githubusercontent.com/40591404/65339861-93cc9800-dbea-11e9-95e6-63b107a5a372.png) Execution time ![Execution Time](https://user-images.githubusercontent.com/40591404/65339870-97601f00-dbea-11e9-9d1d-690c59bc1bde.png) Closes #25855 from amanomer/SPARK29053. Authored-by: aman_omer <amanomer1996@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-21 07:34:04 -05:00
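The sorting problem described in the commit above comes from comparing formatted durations as strings. The sketch below uses made-up rows and a hypothetical `human` formatter to show why a numeric sort key is needed; it is not the web UI's code.

```python
def human(ms):
    """Format a duration in milliseconds as a short readable string."""
    if ms < 60_000:
        return f"{ms / 1000:g} s"
    return f"{ms / 60_000:g} min"


# (job id, duration in ms): illustrative data only.
rows = [("job-a", 90_000), ("job-b", 5_000), ("job-c", 600_000)]

# Sorting by the displayed text compares characters, so "10 min" sorts before "5 s".
by_text = [job for job, ms in sorted(rows, key=lambda row: human(row[1]))]

# Sorting by the raw millisecond value gives the intended order.
by_ms = [job for job, ms in sorted(rows, key=lambda row: row[1])]
```

Keeping the raw value as the sort key while still displaying the readable string is the same separation the commit introduces for the duration and execution time columns.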
Jungtaek Lim (HeartSaVioR)	f7cc695808	[SPARK-29140][SQL] Handle parameters having "array" of javaType properly in splitAggregateExpressions ### What changes were proposed in this pull request? This patch fixes the issue brought by [SPARK-21870](http://issues.apache.org/jira/browse/SPARK-21870): when generating code for parameter type, it doesn't consider array type in javaType. There is at least one such case: Spark should generate code for BinaryType as `byte[]`, but it generates `[B` instead, and the generated code fails compilation. Below is the generated code which failed compilation (Line 380): ``` /* 380 */ private void agg_doAggregate_count_0([B agg_expr_1_1, boolean agg_exprIsNull_1_1, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_1) throws java.io.IOException { /* 381 */ // evaluate aggregate function for count /* 382 */ boolean agg_isNull_26 = false; /* 383 */ long agg_value_28 = -1L; /* 384 */ if (!false && agg_exprIsNull_1_1) { /* 385 */ long agg_value_31 = agg_unsafeRowAggBuffer_1.getLong(1); /* 386 */ agg_isNull_26 = false; /* 387 */ agg_value_28 = agg_value_31; /* 388 */ } else { /* 389 */ long agg_value_33 = agg_unsafeRowAggBuffer_1.getLong(1); /* 390 */ /* 391 */ long agg_value_32 = -1L; /* 392 */ /* 393 */ agg_value_32 = agg_value_33 + 1L; /* 394 */ agg_isNull_26 = false; /* 395 */ agg_value_28 = agg_value_32; /* 396 */ } /* 397 */ // update unsafe row buffer /* 398 */ agg_unsafeRowAggBuffer_1.setLong(1, agg_value_28); /* 399 */ } ``` There wasn't any test for HashAggregateExec specifically testing this, but randomized test in ObjectHashAggregateSuite could encounter this and that's why ObjectHashAggregateSuite is flaky. ### Why are the changes needed? Without the fix, generated code from HashAggregateExec may fail compilation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new UT. Without the fix, newly added UT fails. Closes #25830 from HeartSaVioR/SPARK-29140. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-21 16:29:23 +09:00
Maxim Gekk	252b6cf3c9	[SPARK-29187][SQL] Return null from `date_part()` for the null `field` ### What changes were proposed in this pull request? In the PR, I propose to change the behavior of the `date_part()` function in handling a `null` field so that it matches PostgreSQL. If the `field` parameter is `null`, the function should return `null` of the `double` type as PostgreSQL does: ```sql # select date_part(null, date '2019-09-20'); date_part ----------- (1 row) # select pg_typeof(date_part(null, date '2019-09-20')); pg_typeof ------------------ double precision (1 row) ``` ### Why are the changes needed? The `date_part()` function was added to maintain feature parity with PostgreSQL but current behavior of the function is different in handling null as `field`. ### Does this PR introduce any user-facing change? Yes. Before: ```sql spark-sql> select date_part(null, date'2019-09-20'); Error in query: null; line 1 pos 7 ``` After: ```sql spark-sql> select date_part(null, date'2019-09-20'); NULL ``` ### How was this patch tested? Added new tests to `DateFunctionsSuite` for 2 cases: - `field` = `null`, `source` = a date literal - `field` = `null`, `source` = a date column Closes #25865 from MaxGekk/date_part-null. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-20 20:28:56 -07:00
Yuanjian Li	abc88deeed	[SPARK-29063][SQL] Modify fillValue approach to support joined dataframe ### What changes were proposed in this pull request? Modify the approach in `DataFrameNaFunctions.fillValue`: the new one uses `df.withColumns`, which only addresses the columns that need to be filled. After this change, no ambiguous fields are detected for a joined dataframe. ### Why are the changes needed? Before this change, when you have a joined table that has the same field name in both original tables, fillna will fail even if you specify a subset that does not include the 'ambiguous' fields. ``` scala> val df1 = Seq(("f1-1", "f2", null), ("f1-2", null, null), ("f1-3", "f2", "f3-1"), ("f1-4", "f2", "f3-1")).toDF("f1", "f2", "f3") scala> val df2 = Seq(("f1-1", null, null), ("f1-2", "f2", null), ("f1-3", "f2", "f4-1")).toDF("f1", "f2", "f4") scala> val df_join = df1.alias("df1").join(df2.alias("df2"), Seq("f1"), joinType="left_outer") scala> df_join.na.fill("", cols=Seq("f4")) org.apache.spark.sql.AnalysisException: Reference 'f2' is ambiguous, could be: df1.f2, df2.f2.; ``` ### Does this PR introduce any user-facing change? Yes, fillna operation will pass and give the right answer for a joined table. ### How was this patch tested? Local test and newly added UT. Closes #25768 from xuanyuanking/SPARK-29063. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-21 08:26:30 +09:00
Holden Karau	42050c3f4f	[SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator ### What changes were proposed in this pull request? This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro bench mark in the examples directory, we may wish to keep this or not. ### Why are the changes needed? In https://issues.apache.org/jira/browse/SPARK-23961 / `5e79ae3b40` we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking. ### Does this PR introduce any user-facing change? A new param is added to toLocalIterator ### How was this patch tested? New unit test inside of `test_rdd.py` checks the time that the elements are evaluated at. Another test that the results remain the same are added to `test_dataframe.py`. I also ran a micro benchmark in the examples directory `prefetch.py` which shows an improvement of ~40% in this specific use case. > > 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). > Running timers: > > [Stage 32:> (0 + 1) / 1] > Results: > > Prefetch time: > > 100.228110831 > > > Regular time: > > 188.341721614 > > > Closes #25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2019-09-20 09:59:31 -07:00
Burak Yavuz	eb7ee6834d	[SPARK-29062][SQL] Add V1_BATCH_WRITE to the TableCapabilityChecks ### What changes were proposed in this pull request? Currently the checks in the Analyzer require that V2 Tables have BATCH_WRITE defined for all tables that have V1 Write fallbacks. This is confusing as these tables may not have the V2 writer interface implemented yet. This PR adds this table capability to these checks. In addition, this allows V2 tables to leverage the V1 APIs for DataFrameWriter.save if they do extend the V1_BATCH_WRITE capability. This way, these tables can continue to receive partitioning information and also perform checks for the existence of tables, and support all SaveModes. ### Why are the changes needed? Partitioned saves through DataFrame.write are otherwise broken for V2 tables that support the V1 write API. ### Does this PR introduce any user-facing change? No ### How was this patch tested? V1WriteFallbackSuite Closes #25767 from brkyvz/bwcheck. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-20 22:04:32 +08:00
Takeshi Yamamuro	ec8a1a8e88	[SPARK-29122][SQL] Propagate all the SQL conf to executors in SQLQueryTestSuite ### What changes were proposed in this pull request? This pr is to propagate all the SQL configurations to executors in `SQLQueryTestSuite`. When the propagation enabled in the tests, a potential bug below becomes apparent; ``` CREATE TABLE num_data (id int, val decimal(38,10)) USING parquet; .... select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4): QueryOutput(select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4),struct<>,java.lang.IllegalArgumentException [info] requirement failed: MutableProjection cannot use UnsafeRow for output data types: decimal(38,0)) (SQLQueryTestSuite.scala:380) ``` The root culprit is that `InterpretedMutableProjection` has incorrect validation in the interpreter mode: `validExprs.forall { case (e, _) => UnsafeRow.isFixedLength(e.dataType) }`. This validation should be the same with the condition (`isMutable`) in `HashAggregate.supportsAggregate`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L1126 ### Why are the changes needed? Bug fixes. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added tests in `AggregationQuerySuite` Closes #25831 from maropu/SPARK-29122. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-20 21:41:09 +09:00
Jungtaek Lim (HeartSaVioR)	5e92301723	[SPARK-29161][CORE][SQL][STREAMING] Unify default wait time for waitUntilEmpty ### What changes were proposed in this pull request? This is a follow-up of the [review comment](https://github.com/apache/spark/pull/25706#discussion_r321923311). This patch unifies the default wait time to be 10 seconds as it would fit most of UTs (as they have smaller timeouts) and doesn't bring additional latency since it will return if the condition is met. This patch doesn't touch the one which waits 100000 milliseconds (100 seconds), to not break anything unintentionally, though it is questionable whether we really need to wait for 100 seconds. ### Why are the changes needed? It simplifies the test code and gets rid of various heuristic values on timeout. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? CI build will test the patch, as it would be the best environment to test the patch (builds are running there). Closes #25837 from HeartSaVioR/MINOR-unify-default-wait-time-for-wait-until-empty. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 23:11:54 -07:00
Dongjoon Hyun	5b478416f8	[SPARK-28208][SQL][FOLLOWUP] Use `tryWithResource` pattern ### What changes were proposed in this pull request? This PR aims to use `tryWithResource` for ORC file. ### Why are the changes needed? This is a follow-up to address https://github.com/apache/spark/pull/25006#discussion_r298788206 . ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #25842 from dongjoon-hyun/SPARK-28208. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 15:33:12 -07:00
Ryan Blue	2c775f418f	[SPARK-28612][SQL] Add DataFrameWriterV2 API ## What changes were proposed in this pull request? This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API: * Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans. * Only creates v2 logical plans so the behavior is always consistent. * Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`. Here are a few example uses of the new API: ```scala df.writeTo("catalog.db.table").append() df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01") df.writeTo("catalog.db.table").overwritePartitions() df.writeTo("catalog.db.table").asParquet.create() df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace() df.writeTo("catalog.db.table").using("abc").replace() ``` ## How was this patch tested? Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans. Closes #25681 from rdblue/SPARK-28612-add-data-frame-writer-v2. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-09-19 13:32:09 -07:00
Jungtaek Lim (HeartSaVioR)	eee2e026bb	[SPARK-29165][SQL][TEST] Set log level of log generated code as ERROR in case of compile error on generated code in UT ### What changes were proposed in this pull request? This patch proposes to change the log level used for logging generated code when a compile error occurs in UT. This would make it easier to investigate compilation issues in generated code, as currently we get an exception message with a line number but the generated code itself is not logged (in most UTs the log level threshold is at least WARN). ### Why are the changes needed? This would help investigate compilation errors in generated code in UT. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #25835 from HeartSaVioR/MINOR-always-log-generated-code-on-fail-to-compile-in-unit-testing. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 11:47:47 -07:00
Sean Owen	c5d8a51f3b	[MINOR][BUILD] Fix about 15 misc build warnings ### What changes were proposed in this pull request? This addresses about 15 miscellaneous warnings that appear in the current build. ### Why are the changes needed? No functional changes, it just slightly reduces the amount of extra warning output. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests, run manually. Closes #25852 from srowen/BuildWarnings. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 11:37:42 -07:00
Gengliang Wang	b917a6593d	[SPARK-28989][SQL] Add a SQLConf `spark.sql.ansi.enabled` ### What changes were proposed in this pull request? Currently, there are new configurations for compatibility with ANSI SQL: * `spark.sql.parser.ansi.enabled` * `spark.sql.decimalOperations.nullOnOverflow` * `spark.sql.failOnIntegralTypeOverflow` This PR is to add new configuration `spark.sql.ansi.enabled` and remove the 3 options above. When the configuration is true, Spark tries to conform to the ANSI SQL specification. It will be disabled by default. ### Why are the changes needed? Make it simple and straightforward. ### Does this PR introduce any user-facing change? The new features for ANSI compatibility will be set via one configuration `spark.sql.ansi.enabled`. ### How was this patch tested? Existing unit tests. Closes #25693 from gengliangwang/ansiEnabled. Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-18 22:30:28 -07:00
Maxim Gekk	a6a663c437	[SPARK-29141][SQL][TEST] Use SqlBasedBenchmark in SQL benchmarks ### What changes were proposed in this pull request? Refactored SQL-related benchmark and made them depend on `SqlBasedBenchmark`. In particular, creation of Spark session are moved into `override def getSparkSession: SparkSession`. ### Why are the changes needed? This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor & extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniformly. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the modified benchmarks. Closes #25828 from MaxGekk/sql-benchmarks-refactoring. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 17:52:23 -07:00
Yuming Wang	8c3f27ceb4	[SPARK-28683][BUILD] Upgrade Scala to 2.12.10 ## What changes were proposed in this pull request? This PR upgrade Scala to 2.12.10. Release notes: - Fix regression in large string interpolations with non-String typed splices - Revert "Generate shallower ASTs in pattern translation" - Fix regression in classpath when JARs have 'a.b' entries beside 'a/b' - Faster compiler: 5–10% faster since 2.12.8 - Improved compatibility with JDK 11, 12, and 13 - Experimental support for build pipelining and outline type checking More details: https://github.com/scala/scala/releases/tag/v2.12.10 https://github.com/scala/scala/releases/tag/v2.12.9 ## How was this patch tested? Existing tests Closes #25404 from wangyum/SPARK-28683. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 13:30:36 -07:00
bartosz25	b4b2e958ce	[MINOR][SS][DOCS] Adapt multiple watermark policy comment to the reality ### What changes were proposed in this pull request? Previous comment was true for Apache Spark 2.3.0. The 2.4.0 release brought multiple watermark policy and therefore stating that the 'min' is always chosen is misleading. This PR updates the comments about multiple watermark policy. They aren't true anymore since in case of multiple watermarks, we can configure which one will be applied to the query. This change was brought with Apache Spark 2.4.0 release. ### Why are the changes needed? It introduces some confusion about the real execution of the commented code. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The tests weren't added because the change is only about the documentation level. I affirm that the contribution is my original work and that I license the work to the project under the project's open source license. Closes #25832 from bartosz25/fix_comments_multiple_watermark_policy. Authored-by: bartosz25 <bartkonieczny@yahoo.fr> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 10:51:11 -07:00
Owen O'Malley	dfb0a8bb04	[SPARK-28208][BUILD][SQL] Upgrade to ORC 1.5.6 including closing the ORC readers ## What changes were proposed in this pull request? It upgrades ORC from 1.5.5 to 1.5.6 and closes the ORC readers when they aren't used to create RecordReaders. ## How was this patch tested? The changed unit tests were run. Closes #25006 from omalley/spark-28208. Lead-authored-by: Owen O'Malley <omalley@apache.org> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 09:32:43 -07:00
John Zhuge	ee94b5d701	[SPARK-29030][SQL] Simplify lookupV2Relation ## What changes were proposed in this pull request? Simplify the return type for `lookupV2Relation` which makes the 3 callers more straightforward. ## How was this patch tested? Existing unit tests. Closes #25735 from jzhuge/lookupv2relation. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-09-18 09:27:11 -07:00
sandeep katta	376e17c082	[SPARK-29101][SQL] Fix count API for csv file when DROPMALFORMED mode is selected ### What changes were proposed in this pull request? #DataSet fruit,color,price,quantity apple,red,1,3 banana,yellow,2,4 orange,orange,3,5 xxx This PR aims to fix the below ``` scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false) scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count res1: Long = 4 ``` This is caused by the issue [SPARK-24645](https://issues.apache.org/jira/browse/SPARK-24645). SPARK-24645 issue can also be solved by [SPARK-25387](https://issues.apache.org/jira/browse/SPARK-25387) ### Why are the changes needed? SPARK-24645 caused this regression, so reverted the code as it can also be solved by SPARK-25387 ### Does this PR introduce any user-facing change? No, ### How was this patch tested? Added UT, and also tested the bug SPARK-24645 SPARK-24645 regression ![image](https://user-images.githubusercontent.com/35216143/65067957-4c08ff00-d9a5-11e9-8d43-a4a23a61e8b8.png) Closes #25820 from sandeep-katta/SPARK-29101. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 23:33:13 +09:00
Maxim Gekk	c2734ab1fc	[SPARK-29012][SQL] Support special timestamp values ### What changes were proposed in this pull request? Supported special string values for `TIMESTAMP` type. They are simply notational shorthands that will be converted to ordinary timestamp values when read. The following string values are supported: - `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)` - `today [zoneId]` - midnight today. - `yesterday [zoneId]` -midnight yesterday - `tomorrow [zoneId]` - midnight tomorrow - `now` - current query start time. For example: ```sql spark-sql> SELECT timestamp 'tomorrow'; 2019-09-07 00:00:00 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html) ### Does this PR introduce any user-facing change? Previously, the parser fails on the special values with the error: ```sql spark-sql> select timestamp 'today'; Error in query: Cannot parse the TIMESTAMP value: today(line 1, pos 7) ``` After the changes, the special values are converted to appropriate dates: ```sql spark-sql> select timestamp 'today'; 2019-09-06 00:00:00 ``` ### How was this patch tested? - Added tests to `TimestampFormatterSuite` to check parsing special values from regular strings. - Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String` - Uncommented tests in `timestamp.sql` Closes #25716 from MaxGekk/timestamp-special-values. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 23:30:59 +09:00
Gengliang Wang	3da2786dc6	[SPARK-29096][SQL] The exact math method should be called only when there is a corresponding function in Math ### What changes were proposed in this pull request? 1. After https://github.com/apache/spark/pull/21599, if the option "spark.sql.failOnIntegralTypeOverflow" is enabled, all binary arithmetic operators use the exact version of the function. However, only `Add`/`Subtract`/`Multiply` have a corresponding exact function in `java.lang.Math`. When the option "spark.sql.failOnIntegralTypeOverflow" is enabled, a runtime exception "BinaryArithmetics must override either exactMathMethod or genCode" is thrown if other binary arithmetic operators are used, such as "Divide" or "Remainder". The exact math method should be called only when there is a corresponding function in `java.lang.Math`. 2. Revise the log output of casting to `Int`/`Short`. 3. Enable `spark.sql.failOnIntegralTypeOverflow` for pgSQL tests in `SQLQueryTestSuite`. ### Why are the changes needed? 1. Fix the bugs of https://github.com/apache/spark/pull/21599. 2. The pgSQL test cases intend to check the overflow of integer/long types, so `spark.sql.failOnIntegralTypeOverflow` should be enabled. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #25804 from gengliangwang/enableIntegerOverflowInSQLTest. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-18 16:59:17 +08:00
turbofei	eef5e6d348	[SPARK-29113][DOC] Fix some annotation errors and remove meaningless annotations in project ### What changes were proposed in this pull request? In this PR, I fix some annotation errors and remove meaningless annotations in project. ### Why are the changes needed? There are some annotation errors and meaningless annotations in project. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Verified manually. Closes #25809 from turboFei/SPARK-29113. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 13:12:18 +09:00
s71955	4559a82a1d	[SPARK-28930][SQL] Last Access Time value shall display 'UNKNOWN' in all clients What changes were proposed in this pull request? Issue 1 : modifications not required as these are different formats for the same info. In the case of a Spark DataFrame, null is correct. Issue 2 mentioned in JIRA Spark SQL "desc formatted tablename" is not showing the header # col_name,data_type,comment , seems to be the header has been removed knowingly as part of SPARK-20954. Issue 3: Corrected the Last Access time, the value shall display 'UNKNOWN' as currently system wont support the last access time evaluation, since hive was setting Last access time as '0' in metastore even though spark CatalogTable last access time value set as -1. this will make the validation logic of LasAccessTime where spark sets 'UNKNOWN' value if last access time value set as -1 (means not evaluated). Does this PR introduce any user-facing change? No How was this patch tested? Locally and corrected a ut. Attaching the test report below ![SPARK-28930](https://user-images.githubusercontent.com/12999161/64484908-83a1d980-d236-11e9-8062-9facf3003e5e.PNG) Closes #25720 from sujith71955/master_describe_info. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 12:54:44 +09:00
Chris Martin	05988b256e	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs ### What changes were proposed in this pull request? Adds a new cogroup Pandas UDF. This allows two grouped dataframes to be cogrouped together and a (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame UDF to be applied to each cogroup. Example usage ``` import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType df1 = spark.createDataFrame( [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)], ("time", "id", "v1")) df2 = spark.createDataFrame( [(20000101, 1, "x"), (20000101, 2, "y")], ("time", "id", "v2")) @pandas_udf("time int, id int, v1 double, v2 string", PandasUDFType.COGROUPED_MAP) def asof_join(l, r): return pd.merge_asof(l, r, on="time", by="id") df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show() ``` +--------+---+---+---+ | time| id| v1| v2| +--------+---+---+---+ |20000101| 1|1.0| x| |20000102| 1|3.0| x| |20000101| 2|2.0| y| |20000102| 2|4.0| y| +--------+---+---+---+ ### How was this patch tested? Added unit test test_pandas_udf_cogrouped_map Closes #24981 from d80tb7/SPARK-27463-poc-arrow-stream. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-09-17 17:13:50 -07:00
Maxim Gekk	02db706090	[SPARK-29115][SQL][TEST] Add benchmarks for make_date() and make_timestamp() ### What changes were proposed in this pull request? Added new benchmarks for `make_date()` and `make_timestamp()` to detect performance issues, and figure out functions speed on foldable arguments. - `make_date()` is benchmarked on fully foldable arguments. - `make_timestamp()` is benchmarked on corner case `60.0`, foldable time fields and foldable date. ### Why are the changes needed? To find out inputs where `make_date()` and `make_timestamp()` have performance problems. This should be useful in the future optimizations of the functions and users apps. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the benchmark and manually checking generated dates/timestamps. Closes #25813 from MaxGekk/make_datetime-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-17 15:09:16 -07:00
xy_xin	3fc52b5557	[SPARK-28950][SQL] Refine the code of DELETE ### What changes were proposed in this pull request? This PR refines the code of DELETE: 1. make `whereClause` optional, in which case DELETE deletes all of the data in a table; 2. add more test cases; 3. make some other refinements. This is a follow-up of SPARK-28351. ### Why are the changes needed? An optional where clause in DELETE respects the SQL standard. ### Does this PR introduce any user-facing change? Yes. But since this is a non-released feature, this change does not have any end-user effects. ### How was this patch tested? A new test case is added. Closes #25652 from xianyinxin/SPARK-28950. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-18 01:14:14 +08:00
Maxim Gekk	db996ccad9	[SPARK-29074][SQL] Optimize `date_format` for foldable `fmt` ### What changes were proposed in this pull request? In the PR, I propose to create an instance of `TimestampFormatter` only once at the initialization, and reuse it inside of `nullSafeEval()` and `doGenCode()` in the case when the `fmt` parameter is foldable. ### Why are the changes needed? The changes improve performance of the `date_format()` function. Before: ``` format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ format date wholestage off 7180 / 7181 1.4 718.0 1.0X format date wholestage on 7051 / 7194 1.4 705.1 1.0X ``` After: ``` format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ format date wholestage off 4787 / 4839 2.1 478.7 1.0X format date wholestage on 4736 / 4802 2.1 473.6 1.0X ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? By existing test suites `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #25782 from MaxGekk/date_format-foldable. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-17 16:00:10 +09:00
Jungtaek Lim (HeartSaVioR)	c8628354b7	[SPARK-28996][SQL][TESTS] Add tests regarding username of HiveClient ### What changes were proposed in this pull request? This patch proposes to add new tests to test the username of HiveClient to prevent changing the semantic unintentionally. The owner of Hive table has been changed back-and-forth, principal -> username -> principal, and looks like the change is not intentional. (Please refer [SPARK-28996](https://issues.apache.org/jira/browse/SPARK-28996) for more details.) This patch intends to prevent this. This patch also renames previous HiveClientSuite(s) to HivePartitionFilteringSuite(s) as it was commented as TODO, as well as previous tests are too narrowed to test only partition filtering. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Newly added UTs. Closes #25696 from HeartSaVioR/SPARK-28996. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-17 14:04:23 +08:00
Liang-Chi Hsieh	dffd92e977	[SPARK-29100][SQL] Fix compilation error in codegen with switch from InSet expression ### What changes were proposed in this pull request? When InSet generates Java switch-based code, if the input set is empty, we don't generate switch condition, but a simple expression that is default case of original switch. ### Why are the changes needed? SPARK-26205 adds an optimization to InSet that generates Java switch condition for certain cases. When the given set is empty, it is possibly that codegen causes compilation error: ``` [info] - SPARK-29100: InSet with empty input set * FAILED * (58 milliseconds) [info] Code generation of input[0, int, true] INSET () failed: [info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1 [info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1 [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308) [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386) [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383) ``` ### Does this PR introduce any user-facing change? Yes. Previously, when users have InSet against an empty set, generated code causes compilation error. This patch fixed it. ### How was this patch tested? Unit test added. Closes #25806 from viirya/SPARK-29100. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-17 11:06:10 +08:00
Takeshi Yamamuro	95073fb62b	[SPARK-29008][SQL] Define an individual method for each common subexpression in HashAggregateExec ### What changes were proposed in this pull request? This pr proposes to define an individual method for each common subexpression in HashAggregateExec. In the current master, the common subexpr elimination code in HashAggregateExec is expanded in a single method; `4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)` The method size can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, in a query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`, the current master generates; ``` /* 098 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException { /* 099 */ // do aggregate /* 100 */ // common sub-expressions /* 101 */ int agg_value_6 = -1; /* 102 */ /* 103 */ agg_value_6 = agg_expr_0_0 + agg_expr_1_0; /* 104 */ /* 105 */ int agg_value_5 = -1; /* 106 */ /* 107 */ agg_value_5 = agg_value_6 + agg_expr_2_0; /* 108 */ boolean agg_isNull_4 = false; /* 109 */ long agg_value_4 = -1L; /* 110 */ if (!false) { /* 111 */ agg_value_4 = (long) agg_value_5; /* 112 */ } /* 113 */ int agg_value_10 = -1; /* 114 */ /* 115 */ agg_value_10 = agg_expr_0_0 + agg_expr_1_0; /* 116 */ // evaluate aggregate functions and update aggregation buffers /* 117 */ agg_doAggregate_sum_0(agg_value_10); /* 118 */ agg_doAggregate_avg_0(agg_value_4, agg_isNull_4); /* 119 */ /* 120 */ } ``` On the other hand, this pr generates; ``` /* 121 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException { /* 122 */ // do aggregate /* 123 */ // common sub-expressions /* 124 */ long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0); /* 125 */ int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0); /* 126 */ // evaluate aggregate functions and update aggregation buffers /* 127 */ agg_doAggregate_sum_0(agg_subExprValue_1); /* 128 */ agg_doAggregate_avg_0(agg_subExprValue_0); /* 129 */ /* 130 */ } ``` I run some micro benchmarks for this pr; ``` (base) maropu~:$system_profiler SPHardwareDataType Hardware: Hardware Overview: Processor Name: Intel Core i5 Processor Speed: 2 GHz Number of Processors: 1 Total Number of Cores: 2 L2 Cache (per Core): 256 KB L3 Cache: 4 MB Memory: 8 GB (base) maropu~:$java -version java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) (base) maropu~:$ /bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v val numCols = 40 val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" } spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t") val aggExprs = (2 until numCols).map { i => (0 until i).map(d => s"_c$d") .mkString("AVG(", " + ", ")") } // Drops the time of a first run then pick that of a second run timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() } // the master maxCodeGen: 12957 Elapsed time: 36.309858661s // this pr maxCodeGen=4184 Elapsed time: 2.399490285s ``` ### Why are the changes needed? To avoid the too-long-function issue in JVMs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `WholeStageCodegenSuite` Closes #25710 from maropu/SplitSubexpr. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-17 11:09:55 +09:00
hongdd	5881871ca5	[SPARK-26929][SQL] fix table owner use user instead of principal when create table through spark-sql or beeline …create table through spark-sql or beeline ## What changes were proposed in this pull request? fix table owner use user instead of principal when create table through spark-sql private val userName = conf.getUser will get ugi's userName which is principal info, and i copy the source code into HiveClientImpl, and use ugi.getShortUserName() instead of ugi.getUserName(). The owner display correctly. ## How was this patch tested? 1. create a table in kerberos cluster 2. use "desc formatted tbName" check owner Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #23952 from hddong/SPARK-26929-fix-table-owner. Lead-authored-by: hongdd <jn_hdd@163.com> Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-09-16 11:07:50 -07:00
Takeshi Yamamuro	6297287dfa	[SPARK-29061][SQL] Prints bytecode statistics in debugCodegen ### What changes were proposed in this pull request? This pr proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in debug prints, `debugCodegen`. Since these metrics are critical for codegen framework development, I think it's worth printing them there. This pr intends to enable `debugCodegen` to print these metrics as follows: ``` scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) == ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L]) +- (1) LocalTableScan [v#0] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } ... ``` ### Why are the changes needed? For efficient development. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested Closes #25766 from maropu/PrintBytecodeStats. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-16 21:48:07 +08:00
Wenchen Fan	1b99d0cca4	[SPARK-29069][SQL] ResolveInsertInto should not do table lookup ### What changes were proposed in this pull request? It's more clear to only do table lookup in `ResolveTables` rule (for v2 tables) and `ResolveRelations` rule (for v1 tables). `ResolveInsertInto` should only resolve the `InsertIntoStatement` with resolved relations. ### Why are the changes needed? to make the code simpler ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #25774 from cloud-fan/simplify. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-16 09:46:34 +09:00
changchun.wang	b91648cfd0	[SPARK-28856][FOLLOW-UP][SQL][TEST] Add the `namespaces` keyword to TableIdentifierParserSuite ### What changes were proposed in this pull request? This PR add the `namespaces` keyword to `TableIdentifierParserSuite`. ### Why are the changes needed? Improve the test. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25758 from highmoutain/3.0bugfix. Authored-by: changchun.wang <251922566@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:11:38 -07:00
Jungtaek Lim (HeartSaVioR)	61e5aebce3	[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping ### What changes were proposed in this pull request? This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping. Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.) ### Why are the changes needed? The bug brings NPE. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Previous patch #25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch. Closes #25790 from HeartSaVioR/SPARK-29046-v2. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:04:56 -07:00
Maxim Gekk	1b7afc0c98	[SPARK-28471][SQL][DOC][FOLLOWUP] Fix year patterns in the comments of date-time expressions ### What changes were proposed in this pull request? In the PR, I propose to fix comments of date-time expressions, and replace the `yyyy` pattern by `uuuu` when the implementation supposes the former one. ### Why are the changes needed? To make comments consistent to implementations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running Scala Style checker. Closes #25796 from MaxGekk/year-pattern-uuuu-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:02:15 -07:00
Dongjoon Hyun	13b77e52d2	Revert "[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping" This reverts commit `850833fa17`.	2019-09-14 00:09:45 -07:00
Juliusz Sompolski	fcf9b41b49	[SPARK-29056] ThriftServerSessionPage displays 1970/01/01 finish and close time when unset ### What changes were proposed in this pull request? ThriftServerSessionPage displays timestamp 0 (1970/01/01) instead of nothing if query finish time and close time are not set. ![image](https://user-images.githubusercontent.com/25019163/64711118-6d578000-d4b9-11e9-9b11-2e3616319a98.png) Change it to display nothing, like ThriftServerPage. ### Why are the changes needed? Obvious bug. ### Does this PR introduce any user-facing change? Finish time and Close time will be displayed correctly on ThriftServerSessionPage in JDBC/ODBC Spark UI. ### How was this patch tested? Manual test. Closes #25762 from juliuszsompolski/SPARK-29056. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-13 09:13:57 -07:00
WeichenXu	5631a96367	[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection ### What changes were proposed in this pull request? The `Column.isInCollection()` with a large size collection will generate an expression with large size children expressions. This makes the analyzer and optimizer take a long time to run. In this PR, the `isInCollection()` function directly generates an `InSet` expression, to avoid generating too many children expressions. ### Why are the changes needed? `Column.isInCollection()` with a large size collection sometimes becomes a bottleneck when running sql. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually benchmark it in spark-shell: ``` def testExplainTime(collectionSize: Int) = { val df = spark.range(10).withColumn("id2", col("id") + 1) val list = Range(0, collectionSize).toList val startTime = System.currentTimeMillis() df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain() val elapsedTime = System.currentTimeMillis() - startTime println(s"cost time: ${elapsedTime}ms") } ``` Then test on collection size 5, 10, 100, 1000, 10000, test result is: collection size | explain time (before) | explain time (after) ------ | ------ | ------ 5 | 26ms | 29ms 10 | 30ms | 48ms 100 | 104ms | 50ms 1000 | 1202ms | 58ms 10000 | 10012ms | 523ms Closes #25754 from WeichenXu123/improve_in_collection. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-12 17:23:08 -07:00
maryannxue	c56a012bc8	[SPARK-29060][SQL] Add tree traversal helper for adaptive spark plans ### What changes were proposed in this pull request? This PR adds a utility class `AdaptiveSparkPlanHelper` which provides methods related to tree traversal of an `AdaptiveSparkPlanExec` plan. Unlike their counterparts in `TreeNode` or `QueryPlan`, these methods traverse down leaf nodes of adaptive plans, i.e., `AdaptiveSparkPlanExec` and `QueryStageExec`. ### Why are the changes needed? This utility class can greatly simplify tree traversal code for adaptive spark plans. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Refined `AdaptiveQueryExecSuite` with the help of the new utility methods. Closes #25764 from maryannxue/aqe-utils. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-12 21:49:21 +08:00
Maxim Gekk	8e9fafbb21	[SPARK-29065][SQL][TEST] Extend `EXTRACT` benchmark ### What changes were proposed in this pull request? In the PR, I propose to extend `ExtractBenchmark` and add new ones for: - `EXTRACT` and `DATE` as input column - the `DATE_PART` function and `DATE`/`TIMESTAMP` input column ### Why are the changes needed? The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR https://github.com/apache/spark/pull/25410 where some of sub-expressions take `DATE` column as the input (`Millennium`, `Year` and etc.) but others require `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude overhead of implicit conversions `DATE` <-> `TIMESTAMP`. ### Does this PR introduce any user-facing change? No, it doesn't. ### How was this patch tested? - Regenerated results of `ExtractBenchmark` Closes #25772 from MaxGekk/date_part-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-12 21:32:35 +09:00
Wenchen Fan	053dd858d3	[SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes ### What changes were proposed in this pull request? reorganize the packages of DS v2 interfaces/classes: 1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes 2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes 3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes 4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes ### Why are the changes needed? Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25700 from cloud-fan/package. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-12 19:59:34 +08:00
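
Note on SPARK-29012 (c2734ab1fc) in the log above: a minimal Python sketch of how the special `TIMESTAMP` strings listed in that entry (`epoch`, `today`, `yesterday`, `tomorrow`, `now`) could be resolved against a fixed query start time. It is an illustration only, not Spark's implementation. The helper name `resolve_special_timestamp` is hypothetical, the optional `[zoneId]` suffix is ignored, and case-insensitive matching is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def resolve_special_timestamp(text: str, query_start: datetime) -> Optional[datetime]:
    """Map a special timestamp string to a concrete value.

    Hypothetical helper for illustration; returns None when `text` is not
    one of the special values, so a caller can fall back to normal parsing.
    """
    keyword = text.strip().lower()  # assumption: matching ignores case
    # Midnight of the day on which the query started.
    midnight = query_start.replace(hour=0, minute=0, second=0, microsecond=0)
    if keyword == "epoch":
        # 1970-01-01 00:00:00+00 (Unix system time zero)
        return datetime(1970, 1, 1, tzinfo=timezone.utc)
    if keyword == "now":
        # Current query start time
        return query_start
    if keyword == "today":
        return midnight
    if keyword == "yesterday":
        return midnight - timedelta(days=1)
    if keyword == "tomorrow":
        return midnight + timedelta(days=1)
    return None
```

With a query start time of 2019-09-06 12:30 UTC, `today` resolves to 2019-09-06 00:00:00 and `tomorrow` to 2019-09-07 00:00:00, which is consistent with the `spark-sql` examples quoted in that commit message.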

8476 commits