ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Jiajia Li	dc0bc7a6eb	[MINOR][DOCS] Fix some typos ### What changes were proposed in this pull request? This PR proposes a few typos: 1. Sparks => Spark's 2. parallize => parallelize 3. doesnt => doesn't Closes #26140 from plusplusjiajia/fix-typos. Authored-by: Jiajia Li <jiajia.li@intel.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-17 07:22:01 -07:00
Kent Yao	4b902d3b45	[SPARK-29491][SQL] Add bit_count function support ### What changes were proposed in this pull request? BIT_COUNT(N) - Returns the number of bits that are set in the argument N as an unsigned 64-bit integer, or NULL if the argument is NULL ### Why are the changes needed? Supported by MySQL，Microsoft SQL Server ，etc. ### Does this PR introduce any user-facing change? add a built-in function ### How was this patch tested? add uts Closes #26139 from yaooqinn/SPARK-29491. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-17 20:22:38 +08:00
Yuanjian Li	239ee3f561	[SPARK-9853][CORE] Optimize shuffle fetch of continuous partition IDs This PR takes over #19788. After we split the shuffle fetch protocol from `OpenBlock` in #24565, this optimization can be extended in the new shuffle protocol. Credit to yucai, closes #19788. ### What changes were proposed in this pull request? This PR adds the support for continuous shuffle block fetching in batch: - Shuffle client changes: - Add new feature tag `spark.shuffle.fetchContinuousBlocksInBatch`, implement the decision logic in `BlockStoreShuffleReader`. - Merge the continuous shuffle block ids in batch if needed in ShuffleBlockFetcherIterator. - Shuffle server changes: - Add support in `ExternalBlockHandler` for the external shuffle service side. - Make `ShuffleBlockResolver.getBlockData` accept getting block data by range. - Protocol changes: - Add new block id type `ShuffleBlockBatchId` represent continuous shuffle block ids. - Extend `FetchShuffleBlocks` and `OneForOneBlockFetcher`. - After the new shuffle fetch protocol completed in #24565, the backward compatibility for external shuffle service can be controlled by `spark.shuffle.useOldFetchProtocol`. ### Why are the changes needed? In adaptive execution, one reducer may fetch multiple continuous shuffle blocks from one map output file. However, as the original approach, each reducer needs to fetch those 10 reducer blocks one by one. This way needs many IO and impacts performance. This PR is to support fetching those continuous shuffle blocks in one IO (batch way). See below example: The shuffle block is stored like below: ![image](https://user-images.githubusercontent.com/2989575/51654634-c37fbd80-1fd3-11e9-935e-5652863676c3.png) The ShuffleId format is s"shuffle_$shuffleId_$mapId_$reduceId", referring to BlockId.scala. In adaptive execution, one reducer may want to read output for reducer 5 to 14, whose block Ids are from shuffle_0_x_5 to shuffle_0_x_14. Before this PR, Spark needs 10 disk IOs + 10 network IOs for each output file. After this PR, Spark only needs 1 disk IO and 1 network IO. This way can reduce IO dramatically. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add new UT. Integrate test with setting `spark.sql.adaptive.enabled=true`. Closes #26040 from xuanyuanking/SPARK-9853. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: yucai <yyu1@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-17 14:47:56 +08:00
Kent Yao	6d4cc7b855	[SPARK-27880][SQL] Add bool_and for every and bool_or for any as function aliases ### What changes were proposed in this pull request? bool_or(x) <=> any/some(x) <=> max(x) bool_and(x) <=> every(x) <=> min(x) Args: x: boolean ### Why are the changes needed? PostgreSQL, Presto and Vertica, etc also support this feature: ### Does this PR introduce any user-facing change? add new functions support ### How was this patch tested? add ut Closes #26126 from yaooqinn/SPARK-27880. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-16 22:43:47 +08:00
Maxim Gekk	d11cbf2e36	[SPARK-29364][SQL] Return an interval from date subtract according to SQL standard ### What changes were proposed in this pull request? Proposed new expression `SubtractDates` which is used in `date1` - `date2`. It has the `INTERVAL` type, and returns the interval from `date1` (inclusive) and `date2` (exclusive). For example: ```sql > select date'tomorrow' - date'yesterday'; interval 2 days ``` Closes #26034 ### Why are the changes needed? - To conform the SQL standard which states the result type of `date operand 1` - `date operand 2` must be the interval type. See [4.5.3 Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt). - Improve Spark SQL UX and allow mixing date and timestamp in subtractions. For example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')` ### Does this PR introduce any user-facing change? Before the query below returns number of days: ```sql spark-sql> select date'2019-10-05' - date'2018-09-01'; 399 ``` After it returns an interval: ```sql spark-sql> select date'2019-10-05' - date'2018-09-01'; interval 1 years 1 months 4 days ``` ### How was this patch tested? - by new tests in `DateExpressionsSuite` and `TypeCoercionSuite`. - by existing tests in `date.sql` Closes #26112 from MaxGekk/date-subtract. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-16 06:26:01 -07:00
Yuming Wang	e00344edc1	[SPARK-29423][SS] lazily initialize StreamingQueryManager in SessionState ### What changes were proposed in this pull request? This PR makes `SessionState` lazily initialize `StreamingQueryManager` to avoid constructing `StreamingQueryManager` for each session when connecting to ThriftServer. ### Why are the changes needed? Reduce memory usage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? manual test 1. Start thriftserver: ``` build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true sbin/start-thriftserver.sh ``` 2. Open a session: ``` bin/beeline -u jdbc:hive2://localhost:10000 ``` 3. Check `StreamingQueryManager` instance: ``` jcmd \| grep HiveThriftServer2 \| awk -F ' ' '{print $1}' \| xargs jmap -histo \| grep StreamingQueryManager ``` Before this PR: ``` [rootspark-3267648 spark]# jcmd \| grep HiveThriftServer2 \| awk -F ' ' '{print $1}' \| xargs jmap -histo \| grep StreamingQueryManager 1954: 2 96 org.apache.spark.sql.streaming.StreamingQueryManager ``` After this PR: ``` [rootspark-3267648 spark]# jcmd \| grep HiveThriftServer2 \| awk -F ' ' '{print $1}' \| xargs jmap -histo \| grep StreamingQueryManager [rootspark-3267648 spark]# ``` Closes #26089 from wangyum/SPARK-29423. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-15 21:08:15 -07:00
Wenchen Fan	51f10ed90f	[SPARK-28560][SQL][FOLLOWUP] code cleanup for local shuffle reader ### What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/25295 This PR proposes a few code cleanups: 1. rename the special `getMapSizesByExecutorId` to `getMapSizesByMapIndex` 2. rename the parameter `mapId` to `mapIndex` as that's really a mapper index. 3. `BlockStoreShuffleReader` should take `blocksByAddress` directly instead of a map id. 4. rename `getMapReader` to `getReaderForOneMapper` to be more clearer. ### Why are the changes needed? make code easier to understand ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26128 from cloud-fan/followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-16 11:19:16 +08:00
Jeff Evans	95de93b24e	[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters Moving univocity-parsers version to spark-parent pom dependencyManagement section Adding new utility method to build multi-char delimiter string, which delegates to existing one Adding tests for multiple character delimited CSV ### What changes were proposed in this pull request? Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest. ### Why are the changes needed? It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing). ### Does this PR introduce any user-facing change? Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0. ### How was this patch tested? The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed. Closes #26027 from jeff303/SPARK-24540. Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-15 15:44:51 -05:00
Gengliang Wang	322ec0ba9b	[SPARK-28885][SQL] Follow ANSI store assignment rules in table insertion by default ### What changes were proposed in this pull request? When inserting a value into a column with the different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy": 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out-of-range(overflow). 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value to a integral field, the low-order bits of the value is inserted(the same as Java/Scala numeric type casting). For example, if 257 is inserted to a field of Byte type, the result is 1. 3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment, e.g., converting either `double` to `int` or `decimal` to `double` is allowed. The rules are originally for Dataset encoder. As far as I know, no mainstream DBMS is using this policy by default. Currently, the V1 data source uses "Legacy" policy by default, while V2 uses "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 in Spark 3.0. ### Why are the changes needed? Following the ANSI SQL standard is most reasonable among the 3 policies. ### Does this PR introduce any user-facing change? Yes. The default store assignment policy is ANSI for both V1 and V2 data sources. ### How was this patch tested? Unit test Closes #26107 from gengliangwang/ansiPolicyAsDefault. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-15 10:41:37 -07:00
jiake	9ac4b2dbc5	[SPARK-28560][SQL] Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution ## What changes were proposed in this pull request? Implement a rule in the new adaptive execution framework introduced in [SPARK-23128](https://issues.apache.org/jira/browse/SPARK-23128). This rule is used to optimize the shuffle reader to local shuffle reader when smj is converted to bhj in adaptive execution. ## How was this patch tested? Existing tests Closes #25295 from JkSelf/localShuffleOptimization. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-15 21:51:15 +08:00
Wenchen Fan	8915966bf4	[SPARK-29473][SQL] move statement logical plans to a new file ### What changes were proposed in this pull request? move the statement logical plans that were created for v2 commands to a new file `statements.scala`, under the same package of `v2Commands.scala`. This PR also includes some minor cleanups: 1. remove `private[sql]` from `ParsedStatement` as it's in the private package. 2. remove unnecessary override of `output` and `children`. 3. add missing classdoc. ### Why are the changes needed? Similar to https://github.com/apache/spark/pull/26111 , this is to better organize the logical plans of data source v2. It's a bit weird to put the statements in the package `org.apache.spark.sql.catalyst.plans.logical.sql` as `sql` is not a good sub-package name in Spark SQL. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26125 from cloud-fan/statement. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-15 15:05:49 +02:00
yangjie01	a988aaf3fa	[SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file use non-vectorized mode ### What changes were proposed in this pull request? There will be 2 times unsafeProjection convert operation When we read a Parquet data file use non-vectorized mode: 1. `ParquetGroupConverter` call unsafeProjection function to covert `SpecificInternalRow` to `UnsafeRow` every times when read Parquet data file use `ParquetRecordReader`. 2. `ParquetFileFormat` will call unsafeProjection function to covert this `UnsafeRow` to another `UnsafeRow` again when partitionSchema is not empty in DataSourceV1 branch, and `PartitionReaderWithPartitionValues` will always do this convert operation in DataSourceV2 branch. In this pr, remove `unsafeProjection` convert operation in `ParquetGroupConverter` and change `ParquetRecordReader` to produce `SpecificInternalRow` instead of `UnsafeRow`. ### Why are the changes needed? The first time convert in `ParquetGroupConverter` is redundant and `ParquetRecordReader` return a `InternalRow(SpecificInternalRow)` is enough. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit Test Closes #26106 from LuciferYang/spark-parquet-unsafe-projection. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-15 12:42:42 +08:00
Wenchen Fan	9407fba037	[SPARK-29412][SQL] refine the document of v2 session catalog config ### What changes were proposed in this pull request? Refine the document of v2 session catalog config, to clearly explain what it is, when it should be used and how to implement it. ### Why are the changes needed? Make this config more understandable ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins with the newly updated test cases. Closes #26071 from cloud-fan/config. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-15 10:18:58 +08:00
Dongjoon Hyun	ff9fcd501c	Revert "[SPARK-29107][SQL][TESTS] Port window.sql (Part 1)" This reverts commit `81915dacc4`.	2019-10-14 15:15:32 -07:00
Dongjoon Hyun	e696c36e32	[SPARK-29442][SQL] Set `default` mode should override the existing mode ### What changes were proposed in this pull request? This PR aims to fix the behavior of `mode("default")` to set `SaveMode.ErrorIfExists`. Also, this PR updates the exception message by adding `default` explicitly. ### Why are the changes needed? This is reported during `GRAPH API` PR. This builder pattern should work like the documentation. ### Does this PR introduce any user-facing change? Yes if the app has multiple `mode()` invocation including `mode("default")` and the `mode("default")` is the last invocation. This is really a corner case. - Previously, the last invocation was handled as `No-Op`. - After this bug fix, it will work like the documentation. ### How was this patch tested? Pass the Jenkins with the newly added test case. Closes #26094 from dongjoon-hyun/SPARK-29442. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-14 13:11:05 -07:00
DylanGuedes	81915dacc4	[SPARK-29107][SQL][TESTS] Port window.sql (Part 1) ### What changes were proposed in this pull request? This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/window.sql from lines 1~319 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/window.out ## How was this patch tested? Pass the Jenkins. ### Why are the changes needed? To ensure compatibility with PGSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Comparison with PgSQL results. Closes #25816 from DylanGuedes/spark-29107. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-14 10:17:16 -07:00
Maxim Gekk	da576a737c	[SPARK-29369][SQL] Support string intervals without the `interval` prefix ### What changes were proposed in this pull request? In the PR, I propose to move interval parsing to `CalendarInterval.fromCaseInsensitiveString()` which throws an `IllegalArgumentException` for invalid strings, and reuse it from `CalendarInterval.fromString()`. The former one handles `IllegalArgumentException` only and returns `NULL` for invalid interval strings. This will allow to support interval strings without the `interval` prefix in casting strings to intervals and in interval type constructor because they use `fromString()` for parsing string intervals. For example: ```sql spark-sql> select cast('1 year 10 days' as interval); interval 1 years 1 weeks 3 days spark-sql> SELECT INTERVAL '1 YEAR 10 DAYS'; interval 1 years 1 weeks 3 days ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL which supports interval strings without prefix: ```sql # select interval '2 months 1 microsecond'; interval ------------------------ 2 mons 00:00:00.000001 ``` and to improve Spark SQL UX. ### Does this PR introduce any user-facing change? Yes, previously parsing of interval strings without `interval` gives `NULL`: ```sql spark-sql> select interval '2 months 1 microsecond'; NULL ``` After: ```sql spark-sql> select interval '2 months 1 microsecond'; interval 2 months 1 microseconds ``` ### How was this patch tested? - Added new tests to `CalendarIntervalSuite.java` - A test for casting strings to intervals in `CastSuite` - Test for interval type constructor from strings in `ExpressionParserSuite` Closes #26079 from MaxGekk/interval-str-without-prefix. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-14 23:34:18 +08:00
Terry Kim	ef6dce29b2	[SPARK-29279][SQL] Merge SHOW NAMESPACES and SHOW DATABASES code path ### What changes were proposed in this pull request? Currently, `SHOW NAMESPACES` and `SHOW DATABASES` are separate code paths. This PR merges two implementations. ### Why are the changes needed? To remove code/behavior duplication ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new unit tests. Closes #26006 from imback82/combine_show. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-14 22:35:26 +08:00
Peter Toth	9e12c94c15	[SPARK-29359][SQL][TESTS] Better exception handling in (SQL\|ThriftServer)QueryTestSuite ### What changes were proposed in this pull request? This PR adds 2 changes regarding exception handling in `SQLQueryTestSuite` and `ThriftServerQueryTestSuite` - fixes an expected output sorting issue in `ThriftServerQueryTestSuite` as if there is an exception then there is no need for sort - introduces common exception handling in those 2 suites with a new `handleExceptions` method ### Why are the changes needed? Currently `ThriftServerQueryTestSuite` passes on master, but it fails on one of my PRs (https://github.com/apache/spark/pull/23531) with this error (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111651/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/sql_3/): ``` org.scalatest.exceptions.TestFailedException: Expected " [Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit org.apache.spark.SparkException] ", but got " [org.apache.spark.SparkException Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit] " Result did not match for query #4 WITH RECURSIVE r(level) AS ( VALUES (0) UNION ALL SELECT level + 1 FROM r ) SELECT * FROM r ``` The unexpected reversed order of expected output (error message comes first, then the exception class) is due to this line: https://github.com/apache/spark/pull/26028/files#diff-b3ea3021602a88056e52bf83d8782de8L146. It should not sort the expected output if there was an error during execution. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #26028 from peter-toth/SPARK-29359-better-exception-handling. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-12 22:17:37 -07:00
Maxim Gekk	d193248205	[SPARK-29368][SQL][TEST] Port interval.sql ### What changes were proposed in this pull request? This PR is to port interval.sql from PostgreSQL regression tests: https://raw.githubusercontent.com/postgres/postgres/REL_12_STABLE/src/test/regress/sql/interval.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/interval.out When porting the test cases, found PostgreSQL specific features below that do not exist in Spark SQL: - [SPARK-29369](https://issues.apache.org/jira/browse/SPARK-29369): Accept strings without `interval` prefix in casting to intervals - [SPARK-29370](https://issues.apache.org/jira/browse/SPARK-29370): Interval strings without explicit unit markings - [SPARK-29371](https://issues.apache.org/jira/browse/SPARK-29371): Support interval field values with fractional parts - [SPARK-29382](https://issues.apache.org/jira/browse/SPARK-29382): Support the `INTERVAL` type by Parquet datasource - [SPARK-29383](https://issues.apache.org/jira/browse/SPARK-29383): Support the optional prefix `` in interval strings - [SPARK-29384](https://issues.apache.org/jira/browse/SPARK-29384): Support `ago` in interval strings - [SPARK-29385](https://issues.apache.org/jira/browse/SPARK-29385): Make `INTERVAL` values comparable - [SPARK-29386](https://issues.apache.org/jira/browse/SPARK-29386): Copy data between a file and a table - [SPARK-29387](https://issues.apache.org/jira/browse/SPARK-29387): Support `*` and `\` operators for intervals - [SPARK-29388](https://issues.apache.org/jira/browse/SPARK-29388): Construct intervals from the `millenniums`, `centuries` or `decades` units - [SPARK-29389](https://issues.apache.org/jira/browse/SPARK-29389): Support synonyms for interval units - [SPARK-29390](https://issues.apache.org/jira/browse/SPARK-29390): Add the justify_days(), justify_hours() and justify_interval() functions - [SPARK-29391](https://issues.apache.org/jira/browse/SPARK-29391): Default year-month units - [SPARK-29393](https://issues.apache.org/jira/browse/SPARK-29393): Add the make_interval() function - [SPARK-29394](https://issues.apache.org/jira/browse/SPARK-29394): Support ISO 8601 format for intervals - [SPARK-29395](https://issues.apache.org/jira/browse/SPARK-29395): Precision of the interval type - [SPARK-29406](https://issues.apache.org/jira/browse/SPARK-29406): Interval output styles - [SPARK-29407](https://issues.apache.org/jira/browse/SPARK-29407): Support syntax for zero interval - [SPARK-29408](https://issues.apache.org/jira/browse/SPARK-29408): Support interval literal with negative sign `-` ### Why are the changes needed? To improve the test coverage, see https://issues.apache.org/jira/browse/SPARK-27763 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By manually comparing Spark results with PostgreSQL Closes #26055 from MaxGekk/port-interval-sql. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-12 17:44:40 -07:00
Maxim Gekk	f302c2ee62	[SPARK-29328][SQL][FOLLOWUP] Revert calculation of mean seconds per month ### What changes were proposed in this pull request? Revert this commit `18b7ad2fc5`. ### Why are the changes needed? See https://github.com/apache/spark/pull/16304#discussion_r92753590 ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? There is no test for that. Closes #26101 from MaxGekk/revert-mean-seconds-per-month. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-12 09:38:08 -05:00
Sean Owen	cc7493fa21	[SPARK-29416][CORE][ML][SQL][MESOS][TESTS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13) ### What changes were proposed in this pull request? Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place. ### Why are the changes needed? To compile with 2.13. ### Does this PR introduce any user-facing change? None. ### How was this patch tested? Existing tests. Closes #26073 from srowen/SPARK-29416. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-09 17:00:48 -07:00
Sean Owen	fa95a5c395	[SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13 ### What changes were proposed in this pull request? Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object. ### Why are the changes needed? It doesn't compile otherwise in Scala 2.13. - https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30 ### Does this PR introduce any user-facing change? Should be no behavior change at all. ### How was this patch tested? Existing tests. Closes #26070 from srowen/SPARK-29411. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-09 10:24:13 -07:00
herman	ba4d413fc9	[SPARK-29346][SQL] Add Aggregating Accumulator ### What changes were proposed in this pull request? This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions which can be computed at the same time. The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark. It uses projections and manual updates for each of the aggregation steps (initialize buffer, update buffer with new input row, merge two buffers and compute the final result on the buffer). Note that two of the steps (update and merge) use the aggregation buffer both as input and output. Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of serializing them. The merging logic for `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers, this is violated by the accumulator's implicit serialization. In order to get around this I have added `mergeBuffersObjects` method that merges two unserialized buffers to `TypedImperativeAggregate`. ### Why are the changes needed? This is the mechanism we are going to use to implement observable metrics. ### Does this PR introduce any user-facing change? No, not yet. ### How was this patch tested? Added `AggregatingAccumulator` test suite. Closes #26012 from hvanhovell/SPARK-29346. Authored-by: herman <herman@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-10-09 16:05:14 +02:00
Terry Kim	a927f1aefc	[SPARK-29373][SQL] DataSourceV2: Commands should not submit a spark job ### What changes were proposed in this pull request? DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all extend LeafExecNode. This results in running a job when executeCollect() is called. This breaks the previous behavior [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650). A new command physical operator will be introduced form which all V2 Exec classes derive to avoid running a job. ### Why are the changes needed? It is a bug since the current behavior runs a spark job, which breaks the existing behavior: [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650). ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #26048 from imback82/dsv2_command. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-09 11:44:25 +08:00
Sean Owen	ee83d09b53	[SPARK-29401][CORE][ML][SQL][GRAPHX][TESTS] Replace calls to .parallelize Arrays of tuples, ambiguous in Scala 2.13, with Seqs of tuples ### What changes were proposed in this pull request? Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like: ``` [ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives: (x: Unit,xs: Unit)Array[Unit] <and> (x: Double,xs: Double)Array[Double] <and> (x: Float,xs: Float)Array[Float] <and> (x: Long,xs: Long)Array[Long] <and> (x: Int,xs: Int)Array[Int] <and> (x: Char,xs: Char)Array[Char] <and> (x: Short,xs: Short)Array[Short] <and> (x: Byte,xs: Byte)Array[Byte] <and> (x: Boolean,xs: Boolean*)Array[Boolean] cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int)) ``` Using a `Seq` instead appears to resolve it, and is effectively equivalent. ### Why are the changes needed? To better cross-build for 2.13. ### Does this PR introduce any user-facing change? None. ### How was this patch tested? Existing tests. Closes #26062 from srowen/SPARK-29401. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-08 20:22:02 -07:00
Sean Owen	2d871ad0e7	[SPARK-29392][CORE][SQL][STREAMING] Remove symbol literal syntax 'foo, deprecated in Scala 2.13, in favor of Symbol("foo") ### What changes were proposed in this pull request? Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")` ### Why are the changes needed? Avoids ~50 deprecation warnings when attempting to build with 2.13. ### Does this PR introduce any user-facing change? None, should be no functional change at all. ### How was this patch tested? Existing tests. Closes #26061 from srowen/SPARK-29392. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-08 20:15:37 -07:00
Guilherme	de360e96d7	[SPARK-29336][SQL] Fix the implementation of QuantileSummaries.merge (guarantee that the relativeError will be respected) ### What changes were proposed in this pull request? Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test-case showing the previous bug. ### Why are the changes needed? The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the result of multiple parallel QuantileSummaries. The current implementation violates some invariants and therefore the effective error can be larger than the specified. ### Does this PR introduce any user-facing change? Yes, for same cases, the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example: ```scala var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } ``` In the current build it returns: ``` 16 12 10 11 17 ``` I couldn't run the code with this patch applied to double check the implementation. Can someone please confirm it now outputs at most `10`, please? ### How was this patch tested? A new unit test was added to uncover the previous bug. Closes #26029 from sitegui/SPARK-29336. Authored-by: Guilherme <sitegui@sitegui.com.br> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-08 08:11:10 -05:00
Dilip Biswal	ef1e8495ba	[SPARK-29366][SQL] Subqueries created for DPP are not printed in EXPLAIN FORMATTED ### What changes were proposed in this pull request? The subquery expressions introduced by DPP are not printed in the newer explain command. This PR fixes the code that computes the list of subqueries in the plan. SQL df1 and df2 are partitioned on k. ``` SELECT df1.id, df2.k FROM df1 JOIN df2 ON df1.k = df2.k AND df2.id < 2 ``` Before ``` \|== Physical Plan == * Project (9) +- * BroadcastHashJoin Inner BuildRight (8) :- * ColumnarToRow (2) : +- Scan parquet default.df1 (1) +- BroadcastExchange (7) +- * Project (6) +- * Filter (5) +- * ColumnarToRow (4) +- Scan parquet default.df2 (3) (1) Scan parquet default.df1 Output: [id#19L, k#20L] (2) ColumnarToRow [codegen id : 2] Input: [id#19L, k#20L] (3) Scan parquet default.df2 Output: [id#21L, k#22L] (4) ColumnarToRow [codegen id : 1] Input: [id#21L, k#22L] (5) Filter [codegen id : 1] Input : [id#21L, k#22L] Condition : (isnotnull(id#21L) AND (id#21L < 2)) (6) Project [codegen id : 1] Output : [k#22L] Input : [id#21L, k#22L] (7) BroadcastExchange Input: [k#22L] (8) BroadcastHashJoin [codegen id : 2] Left keys: List(k#20L) Right keys: List(k#22L) Join condition: None (9) Project [codegen id : 2] Output : [id#19L, k#22L] Input : [id#19L, k#20L, k#22L] ``` After ``` \|== Physical Plan == * Project (9) +- * BroadcastHashJoin Inner BuildRight (8) :- * ColumnarToRow (2) : +- Scan parquet default.df1 (1) +- BroadcastExchange (7) +- * Project (6) +- * Filter (5) +- * ColumnarToRow (4) +- Scan parquet default.df2 (3) (1) Scan parquet default.df1 Output: [id#19L, k#20L] (2) ColumnarToRow [codegen id : 2] Input: [id#19L, k#20L] (3) Scan parquet default.df2 Output: [id#21L, k#22L] (4) ColumnarToRow [codegen id : 1] Input: [id#21L, k#22L] (5) Filter [codegen id : 1] Input : [id#21L, k#22L] Condition : (isnotnull(id#21L) AND (id#21L < 2)) (6) Project [codegen id : 1] Output : [k#22L] Input : [id#21L, k#22L] (7) BroadcastExchange Input: [k#22L] (8) BroadcastHashJoin [codegen id : 2] Left keys: List(k#20L) Right keys: List(k#22L) Join condition: None (9) Project [codegen id : 2] Output : [id#19L, k#22L] Input : [id#19L, k#20L, k#22L] ===== Subqueries ===== Subquery:1 Hosting operator id = 1 Hosting Expression = k#20L IN subquery25 * HashAggregate (16) +- Exchange (15) +- * HashAggregate (14) +- * Project (13) +- * Filter (12) +- * ColumnarToRow (11) +- Scan parquet default.df2 (10) (10) Scan parquet default.df2 Output: [id#21L, k#22L] (11) ColumnarToRow [codegen id : 1] Input: [id#21L, k#22L] (12) Filter [codegen id : 1] Input : [id#21L, k#22L] Condition : (isnotnull(id#21L) AND (id#21L < 2)) (13) Project [codegen id : 1] Output : [k#22L] Input : [id#21L, k#22L] (14) HashAggregate [codegen id : 1] Input: [k#22L] (15) Exchange Input: [k#22L] (16) HashAggregate [codegen id : 2] Input: [k#22L] ``` ### Why are the changes needed? Without the fix, the subqueries are not printed in the explain plan. ### Does this PR introduce any user-facing change? Yes. the explain output will be different. ### How was this patch tested? Added a test case in ExplainSuite. Closes #26039 from dilipbiswal/explain_subquery_issue. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-07 23:39:05 -07:00
Wenchen Fan	948a6e80fe	[SPARK-28892][SQL][FOLLOWUP] add resolved logical plan for UPDATE TABLE ### What changes were proposed in this pull request? Add back the resolved logical plan for UPDATE TABLE. It was in https://github.com/apache/spark/pull/25626 before but was removed later. ### Why are the changes needed? In https://github.com/apache/spark/pull/25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan. ### Does this PR introduce any user-facing change? no, UPDATE is still not supported yet. ### How was this patch tested? new tests. Closes #26025 from cloud-fan/update. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-07 23:36:26 -07:00
Dongjoon Hyun	cb501771fa	[SPARK-25668][SQL][TESTS] Refactor TPCDSQueryBenchmark to use main method ### What changes were proposed in this pull request? This PR aims the followings. - Refactor `TPCDSQueryBenchmark` to use main method to improve the usability. - Reduce the number of iteration from 5 to 2 because it takes too long. (2 is okay because we have `Stdev` field now. If there is an irregular run, we can notice easily with that). - Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for the other scale factors, too.) - AWS EC2 `r3.xlarge` with `ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1)` is used. This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following. ``` # `spark-tpcds-datagen` needs this. (JDK8) $ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4 $ export SPARK_HOME=$PWD $ ./build/mvn clean package -DskipTests # Generate data. (JDK8) $ git clone gitgithub.com:maropu/spark-tpcds-datagen.git $ cd spark-tpcds-datagen/ $ build/mvn clean package $ mkdir -p /data/tpcds $ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4` ``` ### Why are the changes needed? Although the generated TPCDS data is random, we can keep the record. ### Does this PR introduce any user-facing change? No. (This is dev-only test benchmark). ### How was this patch tested? Manually run the benchmark. Please note that you need to have TPCDS data. ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1" ``` Closes #26049 from dongjoon-hyun/SPARK-25668. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-08 13:33:42 +09:00
gwang3	64fe82b519	[SPARK-29189][SQL] Add an option to ignore block locations when listing file ### What changes were proposed in this pull request? In our PROD env, we have a pure Spark cluster, I think this is also pretty common, where computation is separated from storage layer. In such deploy mode, data locality is never reachable. And there are some configurations in Spark scheduler to reduce waiting time for data locality(e.g. "spark.locality.wait"). While, problem is that, in listing file phase, the location informations of all the files, with all the blocks inside each file, are all fetched from the distributed file system. Actually, in a PROD environment, a table can be so huge that even fetching all these location informations need take tens of seconds. To improve such scenario, Spark need provide an option, where data locality can be totally ignored, all we need in the listing file phase are the files locations, without any block location informations. ### Why are the changes needed? And we made a benchmark in our PROD env, after ignore the block locations, we got a pretty huge improvement. Table Size \| Total File Number \| Total Block Number \| List File Duration(With Block Location) \| List File Duration(Without Block Location) -- \| -- \| -- \| -- \| -- 22.6T \| 30000 \| 120000 \| 16.841s \| 1.730s 28.8 T \| 42001 \| 148964 \| 10.099s \| 2.858s 3.4 T \| 20000 \| 20000 \| 5.833s \| 4.881s ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Via ut. Closes #25869 from wangshisan/SPARK-29189. Authored-by: gwang3 <gwang3@ebay.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-10-07 14:52:55 -05:00
Maxim Gekk	18b7ad2fc5	[SPARK-29328][SQL] Fix calculation of mean seconds per month ### What changes were proposed in this pull request? I introduced new constants `SECONDS_PER_MONTH` and `MILLIS_PER_MONTH`, and reused it in calculations of seconds/milliseconds per month. `SECONDS_PER_MONTH` is 2629746 because the average year of the Gregorian calendar is 365.2425 days long or 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year. ### Why are the changes needed? Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar) but existing implementation assumes 31 days per months or 12 * 31 = 372 days. That's far away from the the truth. ### Does this PR introduce any user-facing change? Yes, the changes affect at least 3 methods in `GroupStateImpl`, `EventTimeWatermark` and `MonthsBetween`. For example, the `month_between()` function will return different result in some cases. Before: ```sql spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.4516129 ``` After: ```sql spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.45996838 ``` ### How was this patch tested? By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. Closes #25998 from MaxGekk/days-in-year. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-07 08:47:46 -05:00
Maxim Gekk	932e2619ce	[SPARK-29365][SQL] Support dates and timestamps subtraction ### What changes were proposed in this pull request? Added new rules to `TypeCoercion.DateTimeOperations` for the `Subtract` expression which is replaced by existing `TimestampDiff` expression if one of its parameter has the `DATE` type and another one is the `TIMESTAMP` type. The date argument is casted to timestamp. ### Why are the changes needed? - To maintain feature parity with PostgreSQL which supports subtraction of a date from a timestamp and a timestamp from a date: ```sql maxim=# select timestamp'now' - date'epoch'; ?column? ---------------------------- 18175 days 21:07:33.412875 (1 row) maxim=# select date'2020-01-01' - timestamp'now'; ?column? ------------------------- 86 days 02:52:00.945296 (1 row) ``` - To conform to the SQL standard which defines datetime subtraction as an interval. ### Does this PR introduce any user-facing change? Yes, currently the queries bellow fails with the error: ```sql spark-sql> select timestamp'now' - date'2019-10-01'; Error in query: cannot resolve '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' (timestamp and date).; line 1 pos 7; 'Project [unresolvedalias((1570385107234000 - 18170), None)] +- OneRowRelation ``` after the changes: ```sql spark-sql> select timestamp'now' - date'2019-10-01'; interval 5 days 21 hours 4 minutes 55 seconds 878 milliseconds ``` ### How was this patch tested? - Add new cases to the `rule for date/timestamp operations` test in `TypeCoercionSuite` - by 2 new test in `datetime.sql` Closes #26036 from MaxGekk/date-timestamp-subtract. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-07 16:47:00 +09:00
Yuanjian Li	130e9ae5dc	[SPARK-29357][SQL][TESTS] Fix flaky test by changing to use AtomicLong ### What changes were proposed in this pull request? Change to use AtomicLong instead of a var in the test. ### Why are the changes needed? Fix flaky test. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #26020 from xuanyuanking/SPARK-25159. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-04 10:11:31 -07:00
Maxim Gekk	eecef75350	[SPARK-29355][SQL] Support timestamps subtraction ### What changes were proposed in this pull request? Added new expression `TimestampDiff` for timestamp subtractions. It accepts 2 timestamp expressions and returns another one of the `CalendarIntervalType`. While creating an instance of `CalendarInterval`, it initializes only the microsecond field by difference of the given timestamps in microseconds, and set the `months` field to zero. Also I added an rule for conversion `Subtract` to `TimestampDiff`, and enabled already ported test queries in `postgreSQL/timestamp.sql`. ### Why are the changes needed? To maintain feature parity with PostgreSQL which allows to get timestamp difference: ```sql # select timestamp'today' - timestamp'yesterday'; ?column? ---------- 1 day (1 row) ``` ### Does this PR introduce any user-facing change? Yes, previously users got the following error from timestamp subtraction: ```sql spark-sql> select timestamp'today' - timestamp'yesterday'; Error in query: cannot resolve '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' due to data type mismatch: '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' requires (numeric or interval) type, not timestamp; line 1 pos 7; 'Project [unresolvedalias((1570136400000000 - 1570050000000000), None)] +- OneRowRelation ``` after the changes they should get an interval: ```sql spark-sql> select timestamp'today' - timestamp'yesterday'; interval 1 days ``` ### How was this patch tested? - Added tests for `TimestampDiff` to `DateExpressionsSuite` - By new test in `TypeCoercionSuite`. - Enabled tests in `postgreSQL/timestamp.sql`. Closes #26022 from MaxGekk/timestamp-diff. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-04 09:39:19 -07:00
Wenchen Fan	275e044ba8	[SPARK-29039][SQL] centralize the catalog and table lookup logic ### What changes were proposed in this pull request? Currently we deal with different `ParsedStatement` in many places and write duplicated catalog/table lookup logic. In general the lookup logic is 1. try look up the catalog by name. If no such catalog, and default catalog is not set, convert `ParsedStatement` to v1 command like `ShowDatabasesCommand`. Otherwise, convert `ParsedStatement` to v2 command like `ShowNamespaces`. 2. try look up the table by name. If no such table, fail. If the table is a `V1Table`, convert `ParsedStatement` to v1 command like `CreateTable`. Otherwise, convert `ParsedStatement` to v2 command like `CreateV2Table`. However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places. This PR centralizes the catalog/table lookup logic by 3 rules. 1. `ResolveCatalogs` (in catalyst). This rule resolves v2 catalog from the multipart identifier in SQL statements, and convert the statement to v2 command if the resolved catalog is not session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), put an `UnresolvedV2Table` in the command. 2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`. 3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is session catalog. For commands that don't need to resolve the table, this rule converts the statement to v1 command directly. Otherwise, it converts the statement to v1 command if the resolved table is v1 table, and convert to v2 command if the resolved table is v2 table. Hopefully we can remove this rule eventually when v1 fallback is not needed anymore. ### Why are the changes needed? Reduce duplicated code and make the catalog/table lookup logic consistent. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #25747 from cloud-fan/lookup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-04 16:21:13 +08:00
Yuanjian Li	93289b54f5	[SPARK-29203][TESTS][MINOR][FOLLOW UP] Add access modifier for sparkConf in SQLQueryTestSuite ### What changes were proposed in this pull request? Add access modifier `protected` for `sparkConf` in SQLQueryTestSuite, because in the parent trait SharedSparkSession, it is protected. ### Why are the changes needed? Code consistency. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #26019 from xuanyuanking/SPARK-29203. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-04 16:54:47 +09:00
Gengliang Wang	91747bd91b	[SPARK-29326][SQL] ANSI store assignment policy: throw exception on casting failure ### What changes were proposed in this pull request? 1. With ANSI store assignment policy, an exception is thrown on casting failure 2. Introduce a new expression `AnsiCast` for the ANSI store assignment policy, so that the store assignment policy configuration won't affect the general `Cast`. ### Why are the changes needed? As per ANSI SQL standard, ANSI store assignment policy should throw an exception on insertion failure, such as inserting out-of-range value to a numeric field. ### Does this PR introduce any user-facing change? With ANSI store assignment policy, an exception is thrown on casting failure ### How was this patch tested? Unit test Closes #25997 from gengliangwang/newCast. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-04 15:53:38 +08:00
maryannxue	8fabbab299	[SPARK-29350] Fix BroadcastExchange reuse in Dynamic Partition Pruning ### What changes were proposed in this pull request? Dynamic partition pruning filters are added as an in-subquery containing a `BroadcastExchangeExec` in case of a broadcast hash join. This PR makes the `ReuseExchange` rule visit in-subquery nodes, to ensure the new `BroadcastExchangeExec` added by dynamic partition pruning can be reused. ### Why are the changes needed? This initial dynamic partition pruning PR did not enable this reuse, which means a broadcast exchange would be executed twice, in the main query and in the DPP filter. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added broadcast exchange reuse check in `DynamicPartitionPruningSuite` Closes #26015 from maryannxue/exchange-reuse. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-03 16:11:32 -07:00
Nik Vanderhoof	6f687691ef	[SPARK-28962][SPARK-27297][SQL] Add overload for filter with index to functions object ### What changes were proposed in this pull request? Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`. ### Why are the changes needed? See: SPARK-28962 and SPARK-27297. Specifically ueshin pointing out the discrepency here: https://github.com/apache/spark/pull/24232#issuecomment-533288653 ### Does this PR introduce any user-facing change? ### How was this patch tested? Updated the these test suites: `test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite` and `org.apache.spark.sql.DataFrameFunctionsSuite` Closes #26007 from nvander1/add_index_overload_for_filter. Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-03 11:12:14 -07:00
Dongjoon Hyun	4e0e4e51c4	[MINOR][TESTS] Rename JSONBenchmark to JsonBenchmark ### What changes were proposed in this pull request? This PR renames `object JSONBenchmark` to `object JsonBenchmark` and the benchmark result file `JSONBenchmark-results.txt` to `JsonBenchmark-results.txt`. ### Why are the changes needed? Since the file name doesn't match with `object JSONBenchmark`, it makes a confusion when we run the benchmark. In addition, this makes the automation difficult. ``` $ find . -name JsonBenchmark.scala ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala ``` ``` $ build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JsonBenchmark" [info] Running org.apache.spark.sql.execution.datasources.json.JsonBenchmark [error] Error: Could not find or load main class org.apache.spark.sql.execution.datasources.json.JsonBenchmark ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is just renaming. Closes #26008 from dongjoon-hyun/SPARK-RENAME-JSON. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 09:02:06 -07:00
Dongjoon Hyun	854a0f752e	[SPARK-29320][TESTS] Compare `sql/core` module in JDK8/11 (Part 1) ### What changes were proposed in this pull request? This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the result. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered as a rough comparison. A. EXPECTED CASES(JDK11 is faster in general) - [x] BloomFilterBenchmark (JDK11 is faster except one case) - [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC) - [x] CSVBenchmark (JDK11 is faster except five cases) - [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`) - [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type) - [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases) - [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS) - [x] HashedRelationMetricsBenchmark (JDK11 is faster) - [x] JSONBenchmark (JDK11 is much faster except eight cases) - [x] JoinBenchmark (JDK11 is faster except five cases) - [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases) - [x] PrimitiveArrayBenchmark (JDK11 is faster) - [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case) - [x] UDFBenchmark (N/A, values are too small) - [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case) - [x] WideTableBenchmark (JDK11 is faster except two cases) B. CASES WE NEED TO INVESTIGATE MORE LATER - [x] AggregateBenchmark (JDK11 is slower in general) - [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`) - [x] DataSourceReadBenchmark (JDK11 is slower in general) - [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`) - [x] MakeDateTimeBenchmark (JDK11 is slower except two cases) - [x] MiscBenchmark (JDK11 is slower except ten cases) - [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower) - [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases) - [x] RangeBenchmark (JDK11 is slower except one case) `FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because it took long timer. ### Why are the changes needed? According to the result, there are some difference between JDK8/JDK11. This will be a baseline for the future improvement and comparison. Also, as a reproducible environment, the following environment is used. - Instance: `r3.xlarge` - OS: `CentOS Linux release 7.5.1804 (Core)` - JDK: - `OpenJDK Runtime Environment (build 1.8.0_222-b10)` - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only PR. We need to run benchmark. Closes #26003 from dongjoon-hyun/SPARK-29320. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 08:58:25 -07:00
Sean Owen	7aca0dd658	[SPARK-29296][BUILD][CORE] Remove use of .par to make 2.13 support easier; add scala-2.13 profile to enable pulling in par collections library separately, for the future ### What changes were proposed in this pull request? Scala 2.13 removes the parallel collections classes to a separate library, so first, this establishes a `scala-2.13` profile to bring it back, for future use. However the library enables use of `.par` implicit conversions via a new class that is not in 2.12, which makes cross-building hard. This implements a suggested workaround from https://github.com/scala/scala-parallel-collections/issues/22 to avoid `.par` entirely. ### Why are the changes needed? To compile for 2.13 and later to work with 2.13. ### Does this PR introduce any user-facing change? Should not, no. ### How was this patch tested? Existing tests. Closes #25980 from srowen/SPARK-29296. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-03 08:56:08 -05:00
s71955	ee66890f30	[SPARK-28084][SQL] Resolving the partition column name based on the resolver in sql load command ### What changes were proposed in this pull request? LOAD DATA command resolves the partition column name as case sensitive manner, where as in insert commandthe partition column name will be resolved using the SQLConf resolver where the names will be resolved based on `spark.sql.caseSensitive` property. Same logic can be applied for resolving the partition column names in LOAD COMMAND. ### Why are the changes needed? It's to handle the partition column name correctly according to the configuration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT and manual testing. Closes #24903 from sujith71955/master_paritionColName. Lead-authored-by: s71955 <sujithchacko.2010@gmail.com> Co-authored-by: sujith71955 <sujithchacko.2010@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 01:11:48 -07:00
HyukjinKwon	40485f4656	[SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan ### What changes were proposed in this pull request? This PR proposes to avoid abstract classes introduced at https://github.com/apache/spark/pull/24965 but instead uses trait and object. - `abstract class BaseArrowPythonRunner` -> `trait PythonArrowOutput` to allow mix-in Before: ``` BasePythonRunner ├── BaseArrowPythonRunner │ ├── ArrowPythonRunner │ └── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` After: ``` └── BasePythonRunner ├── ArrowPythonRunner ├── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` - `abstract class BasePandasGroupExec ` -> `object PandasGroupUtils` to decouple Before: ``` └── BasePandasGroupExec ├── FlatMapGroupsInPandasExec └── FlatMapCoGroupsInPandasExec ``` After: ``` ├── FlatMapGroupsInPandasExec └── FlatMapCoGroupsInPandasExec ``` ### Why are the changes needed? The problem is that R code path is being matched with Python side: Python: ``` └── BasePythonRunner ├── ArrowPythonRunner ├── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` R: ``` └── BaseRRunner ├── ArrowRRunner └── RRunner ``` I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. Internal implementation is also similar intentionally. `BasePandasGroupExec` case is similar as well. R (with Arrow optimization, in particular) has some duplicated codes with Pandas UDFs. `FlatMapGroupsInRWithArrowExec` <> `FlatMapGroupsInPandasExec` `MapPartitionsInRWithArrowExec` <> `ArrowEvalPythonExec` In order to prepare deduplication here as well, it might better avoid changing hierarchy alone in Python side. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Locally tested existing tests. Jenkins tests should verify this too. Closes #25989 from HyukjinKwon/SPARK-29317. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-03 16:42:37 +09:00
Henry D	51d6ba7490	[SPARK-28962][SQL] Provide index argument to filter lambda functions ### What changes were proposed in this pull request? Lambda functions to array `filter` can now take as input the index as well as the element. This behavior matches array `transform`. ### Why are the changes needed? See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays. ### Does this PR introduce any user-facing change? Previously filter lambdas had to look like `filter(arr, el -> whatever)` Now, lambdas can take an index argument as well `filter(array, (el, idx) -> whatever)` ### How was this patch tested? I added unit tests to `HigherOrderFunctionsSuite`. Closes #25666 from henrydavidge/filter-idx. Authored-by: Henry D <henrydavidge@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-02 13:03:06 -07:00
Nik Vanderhoof	730a17823f	[SPARK-27297][SQL] Add higher order functions to scala API ## What changes were proposed in this pull request? There is currently no existing Scala API equivalent for the higher order functions introduced in Spark 2.4.0. * transform * aggregate * filter * exists * forall * zip_with * map_zip_with * map_filter * transform_values * transform_keys Equivalent column based functions should be added to the Scala API for org.apache.spark.sql.functions with the following signatures: ```scala def transform(column: Column, f: Column => Column): Column = ??? def transform(column: Column, f: (Column, Column) => Column): Column = ??? def exists(column: Column, f: Column => Column): Column = ??? def filter(column: Column, f: Column => Column): Column = ??? def aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column, finish: Column => Column): Column = ??? def aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column): Column = ??? def zip_with( left: Column, right: Column, f: (Column, Column) => Column): Column = ??? def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ??? def transform_values(expr: Column, f: (Column, Column) => Column): Column = ??? def map_filter(expr: Column, f: (Column, Column) => Column): Column = ??? def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column = ??? ``` ## How was this patch tested? I've mimicked the existing tests for the higher order functions in `org.apache.spark.sql.DataFrameFunctionsSuite` that use `expr` to test the higher order functions. As an example of an existing test: ```scala test("map_zip_with function - map of primitive types") { val df = Seq( (Map(8 -> 6L, 3 -> 5L, 6 -> 2L), Map[Integer, Integer]((6, 4), (8, 2), (3, 2))), (Map(10 -> 6L, 8 -> 3L), Map[Integer, Integer]((8, 4), (4, null))), (Map.empty[Int, Long], Map[Integer, Integer]((5, 1))), (Map(5 -> 1L), null) ).toDF("m1", "m2") checkAnswer(df.selectExpr("map_zip_with(m1, m2, (k, v1, v2) -> k == v1 + v2)"), Seq( Row(Map(8 -> true, 3 -> false, 6 -> true)), Row(Map(10 -> null, 8 -> false, 4 -> null)), Row(Map(5 -> null)), Row(null))) } ``` I've added this test that performs the same logic, but with the new column based API I've added. ```scala checkAnswer(df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2)), Seq( Row(Map(8 -> true, 3 -> false, 6 -> true)), Row(Map(10 -> null, 8 -> false, 4 -> null)), Row(Map(5 -> null)), Row(null))) ``` Closes #24232 from nvander1/feature/add_higher_order_functions_to_scala_api. Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Co-authored-by: Nik <nikolasrvanderhoof@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-02 12:53:39 -07:00
Terry Kim	f2ead4d0b5	[SPARK-28970][SQL] Implement USE CATALOG/NAMESPACE for Data Source V2 ### What changes were proposed in this pull request? This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#) It also exposes `currentCatalog` in `CatalogManager`. Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog). ### Why are the changes needed? There is currently no mechanism to change current catalog/namespace thru SQL commands. ### Does this PR introduce any user-facing change? Yes, you can perform the following: ```scala // Sets the current catalog to 'testcat' spark.sql("USE CATALOG testcat") // Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'. spark.sql("USE ns1.ns2 IN testcat") // Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace. spark.sql("SHOW NAMESPACES") ``` ### How was this patch tested? Added new unit tests. Closes #25771 from imback82/use_namespace. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-02 21:55:21 +08:00
Maxim Gekk	3b1674cb1f	[SPARK-29313][SQL] Fix failure on writing to `noop` in benchmarks ### What changes were proposed in this pull request? In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks: - JsonBenchmark - CSVBenchmark - UDFBenchmark - MakeDateTimeBenchmark - ExtractBenchmark - DateTimeBenchmark - NestedSchemaPruningBenchmark ### Why are the changes needed? Otherwise writing to `noop` fails with: ``` [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.; [error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284) ``` most likely due to https://github.com/apache/spark/pull/25876 ### Does this PR introduce any user-facing change? No ### How was this patch tested? I generated results of `ExtractBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark" ``` Closes #25988 from MaxGekk/noop-overwrite-mode. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-01 21:04:56 -07:00

1 2 3 4 5 ...

6027 commits