Commit graph

9495 commits

Kent Yao 311fe6a880 [SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in DateExpressionsSuite
### What changes were proposed in this pull request?

This PR modifies some codegen-related tests to test escape characters for datetime functions that are time zone aware. If the time zone is absent, the formatter can produce `null` caused by `java.util.NoSuchElementException: None.get`, bypassing the real intent of those test cases.
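A hedged sketch of the kind of fix described above (the expression and values are illustrative; the real tests use Spark's Catalyst test utilities):

```scala
// Illustrative only: construct a time-zone-aware datetime expression with an explicit
// zoneId so the formatter never hits `None.get` for a missing time zone.
import org.apache.spark.sql.catalyst.expressions.{DateFormatClass, Literal}

val withZone = DateFormatClass(
  Literal(new java.sql.Timestamp(0L)),   // input timestamp
  Literal("yyyy-MM-dd"),                 // pattern under test (escape handling, etc.)
  Some("America/Los_Angeles"))           // zoneId supplied explicitly
```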

### Why are the changes needed?

fix tests

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing the modified test cases.

Closes #28653 from yaooqinn/SPARK-31835.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-27 17:26:07 +00:00
Ali Afroozeh f6f1e51072 [SPARK-31719][SQL] Refactor JoinSelection
### What changes were proposed in this pull request?
This PR extracts the logic for selecting the planned join type out of the `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst.

### Why are the changes needed?
This change both cleans up the code in `JoinSelection` and keeps the logic in one place so it can be used from other rules that need to make decisions based on the join type before planning time.

### Does this PR introduce _any_ user-facing change?
`BuildSide`, `BuildLeft`, and `BuildRight` are moved from `org.apache.spark.sql.execution` to Catalyst in `org.apache.spark.sql.catalyst.optimizer`.

### How was this patch tested?
This is a refactoring, passes existing tests.

Closes #28540 from dbaliafroozeh/RefactorJoinSelection.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-27 15:49:08 +00:00
iRakson 765105b6f1 [SPARK-31638][WEBUI] Clean Pagination code for all webUI pages
### What changes were proposed in this pull request?

Pagination code across pages needs to be cleaned.
I have tried to clean up the following:
* Unused methods
* Unused method arguments
* Redundant `if` expressions
* Indentation issues

### Why are the changes needed?
This fix will make code more readable and remove unnecessary methods and variables.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually

Closes #28448 from iRakson/refactorPagination.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-27 08:59:08 -05:00
beliefer 8f2b6f3a0b [SPARK-31393][SQL][FOLLOW-UP] Show the correct alias in schema for expression
### What changes were proposed in this pull request?
Some expression aliases do not display correctly in the schema. This PR fixes them.
- `ln`
- `rint`
- `lcase`
- `position`

### Why are the changes needed?
Improve the implementation of some expressions.

### Does this PR introduce _any_ user-facing change?
Yes. This PR will let users see the correct alias in the schema.

### How was this patch tested?
Jenkins test.

Closes #28551 from beliefer/show-correct-alias-in-schema.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-27 15:05:06 +09:00
Max Gekk 87d34e6b96 [SPARK-31820][SQL][TESTS] Fix flaky JavaBeanDeserializationSuite
### What changes were proposed in this pull request?
Modified the formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with a **zero** seconds fraction. The current implementation outputs `.0` but it must be an empty string. From the SPARK-31820 failure:
- should be `2020-05-25 12:39:17`
- but the incorrect expected string is `2020-05-25 12:39:17.0`

### Why are the changes needed?
To make `JavaBeanDeserializationSuite` stable, and avoid test failures like https://github.com/apache/spark/pull/28630#issuecomment-633695723

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I changed 7dff3b125d/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java (L207) to
```java
new java.sql.Timestamp((System.currentTimeMillis() / 1000) * 1000),
```
to force zero seconds fraction.

Closes #28639 from MaxGekk/fix-JavaBeanDeserializationSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-26 12:13:28 +00:00
Dilip Biswal b44acee953 [SPARK-31673][SQL] QueryExecution.debug.toFile() to take an additional explain mode param
### What changes were proposed in this pull request?
Currently, QueryExecution.debug.toFile dumps the query plan information in a fixed format. This PR adds an additional explain mode parameter that writes the debug information in the user-supplied format.
```
df.queryExecution.debug.toFile("/tmp/plan.txt", explainMode = ExplainMode.fromString("formatted"))
```
```
== Physical Plan ==
* Filter (2)
+- Scan hive default.s1 (1)

(1) Scan hive default.s1
Output [2]: [c1#15, c2#16]
Arguments: [c1#15, c2#16], HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#15, c2#16]

(2) Filter [codegen id : 1]
Input [2]: [c1#15, c2#16]
Condition : (isnotnull(c1#15) AND (c1#15 > 0))

== Whole Stage Codegen ==
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 (maxMethodCodeSize:220; maxConstantPoolSize:105(0.16% used); numInnerClasses:0) ==
*(1) Filter (isnotnull(c1#15) AND (c1#15 > 0))
+- Scan hive default.s1 [c1#15, c2#16], HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#15, c2#16]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator inputadapter_input_0;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 011 */
/* 012 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
/* 013 */     this.references = references;
/* 014 */   }
/* 015 */
/* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 017 */     partitionIndex = index;
/* 018 */     this.inputs = inputs;
/* 019 */     inputadapter_input_0 = inputs[0];
/* 020 */     filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 021 */
/* 022 */   }
/* 023 */
/* 024 */   protected void processNext() throws java.io.IOException {
/* 025 */     while ( inputadapter_input_0.hasNext()) {
/* 026 */       InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next();
/* 027 */
/* 028 */       do {
/* 029 */         boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0);
/* 030 */         int inputadapter_value_0 = inputadapter_isNull_0 ?
/* 031 */         -1 : (inputadapter_row_0.getInt(0));
/* 032 */
/* 033 */         boolean filter_value_2 = !inputadapter_isNull_0;
/* 034 */         if (!filter_value_2) continue;
/* 035 */
/* 036 */         boolean filter_value_3 = false;
/* 037 */         filter_value_3 = inputadapter_value_0 > 0;
/* 038 */         if (!filter_value_3) continue;
/* 039 */
/* 040 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 041 */
/* 042 */         boolean inputadapter_isNull_1 = inputadapter_row_0.isNullAt(1);
/* 043 */         int inputadapter_value_1 = inputadapter_isNull_1 ?
/* 044 */         -1 : (inputadapter_row_0.getInt(1));
/* 045 */         filter_mutableStateArray_0[0].reset();
/* 046 */
/* 047 */         filter_mutableStateArray_0[0].zeroOutNullBytes();
/* 048 */
/* 049 */         filter_mutableStateArray_0[0].write(0, inputadapter_value_0);
/* 050 */
/* 051 */         if (inputadapter_isNull_1) {
/* 052 */           filter_mutableStateArray_0[0].setNullAt(1);
/* 053 */         } else {
/* 054 */           filter_mutableStateArray_0[0].write(1, inputadapter_value_1);
/* 055 */         }
/* 056 */         append((filter_mutableStateArray_0[0].getRow()));
/* 057 */
/* 058 */       } while(false);
/* 059 */       if (shouldStop()) return;
/* 060 */     }
/* 061 */   }
/* 062 */
/* 063 */ }
```
### Why are the changes needed?
Hopefully enhances the usability of debug.toFile(..)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test in QueryExecutionSuite

Closes #28493 from dilipbiswal/write_to_file.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-26 14:40:58 +09:00
Max Gekk 7e4f5bbd8a [SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version
### What changes were proposed in this pull request?
1. Add the following parquet files to the resource folder `sql/core/src/test/resources/test-data`:
   - Files saved by Spark 2.4.5 (cee4ecbb16) without meta info `org.apache.spark.version`
      - `before_1582_date_v2_4_5.snappy.parquet` with 2 date columns of the type **INT32 L:DATE** - `PLAIN` (8 date values of `1001-01-01`) and `PLAIN_DICTIONARY` (`1001-01-01`..`1001-01-08`).
      - `before_1582_timestamp_micros_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT64 L:TIMESTAMP(MICROS,true)** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`).
      - `before_1582_timestamp_millis_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT64 L:TIMESTAMP(MILLIS,true)** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123`..`1001-01-08 01:02:03.123`).
      - `before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT96** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`).
      - `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT96** - `PLAIN_DICTIONARY` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`).
    - Files saved by Spark 2.4.6-rc3 (570848da7c) with the meta info `org.apache.spark.version = 2.4.6`:
      - `before_1582_date_v2_4_6.snappy.parquet` replaces `before_1582_date_v2_4.snappy.parquet`. And it is similar to `before_1582_date_v2_4_5.snappy.parquet` except Spark version in parquet meta info.
      - `before_1582_timestamp_micros_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_micros_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_micros_v2_4_5.snappy.parquet` except meta info.
      - `before_1582_timestamp_millis_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_millis_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_millis_v2_4_5.snappy.parquet` except meta info.
      - `before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet` is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info.
      - `before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_int96_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info.
2. Add new test "generate test files for checking compatibility with Spark 2.4" to `ParquetIOSuite` (marked as ignored). The parquet files above were generated by this test.
3. Modified "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in `ParquetIOSuite` to use new parquet files.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetIOSuite`.

Closes #28630 from MaxGekk/parquet-files-update.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-26 05:15:51 +00:00
Prakhar Jain 452594f5a4 [SPARK-31810][TEST] Fix AlterTableRecoverPartitions test using incorrect api to modify RDD_PARALLEL_LISTING_THRESHOLD
### What changes were proposed in this pull request?
Use the correct API in AlterTableRecoverPartition tests to modify the `RDD_PARALLEL_LISTING_THRESHOLD` conf.

### Why are the changes needed?
The existing AlterTableRecoverPartitions test modifies RDD_PARALLEL_LISTING_THRESHOLD as a SQLConf using the withSQLConf API. Since this is not a SQLConf, it is not overridden, so the test does not end up testing the required behaviour.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is a UT fix. UTs are still passing after the fix.

Closes #28634 from prakharjain09/SPARK-31810-fix-recover-partitions.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-26 14:13:02 +09:00
HyukjinKwon df2a1fe131
[SPARK-31808][SQL] Makes struct function's output name and class name pretty
### What changes were proposed in this pull request?

This PR proposes to set the alias, and class name in its `ExpressionInfo` for `struct`.
- Class name in `ExpressionInfo`
  - from: `org.apache.spark.sql.catalyst.expressions.NamedStruct`
  - to:`org.apache.spark.sql.catalyst.expressions.CreateNamedStruct`
- Alias name: `named_struct(col1, v, ...)` -> `struct(v, ...)`

This PR takes over https://github.com/apache/spark/pull/28631

### Why are the changes needed?

To show the correct output name and class names to users.

### Does this PR introduce _any_ user-facing change?

Yes.

**Before:**

```scala
scala> sql("DESC FUNCTION struct").show(false)
+------------------------------------------------------------------------------------+
|function_desc                                                                       |
+------------------------------------------------------------------------------------+
|Function: struct                                                                    |
|Class: org.apache.spark.sql.catalyst.expressions.NamedStruct                        |
|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.|
+------------------------------------------------------------------------------------+
```

```scala
scala> sql("SELECT struct(1, 2)").show(false)
+------------------------------+
|named_struct(col1, 1, col2, 2)|
+------------------------------+
|[1, 2]                        |
+------------------------------+
```

**After:**

```scala
scala> sql("DESC FUNCTION struct").show(false)
+------------------------------------------------------------------------------------+
|function_desc                                                                       |
+------------------------------------------------------------------------------------+
|Function: struct                                                                    |
|Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct                  |
|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.|
+------------------------------------------------------------------------------------+
```

```scala
scala> sql("SELECT struct(1, 2)").show(false)
+------------+
|struct(1, 2)|
+------------+
|[1, 2]      |
+------------+
```

### How was this patch tested?

Manually tested, and Jenkins tests.

Closes #28633 from HyukjinKwon/SPARK-31808.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-25 20:36:00 -07:00
Max Gekk 6c80ebbccb
[SPARK-31818][SQL] Fix pushing down filters with java.time.Instant values in ORC
### What changes were proposed in this pull request?
Convert `java.time.Instant` to `java.sql.Timestamp` in pushed-down filters to the ORC datasource when the Java 8 time API is enabled.
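A minimal sketch of that conversion (illustrative helper, not the actual `OrcFilters` code):

```scala
import java.sql.Timestamp
import java.time.Instant

// ORC's SearchArgument/PredicateLeaf API only accepts java.sql.Timestamp for TIMESTAMP
// columns, so an Instant pushed down with the Java 8 time API must be converted first.
def toOrcTimestampLiteral(value: Any): Any = value match {
  case instant: Instant => Timestamp.from(instant)
  case other            => other
}
```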

### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
java.lang.IllegalArgumentException: Wrong value class java.time.Instant for TIMESTAMP.EQUALS leaf
 at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
 at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added tests to `OrcFilterSuite`.

Closes #28636 from MaxGekk/orc-timestamp-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-25 18:36:02 -07:00
Kent Yao 695cb617d4 [SPARK-31771][SQL] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
### What changes were proposed in this pull request?

Five consecutive pattern characters of 'G/M/L/E/u/Q/q' mean Narrow-Text Style now that we use `java.time.DateTimeFormatterBuilder` since 3.0.0, which outputs only the leading letter of the value, e.g. `December` becomes `D`. In Spark 2.4 they mean Full-Text Style.

In this PR, we explicitly disable Narrow-Text Style for these pattern characters.
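For illustration, a hedged plain `java.time` example (not Spark's formatter) of the Narrow vs. Full text style difference for five pattern letters:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val d = LocalDate.of(2020, 12, 1)
DateTimeFormatter.ofPattern("MMMM", Locale.US).format(d)   // "December" (Full text style)
DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(d)  // "D"        (Narrow text style)
```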

### Why are the changes needed?

Without this change, there will be a silent data change.

### Does this PR introduce _any_ user-facing change?

Yes. Queries with datetime operations using datetime patterns, e.g. `G/M/L/E/u`, will fail if the pattern length is 5; other patterns, e.g. 'k', 'm', also accept only a certain number of letters.

1. Datetime patterns that are not supported by the new parser but are supported by the legacy one will get a SparkUpgradeException, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa". Two options are given to end users: one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns.

2. Datetime patterns that are supported by neither the new parser nor the legacy one, e.g. "QQQQQ", "qqqqq", will get an IllegalArgumentException, which is caught by Spark internally and results in NULL for end users.

### How was this patch tested?

add unit tests

Closes #28592 from yaooqinn/SPARK-31771.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 15:07:41 +00:00
Max Gekk 92685c0148 [SPARK-31755][SQL][FOLLOWUP] Update date-time, CSV and JSON benchmark results
### What changes were proposed in this pull request?
Re-generate results of:
- DateTimeBenchmark
- CSVBenchmark
- JsonBenchmark

in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?
1. The PR https://github.com/apache/spark/pull/28576 changed the date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing.
2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a baseline for future optimizations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running benchmarks via the script:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #28613 from MaxGekk/missing-hour-year-benchmarks.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 15:00:11 +00:00
Kent Yao 0df8dd6073 [SPARK-30352][SQL] DataSourceV2: Add CURRENT_CATALOG function
### What changes were proposed in this pull request?

As we support multiple catalogs with DataSourceV2, we may need the `CURRENT_CATALOG` value expression from the SQL standard.

`CURRENT_CATALOG` is a general value specification in the SQL Standard, described as:

> The value specified by CURRENT_CATALOG is the character string that represents the current default catalog name.
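A hedged usage sketch (assuming an active `SparkSession` named `spark`):

```scala
// Returns the name of the current default catalog, e.g. "spark_catalog".
spark.sql("SELECT current_catalog()").show(false)
```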

### Why are the changes needed?
Improve catalog v2 compliance with the ANSI SQL standard.

### Does this PR introduce any user-facing change?
Yes, adds a new function `current_catalog()` that returns the current active catalog.

### How was this patch tested?

Added a unit test.

Closes #27006 from yaooqinn/SPARK-30352.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 14:27:47 +00:00
Max Gekk 7f36310500 [SPARK-31802][SQL] Format Java date-time types in Row.jsonValue directly
### What changes were proposed in this pull request?
Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR https://github.com/apache/spark/pull/28582 added the methods to avoid conversions to days and microseconds.

### Why are the changes needed?
To avoid the unnecessary overhead of converting Java date-time types to micros/days before formatting. Also, formatters have to convert input micros/days back to Java types to pass instances to the standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `RowJsonSuite`.

Closes #28620 from MaxGekk/toJson-format-Java-datetime-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-25 12:50:38 +09:00
rishi b90e10c546 [SPARK-31377][SQL][TEST] Added unit tests to 'number of output rows metric' for some joins in SQLMetricSuite
### What changes were proposed in this pull request?
Add unit tests to the 'number of output rows metric' for some join types in the SQLMetricSuite. The unit tests added are as follows:
- ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi
- BroadcastNestedLoopJoin: RightOuter
- BroadcastHashJoin: LeftAnti

### Why are the changes needed?
For some combinations of JoinType and Join algorithm there is no test coverage for the 'number of output rows' metric.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
I added debug statements in the code to ensure the correct combination of JoinType and join algorithm is triggered.
I further used the IntelliJ debugger to verify the same.

Closes #28330 from sririshindra/SPARK-31377.

Authored-by: rishi <spothireddi@cloudera.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-25 12:44:14 +09:00
sandeep katta cf7463f309 [SPARK-31761][SQL] cast integer to Long to avoid IntegerOverflow for IntegralDivide operator
### What changes were proposed in this pull request?
The `IntegralDivide` operator returns the Long data type, so the integer overflow case should be handled.
If the operands are of type Int, they are cast to Long.
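A hedged, plain-Scala illustration (not the Spark operator itself) of the overflow that widening to Long avoids:

```scala
// Int.MinValue / -1 is the one Int division that overflows: the mathematically
// correct result (2147483648) does not fit in 32 bits, so it wraps around.
val wrapped = Int.MinValue / -1           // == Int.MinValue (-2147483648)
val widened = Int.MinValue.toLong / -1L   // == 2147483648L, correct once cast to Long
```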

### Why are the changes needed?
As `IntegralDivide` returns the Long data type, integer overflow should not happen.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT and also tested in the local cluster

After fix

![image](https://user-images.githubusercontent.com/35216143/82603361-25eccc00-9bd0-11ea-9ca7-001c539e628b.png)

SQL Test

After fix
![image](https://user-images.githubusercontent.com/35216143/82637689-f0250300-9c22-11ea-85c3-886ab2c23471.png)

Before Fix
![image](https://user-images.githubusercontent.com/35216143/82637984-878a5600-9c23-11ea-9e47-5ce2fb923c01.png)

Closes #28600 from sandeep-katta/integerOverFlow.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-24 14:50:11 +09:00
Gengliang Wang 9fdc2a0801 [SPARK-31793][SQL] Reduce the memory usage in file scan location metadata
### What changes were proposed in this pull request?

Currently, the data source scan node stores all the paths in its metadata. The metadata is kept when a SparkPlan is converted into SparkPlanInfo. SparkPlanInfo can be used to construct the Spark plan graph in UI.

However, the paths can be very large (e.g. there can be many partitions after partition pruning), while UI pages only require up to 100 bytes for the location metadata. We can reduce the paths stored in the metadata to reduce memory usage.
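A hedged sketch of the truncation idea (illustrative helper, not the actual scan node code):

```scala
// Keep only as many paths as fit into a small character budget and mark the rest
// with an ellipsis, instead of storing every partition path in the metadata.
def truncatedLocation(paths: Seq[String], maxChars: Int = 100): String = {
  var used = 0
  val kept = paths.takeWhile { p => used += p.length + 2; used <= maxChars }
  val suffix = if (kept.length < paths.length) ", ..." else ""
  kept.mkString("[", ", ", suffix + "]")
}
```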

### Why are the changes needed?

Reduce unnecessary memory cost.
In the heap dump of a driver, the SparkPlanInfo instances are quite large, and this should be avoided:
![image](https://user-images.githubusercontent.com/1097932/82642318-8f65de00-9bc2-11ea-9c9c-f05c2b0e1c49.png)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #28610 from gengliangwang/improveLocationMetadata.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-23 15:00:28 -07:00
iRakson fbb3144a9c [SPARK-31642] Add Pagination Support for Structured Streaming Page
### What changes were proposed in this pull request?
Add Pagination Support for structured streaming page. Now both tables `Active Queries` and `Completed Queries` will have pagination.
To implement pagination, the pagination framework from #7399 is used.
* Also, tables are only shown if there is at least one entry in the table.

### Why are the changes needed?
* This will help users analyse their structured streaming queries in a much better way.
* Other Web UI pages support pagination in their table. So this will make web UI more consistent across pages.
* This can prevent potential OOM errors.

### Does this PR introduce _any_ user-facing change?
Yes. Both tables will support pagination.

### How was this patch tested?
Manually. I will add snapshots soon.

Closes #28485 from iRakson/SPARK-31642.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-05-23 17:17:53 +09:00
Takeshi Yamamuro 7ca73f03fb [SPARK-29854][SQL][TESTS] Add tests to check lpad/rpad throw an exception for invalid length input
### What changes were proposed in this pull request?

This PR intends to add trivial tests to check that https://github.com/apache/spark/pull/27024 has already been fixed in master.

Closes #27024

### Why are the changes needed?

For test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28604 from maropu/SPARK-29854.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-23 08:48:29 +09:00
Jungtaek Lim (HeartSaVioR) 5a258b0b67
[SPARK-30915][SS] CompactibleFileStreamLog: Avoid reading the metadata log file when finding the latest batch ID
### What changes were proposed in this pull request?

This patch adds a new method, `getLatestBatchId()`, to CompactibleFileStreamLog as a complement to getLatest(). It does not read the content of the latest batch metadata log file, and it is applied to both FileStreamSource and FileStreamSink to avoid unnecessary latency when reading the log file.
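A hedged sketch of the idea (assumed shape, not the real `CompactibleFileStreamLog` method): the latest batch id can be derived from the metadata log file names alone, without deserializing the latest file.

```scala
import org.apache.hadoop.fs.FileStatus
import scala.util.Try

// Metadata log files are named by batch id (compact batches carry a ".compact" suffix),
// so the maximum parsed file name gives the latest batch id.
def getLatestBatchId(logFiles: Seq[FileStatus]): Option[Long] =
  logFiles
    .map(_.getPath.getName)
    .flatMap(name => Try(name.stripSuffix(".compact").toLong).toOption)
    .reduceOption(_ max _)
```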

### Why are the changes needed?

Once the compacted metadata log file becomes huge, writing outputs for the compact + 1 batch is also affected because the compacted metadata log file is read unnecessarily. This unnecessary latency can be simply avoided.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT. Also manually tested under query which has huge metadata log on file stream sink:

> before applying the patch

![Screen Shot 2020-02-21 at 4 20 19 PM](https://user-images.githubusercontent.com/1317309/75016223-d3ffb180-54cd-11ea-9063-49405943049d.png)

> after applying the patch

![Screen Shot 2020-02-21 at 4 06 18 PM](https://user-images.githubusercontent.com/1317309/75016220-d235ee00-54cd-11ea-81a7-7c03a43c4db4.png)

Peaks are compact batches - please compare the next batch after compact batches, especially the area of "light brown".

Closes #27664 from HeartSaVioR/SPARK-30915.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-05-22 16:46:17 -07:00
TJX2014 2115c55efe [SPARK-31710][SQL] Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
### What changes were proposed in this pull request?
Add and register three new functions: `TIMESTAMP_SECONDS`, `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS`
A test is added.

Reference: [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds)

### Why are the changes needed?
Users will have a convenient way to get timestamps from seconds, milliseconds and microseconds.

### Does this PR introduce _any_ user-facing change?
Yes, users will have the following ways to get timestamps:

```scala
sql("select TIMESTAMP_SECONDS(t.a) as timestamp from values(1230219000),(-1230219000) as t(a)").show(false)
```
```
+-------------------+
|timestamp          |
+-------------------+
|2008-12-25 23:30:00|
|1931-01-07 16:30:00|
+-------------------+
```
```scala
sql("select TIMESTAMP_MILLIS(t.a) as timestamp from values(1230219000123),(-1230219000123) as t(a)").show(false)
```
```
+-----------------------+
|timestamp              |
+-----------------------+
|2008-12-25 23:30:00.123|
|1931-01-07 16:29:59.877|
+-----------------------+
```
```scala
sql("select TIMESTAMP_MICROS(t.a) as timestamp from values(1230219000123123),(-1230219000123123) as t(a)").show(false)
```
```
+--------------------------+
|timestamp                 |
+--------------------------+
|2008-12-25 23:30:00.123123|
|1931-01-07 16:29:59.876877|
+--------------------------+
```
### How was this patch tested?
Unit test.

Closes #28534 from TJX2014/master-SPARK-31710.

Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-22 14:16:30 +00:00
Wenchen Fan ce4da29ec3 [SPARK-31755][SQL] allow missing year/hour when parsing date/timestamp string
### What changes were proposed in this pull request?

This PR allows missing hour fields when parsing date/timestamp string, with 0 as the default value.

If the year field is missing, this PR still fails the query by default, but provides a new legacy config to allow it and uses 1970 as the default value. It's not a good default value, as 1970 is not a leap year, which means it would never parse Feb 29. We just pick it for backward compatibility.

### Why are the changes needed?

To keep backward compatibility with Spark 2.4.

### Does this PR introduce _any_ user-facing change?

Yes.

Spark 2.4:
```
scala> sql("select to_timestamp('16', 'dd')").show
+------------------------+
|to_timestamp('16', 'dd')|
+------------------------+
|     1970-01-16 00:00:00|
+------------------------+

scala> sql("select to_date('16', 'dd')").show
+-------------------+
|to_date('16', 'dd')|
+-------------------+
|         1970-01-16|
+-------------------+

scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+----------------------------------+
|to_timestamp('2019 40', 'yyyy mm')|
+----------------------------------+
|               2019-01-01 00:40:00|
+----------------------------------+

scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+----------------------------------------------+
|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')|
+----------------------------------------------+
|                           2019-01-01 10:10:10|
+----------------------------------------------+
```

in branch 3.0
```
scala> sql("select to_timestamp('16', 'dd')").show
+--------------------+
|to_timestamp(16, dd)|
+--------------------+
|                null|
+--------------------+

scala> sql("select to_date('16', 'dd')").show
+---------------+
|to_date(16, dd)|
+---------------+
|           null|
+---------------+

scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+------------------------------+
|to_timestamp(2019 40, yyyy mm)|
+------------------------------+
|           2019-01-01 00:00:00|
+------------------------------+

scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+------------------------------------------+
|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)|
+------------------------------------------+
|                       2019-01-01 00:00:00|
+------------------------------------------+
```

After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled.

### How was this patch tested?

new tests

Closes #28576 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-22 16:10:08 +09:00
Max Gekk 60118a2426 [SPARK-31785][SQL][TESTS] Add a helper function to test all parquet readers
### What changes were proposed in this pull request?
Add `withAllParquetReaders` to `ParquetTest`. The function allows running a block of code against all available Parquet readers.
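A minimal sketch of such a helper, assuming the standard `withSQLConf` test utility is in scope (shape inferred from the description, not copied from `ParquetTest`):

```scala
import org.apache.spark.sql.internal.SQLConf

// Runs the given block twice: once with the vectorized Parquet reader enabled
// and once with it disabled (falling back to the parquet-mr reader).
def withAllParquetReaders(code: => Unit): Unit = {
  Seq("true", "false").foreach { vectorized =>
    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized)(code)
  }
}
```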

### Why are the changes needed?
1. It simplifies tests.
2. It allows testing all Parquet readers that could be available in projects based on Apache Spark.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running affected test suites.

Closes #28598 from MaxGekk/add-withAllParquetReaders.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-22 09:53:35 +09:00
Gengliang Wang db5e5fce68 Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0"
This reverts commit 92877c4ef2.

Closes #28602 from gengliangwang/revertSPARK-31765.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-21 16:00:58 -07:00
Kousuke Saruta 92877c4ef2 [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty are also upgraded because of dependencies.
### Why are the changes needed?

Recently, a security issue which affects HtmlUnit was reported.
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is used for tests, so the impact might not be large, but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-21 11:43:25 -07:00
iRakson f1495c5bc0 [SPARK-31688][WEBUI] Refactor Pagination framework
### What changes were proposed in this pull request?
Currently while implementing pagination using the existing pagination framework, a lot of code is being copied as pointed out [here](https://github.com/apache/spark/pull/28485#pullrequestreview-408881656).

I introduced some changes in `PagedTable` which is the main trait for implementing the pagination.
* Added a function for getting table parameters.
* Added a function for the table header row. This will help maintain consistency across the tables; all header rows across tables are now consistent.

### Why are the changes needed?

* A lot of code is copied every time pagination is implemented for any table.
* Code readability is not great, as a lot of HTML is embedded.
* Paginating other tables will be a lot easier now.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually. This is mainly refactoring work, no new functionality introduced. Existing test cases should pass.

Closes #28512 from iRakson/refactorPaginationFramework.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-21 13:00:00 -05:00
Vinoo Ganesh dae79888dc [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener
## What changes were proposed in this pull request?

This change was made as a result of the conversation on https://issues.apache.org/jira/browse/SPARK-31354 and is intended to continue work from that ticket here.

This change fixes a memory leak where SparkSession listeners are never cleared off of the SparkContext listener bus.

Before running this PR, the following code:
```
SparkSession.builder().master("local").getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

SparkSession.builder().master("local").getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
```
would result in a SparkContext with the following listeners on the listener bus:
```
[org.apache.spark.status.AppStatusListener5f610071,
org.apache.spark.HeartbeatReceiverd400c17,
org.apache.spark.sql.SparkSession$$anon$125849aeb, <-First instance
org.apache.spark.sql.SparkSession$$anon$1fadb9a0] <- Second instance
```
After this PR, the execution of the same code above results in SparkContext with the following listeners on the listener bus:
```
[org.apache.spark.status.AppStatusListener5f610071,
org.apache.spark.HeartbeatReceiverd400c17,
org.apache.spark.sql.SparkSession$$anon$125849aeb] <-One instance
```
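A hedged sketch of the register-once pattern behind the single remaining listener above (names are illustrative, not the actual SparkSession internals):

```scala
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.SparkListener

// Register the shared ApplicationEnd listener at most once per SparkContext,
// instead of adding a fresh anonymous listener for every SparkSession created.
val listenerRegistered = new AtomicBoolean(false)

def registerContextListenerOnce(sc: SparkContext, listener: SparkListener): Unit = {
  if (listenerRegistered.compareAndSet(false, true)) {
    sc.addSparkListener(listener)
  }
}
```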
## How was this patch tested?

* Unit test included as a part of the PR

Closes #28128 from vinooganesh/vinooganesh/SPARK-27958.

Lead-authored-by: Vinoo Ganesh <vinoo.ganesh@gmail.com>
Co-authored-by: Vinoo Ganesh <vganesh@palantir.com>
Co-authored-by: Vinoo Ganesh <vinoo@safegraph.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-21 16:06:28 +00:00
Max Gekk 5d673319af [SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString
### What changes were proposed in this pull request?
1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
    - TimestampFormatter:
      - `def format(ts: Timestamp): String`
      - `def format(instant: Instant): String`
    - DateFormatter:
      - `def format(date: Date): String`
      - `def format(localDate: LocalDate): String`
2. Re-use the added methods from `HiveResult.toHiveString`
3. Borrow the code for formatting `java.sql.Timestamp` from Spark 2.4's `DateTimeUtils.timestampToString` into `FractionTimestampFormatter`, because legacy formatters don't support variable-length patterns for second fractions (the new overload shapes are sketched below).
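Shapes of the added overloads, sketched from the list above (signatures assumed, existing micros/days methods and bodies elided):

```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Only the new Java-type overloads described above are shown.
trait TimestampFormatterSketch {
  def format(ts: Timestamp): String
  def format(instant: Instant): String
}

trait DateFormatterSketch {
  def format(date: Date): String
  def format(localDate: LocalDate): String
}
```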

### Why are the changes needed?
To avoid the unnecessary overhead of converting Java date-time types to micros/days before formatting. Also, formatters have to convert input micros/days back to Java types to pass instances to the standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests for toHiveString and new tests in `TimestampFormatterSuite`.

Closes #28582 from MaxGekk/opt-format-old-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-21 04:01:19 +00:00
Ali Smesseim d40ecfa3f7
[SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener
### What changes were proposed in this pull request?

This is a recreation of #28155, which was reverted due to causing test failures.

The update methods in HiveThriftServer2Listener now check whether the given operation/session ID actually exists in the `sessionList` or `executionList`, respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log.
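A hedged sketch of that defensive-update pattern (illustrative names and state, not the actual listener code):

```scala
import scala.collection.mutable

// The real listener keeps richer SessionInfo/ExecutionInfo state; this is a minimal stand-in.
case class SessionInfo(sessionId: String, var finishTimestamp: Long = 0L)

val sessionList = new mutable.LinkedHashMap[String, SessionInfo]()

def onSessionClosed(sessionId: String, closeTime: Long): Unit =
  sessionList.get(sessionId) match {
    case Some(info) => info.finishTimestamp = closeTime
    case None =>
      // Previously this path dereferenced a missing entry and threw an NPE.
      println(s"WARN: onSessionClosed called with unknown session id: $sessionId")
  }
```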

To improve robustness, we also make the following changes in HiveSessionImpl.close():

- Catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed.
- Handle not being able to access the scratch directory. When closing, all `.pipeout` files are removed from the scratch directory, which would have resulted in an NPE if the directory does not exist.

### Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this changes the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes #28544 from alismess-db/hive-thriftserver-listener-update-safer-2.

Authored-by: Ali Smesseim <ali.smesseim@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-20 10:30:17 -07:00
Wenchen Fan 34414acfa3 [SPARK-31706][SQL] add back the support of streaming update mode
### What changes were proposed in this pull request?

This PR adds a private `WriteBuilder` mixin trait: `SupportsStreamingUpdate`, so that the builtin v2 streaming sinks can still support the update mode.

Note: it's private because we don't have a proper design yet. I didn't take the proposal in https://github.com/apache/spark/pull/23702#discussion_r258593059 because we may want something more general, like updating by an expression `key1 = key2 + 10`.

### Why are the changes needed?

In Spark 2.4, all builtin v2 streaming sinks support all streaming output modes, and v2 sinks are enabled by default, see https://issues.apache.org/jira/browse/SPARK-22911

It's too risky for 3.0 to go back to v1 sinks, so I propose to add a private trait to fix builtin v2 sinks, to keep backward compatibility.

### Does this PR introduce _any_ user-facing change?

Yes, now all the builtin v2 streaming sinks support all streaming output modes, which is the same as 2.4

### How was this patch tested?

existing tests.

Closes #28523 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-20 03:45:13 +00:00
yi.wu 0fd98abd85 [SPARK-31750][SQL] Eliminate UpCast if child's dataType is DecimalType
### What changes were proposed in this pull request?

Eliminate the `UpCast` if its child's data type is already a decimal type.

### Why are the changes needed?

While deserializing an internal `Decimal` value to an external `BigDecimal` (Java/Scala) value, Spark should also respect the `Decimal`'s precision and scale; otherwise it causes precision loss and looks weird in some cases, e.g.:

```
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
  .write.mode("overwrite")
  .parquet(f.getAbsolutePath)

// can fail
spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
```
```
[info]   org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
```

### Does this PR introduce _any_ user-facing change?

Yes. The cases mentioned above (which cause precision loss) fail before this change but run successfully after it.

### How was this patch tested?

Added tests.

Closes #28572 from Ngone51/fix_encoder.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-20 11:00:58 +09:00
Kent Yao 1f29f1ba58 [SPARK-31684][SQL] Overwrite partition failed with 'WRONG FS' when the target partition does not belong to the same filesystem as the table
### What changes were proposed in this pull request?

With SPARK-18107, we disable the underlying replace (overwrite) and instead do the delete on the Spark side and only do the copy on the Hive side, to bypass the performance issue [HIVE-11940](https://issues.apache.org/jira/browse/HIVE-11940).

Conditionally, if the table location and the partition location do not belong to the same `FileSystem`, we should not disable the Hive overwrite. Otherwise, Hive will use the `FileSystem` instance belonging to the table location to copy files, which will fail in `FileSystem#checkPath`
https://github.com/apache/hive/blob/rel/release-2.3.7/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1657

In this PR, for Hive 2.0.0 and onwards, as [HIVE-11940](https://issues.apache.org/jira/browse/HIVE-11940) has been fixed and there is no performance issue anymore, we should leave the overwrite logic to Hive to avoid failures in `FileSystem#checkPath`.
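A hedged sketch of the condition described above (helper name illustrative): only take the Spark-side delete + Hive-side copy path when the table and partition locations live on the same FileSystem.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// If the URIs differ (e.g. hdfs:// table vs. s3a:// partition), let Hive perform the
// overwrite itself rather than triggering FileSystem#checkPath failures.
def onSameFileSystem(tableLoc: Path, partitionLoc: Path, hadoopConf: Configuration): Boolean =
  tableLoc.getFileSystem(hadoopConf).getUri == partitionLoc.getFileSystem(hadoopConf).getUri
```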

**NOTE THAT** for Hive 2.2.0 and earlier, if the table and partition locations do not belong to the same filesystem, we will still get the same error thrown by the Hive encryption check due to [HIVE-14380](https://issues.apache.org/jira/browse/HIVE-14380), which needs to be fixed in another ticket, SPARK-31675.

### Why are the changes needed?

Bugfix. A logical table can be decoupled from the storage layer and may contain data from remote storage systems.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Currently verified manually. Added benchmark tests.

```sql
-INSERT INTO DYNAMIC                                7742           7918         248          0.0      756044.0       1.0X
-INSERT INTO HYBRID                                 1289           1307          26          0.0      125866.3       6.0X
-INSERT INTO STATIC                                  371            393          38          0.0       36219.4      20.9X
-INSERT OVERWRITE DYNAMIC                           8456           8554         138          0.0      825790.3       0.9X
-INSERT OVERWRITE HYBRID                            1303           1311          12          0.0      127198.4       5.9X
-INSERT OVERWRITE STATIC                             434            447          13          0.0       42373.8      17.8X
+INSERT INTO DYNAMIC                                7382           7456         105          0.0      720904.8       1.0X
+INSERT INTO HYBRID                                 1128           1129           1          0.0      110169.4       6.5X
+INSERT INTO STATIC                                  349            370          39          0.0       34095.4      21.1X
+INSERT OVERWRITE DYNAMIC                           8149           8362         301          0.0      795821.8       0.9X
+INSERT OVERWRITE HYBRID                            1317           1318           2          0.0      128616.7       5.6X
+INSERT OVERWRITE STATIC                             387            408          37          0.0       37804.1      19.1X
```

+ for master
- for this PR

both using hive 2.3.7

Closes #28511 from yaooqinn/SPARK-31684.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-19 14:08:51 +00:00
Ali Afroozeh b9cc31cd95 [SPARK-31721][SQL] Assert optimized is initialized before tracking the planning time
### What changes were proposed in this pull request?
The QueryPlanningTracker in QueryExecution reports a planning time that also includes the optimization time. This happens because the optimizedPlan in QueryExecution is lazy and only initializes when first called. When df.queryExecution.executedPlan is called, the tracker starts recording the planning time and then calls the optimized plan. This causes the planning time to start before optimization and therefore include the optimization time.
This PR fixes this behavior by introducing a method assertOptimized, similar to assertAnalyzed, that explicitly initializes the optimized plan. This method is called before measuring the time for sparkPlan and executedPlan. We call it before sparkPlan because that also counts as planning time.
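A minimal, self-contained sketch of the lazy-initialization issue and the assertOptimized-style fix (not Spark's actual QueryExecution code):

```scala
class PhasedExecution {
  // Stands in for the lazily computed optimized plan.
  lazy val optimizedPlan: String = { Thread.sleep(50); "optimized plan" }

  // Explicitly materialize the optimized plan, mirroring assertAnalyzed.
  def assertOptimized(): Unit = optimizedPlan

  def executedPlanWithTiming(): (String, Long) = {
    assertOptimized()                 // optimization cost no longer lands in the timed block
    val start = System.nanoTime()
    val physical = optimizedPlan + " -> physical plan"
    (physical, System.nanoTime() - start)
  }
}
```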

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #28543 from dbaliafroozeh/AddAssertOptimized.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-05-19 11:10:49 +02:00
Eren Avsarogullari ab4cf49a1c [SPARK-31440][SQL] Improve SQL Rest API
### What changes were proposed in this pull request?
The SQL Rest API exposes query execution metrics as a public API. This PR aims to apply the following improvements to the SQL Rest API by aligning it with the Spark UI.

**Proposed Improvements:**
1- Support Physical Operations and group metrics per physical operation by aligning Spark UI.
2- Support `wholeStageCodegenId` for Physical Operations
3- `nodeId` can be useful for grouping metrics and sorting physical operations (according to execution order) to differentiate same operators (if used multiple times during the same query execution) and their metrics.
4- Filter `empty` metrics by aligning with Spark UI - SQL Tab. Currently, Spark UI does not show empty metrics.
5- Remove line breakers(`\n`) from `metricValue`.
6- `planDescription` can be an `optional` HTTP parameter to avoid network cost where there are especially complex jobs creating big plans.
7- The `metrics` attribute needs to be exposed at the bottom, after `nodes`. In particular, this can be useful for the user where the `nodes` array size is high.
8- `edges` attribute is being exposed to show relationship between `nodes`.
9- Reverse order on `metricDetails` aims to match with Spark UI by supporting Physical Operators' execution order.

### Why are the changes needed?
The proposed improvements provide more useful results (e.g. correlating and grouping physical operations and metrics) and clearer output (e.g. filtering blank metrics, removing line breakers) for the end user.

### Does this PR introduce any user-facing change?
Yes. Please find both current and improved versions of the results as attached for following SQL Rest Endpoint:
```
curl -X GET http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true
```
**Current version:**
https://issues.apache.org/jira/secure/attachment/12999821/current_version.json

**Improved version:**
https://issues.apache.org/jira/secure/attachment/13000621/improved_version.json

### Backward Compatibility
The SQL Rest API will start to be exposed with `Spark 3.0`, and `3.0.0-preview2` (released on 12/23/19) does not cover this API, so if this PR can catch the 3.0 release, there will be no backward compatibility issue.

### How was this patch tested?
1. New Unit tests are added.
2. Also, patch has been tested manually through both **Spark Core** and **History Server** Rest APIs.

Closes #28208 from erenavsarogullari/SPARK-31440.

Authored-by: Eren Avsarogullari <eren.avsarogullari@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-18 23:21:32 -07:00
Max Gekk fc5b90243c [SPARK-31727][SQL] Fix error message of casting timestamp to int in ANSI non-codegen mode
### What changes were proposed in this pull request?
Change the casting of timestamp to int in ANSI non-codegen mode, and make the error message consistent with the error messages in the codegen mode. In particular, casting to int is implemented in the same way as casting to short and byte.

### Why are the changes needed?
1. The error message in the non-codegen mode diverges from the error message in the codegen mode.
2. The error message contains intermediate results that could confuse users.

### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the error message of casting timestamp to int contains intermediate result but after the changes it contains the input values which causes arithmetic overflow.

### How was this patch tested?
By running the modified test suite `AnsiCastSuite`.

Closes #28549 from MaxGekk/fix-error-msg-cast-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-18 05:00:50 +00:00
Jungtaek Lim (HeartSaVioR) d2bec5e265 [SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

This patch effectively reverts SPARK-30098 via below changes:

* Removed the config
* Removed the changes done in parser rule
* Removed the usage of config in tests
  * Removed tests which depend on the config
  * Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098
* Reflect the change into docs (migration doc, create table syntax)

### Why are the changes needed?

SPARK-30098 brought confusion and frustration on using create table DDL query, and we agreed about the bad effect on the change.

Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details.

### Does this PR introduce _any_ user-facing change?

No, compared to Spark 2.4.x. End users tried to experiment with Spark 3.0.0 previews will see the change that the behavior is going back to Spark 2.4.x, but I believe we won't guarantee compatibility in preview releases.

### How was this patch tested?

Existing UTs.

Closes #28517 from HeartSaVioR/revert-SPARK-30098.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:27:23 +00:00
Max Gekk 5539ecfdac [SPARK-31725][CORE][SQL][TESTS] Set America/Los_Angeles time zone and Locale.US in tests by default
### What changes were proposed in this pull request?
Set default time zone and locale in the default constructor of `SparkFunSuite`:
- Default time zone to `America/Los_Angeles`
- Default locale to `Locale.US`
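A hedged sketch of setting these defaults once in a shared test base class (assumed shape, not the exact SparkFunSuite change):

```scala
import java.util.{Locale, TimeZone}

abstract class SharedDefaultsSuiteBase {
  // Applied in the constructor so every suite that extends this base class
  // runs with the same time zone and locale.
  TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
  Locale.setDefault(Locale.US)
}
```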

### Why are the changes needed?
1. To deduplicate code by moving common time zone and locale settings to one place SparkFunSuite
2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running all affected test suites

Closes #28548 from MaxGekk/timezone-settings-SparkFunSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:26:00 +00:00
Kent Yao 1d66085a93 [SPARK-31289][TEST][TEST-HIVE1.2] Eliminate org.apache.spark.sql.hive.thriftserver.CliSuite flakiness
### What changes were proposed in this pull request?

CliSuite seems to be flaky while using metastoreDir per test.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/org.apache.spark.sql.hive.thriftserver/CliSuite/

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/junit/org.apache.spark.sql.hive.thriftserver/CliSuite/history/

According to the error stack trace in the failed test, the test failed to instantiate a hive metastore client because of derby requirements.
```scala
Caused by: ERROR XBM0A:
 The database directory '/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-9249ce52-0a06-42b6-a3df-e6295e880df0' exists.
However, it does not contain the expected 'service.properties' file.
Perhaps Derby was brought down in the middle of creating this database.
You may want to delete this directory and try creating the database again.
```

Derby requires that the metastore dir does not exist, but it does exist, probably because the preceding test case failed to clear the metastore dir.

In this PR, the metastore is shared across the tests of CliSuite, except those that explicitly ask for a separate metastore environment.

### Why are the changes needed?

CliSuite seems to be flaky while using a metastoreDir per test.
To eliminate test flakiness.
### Does this PR introduce any user-facing change?

no
### How was this patch tested?

modified test

Closes #28055 from yaooqinn/clisuite.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-16 09:11:21 +00:00
Yuanjian Li 86bd37f37e [SPARK-31663][SQL] Grouping sets with having clause returns the wrong result
### What changes were proposed in this pull request?
- Resolve the HAVING condition while expanding the GROUPING SETS/CUBE/ROLLUP expressions together in `ResolveGroupingAnalytics`:
    - Change the resolving direction of these operations to top-down.
    - Try resolving the filter's condition as though it were in the aggregate clause, by reusing the function in `ResolveAggregateFunctions`.
    - Push the aggregate expressions into the aggregate which contains the expanded operations.
- Use `UnresolvedHaving` for all HAVING clauses.
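
A hypothetical reproduction sketch of the query shape this targets, a HAVING clause on top of GROUPING SETS; the data, table and column names are made up:

```scala
import org.apache.spark.sql.SparkSession

object GroupingSetsHavingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("having-grouping-sets").getOrCreate()
    import spark.implicits._
    Seq(("a", "x", 5), ("a", "y", 20), ("b", "x", 30)).toDF("k1", "k2", "v").createOrReplaceTempView("t")
    // A HAVING predicate applied to GROUPING SETS results: the class of query
    // whose result this fix corrects.
    spark.sql(
      """SELECT k1, k2, SUM(v) AS total
        |FROM t
        |GROUP BY k1, k2 GROUPING SETS ((k1), (k1, k2))
        |HAVING SUM(v) > 10""".stripMargin).show()
    spark.stop()
  }
}
```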

### Why are the changes needed?
Correctness bug fix. See the demo and analysis in SPARK-31663.

### Does this PR introduce _any_ user-facing change?
Yes, correctness bug fix for HAVING with GROUPING SETS.

### How was this patch tested?
New UTs added.

Closes #28501 from xuanyuanking/SPARK-31663.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-16 04:37:18 +00:00
yi.wu d8b001fa87 [SPARK-31620][SQL] Fix reference binding failure in case a final agg contains a subquery
### What changes were proposed in this pull request?

Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpressions`.

### Why are the changes needed?

When planning aggregates, the partial aggregate uses the agg funcs' `inputAggBufferAttributes` as its output, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105

For final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348

This is usually fine. However, if we copy the agg func somehow after agg planning, as `PlanSubqueries` does, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before the copy.
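
A hypothetical sketch of the query shape described above, with a scalar subquery inside the aggregate expression so that `PlanSubqueries` copies the agg func after planning; it assumes an active `spark` session (e.g. spark-shell) and made-up names:

```scala
// Illustrative reproduction only; the real regression test may differ.
spark.range(100).createOrReplaceTempView("t")
// The scalar subquery inside the aggregate expression triggers the post-planning copy.
spark.sql("SELECT SUM(IF(id > (SELECT AVG(id) FROM t), id, 0)) AS s FROM t").show()
```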

Note that, `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`.

### Does this PR introduce _any_ user-facing change?

Yes, users who previously hit an error can now run the query successfully after this change.

### How was this patch tested?

Added a regression test.

Closes #28496 from Ngone51/spark-31620.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-15 15:36:28 +00:00
Dongjoon Hyun 5d90886523
[SPARK-31716][SQL] Use fallback versions in HiveExternalCatalogVersionsSuite
### What changes were proposed in this pull request?

This PR aims to provide fallback versions instead of `Nil` in `HiveExternalCatalogVersionsSuite`. The provided fallback Spark versions let the Jenkins jobs recover instead of failing.
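
A hedged sketch of the fallback idea; the function shape and the version list are illustrative, not the suite's actual values:

```scala
// If fetching the latest released versions from the mirrors fails, fall back to a
// hard-coded list instead of returning Nil and aborting the whole suite.
def releasedVersionsToTest(fetchFromMirrors: => Seq[String]): Seq[String] =
  try fetchFromMirrors
  catch { case _: Exception => Seq("2.4.5", "2.3.4") } // illustrative fallback versions
```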

### Why are the changes needed?

Currently, `HiveExternalCatalogVersionsSuite` is aborted in all Jenkins jobs except JDK11 Jenkins jobs which don't have old Spark releases supporting JDK11.
```
HiveExternalCatalogVersionsSuite:
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
  Exception encountered when invoking run on a nested suite - Fail to get the lates Spark versions to test. (HiveExternalCatalogVersionsSuite.scala:180)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins

Closes #28536 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-15 00:30:25 -07:00
Kent Yao 503faa24d3 [SPARK-31715][SQL][TEST] Fix flaky SparkSQLEnvSuite that sometimes violates the single-Derby-instance restriction
### What changes were proposed in this pull request?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122622/testReport/junit/org.apache.spark.sql.hive.thriftserver/SparkSQLEnvSuite/SPARK_29604_external_listeners_should_be_initialized_with_Spark_classloader/history/?start=25

According to the test report history of SparkSQLEnvSuite, this test fails frequently, which is caused by the single Derby instance restriction.

```java
Caused by: sbt.ForkMain$ForkError: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database /home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/metastore_db.
	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
	at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source)
	... 138 more
```

This PR adds a separate directory to locate the metastore_db for this test, which runs in a dedicated JVM.

Besides, it disables the UI because of the potential race on `spark.ui.port`, which may also make the test case flaky.
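
A minimal sketch of the two mitigations; `javax.jdo.option.ConnectionURL` and `spark.ui.enabled` are real configuration keys, but the wiring shown here is illustrative rather than the suite's actual setup code:

```scala
import java.nio.file.Files

// Dedicated Derby location for this JVM, so no other suite can have booted it already.
val metastoreDir = Files.createTempDirectory("sparksqlenv-metastore").toAbsolutePath

val testConfs = Map(
  // Point the embedded Derby metastore at the dedicated directory.
  "javax.jdo.option.ConnectionURL" ->
    s"jdbc:derby:;databaseName=$metastoreDir/metastore_db;create=true",
  // Disable the UI to avoid the race on spark.ui.port.
  "spark.ui.enabled" -> "false")
```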

### Why are the changes needed?
test fix

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
SparkSQLEnvSuite itself.

Closes #28537 from yaooqinn/SPARK-31715.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-15 06:36:34 +00:00
Max Gekk c7ce37dfa7 [SPARK-31712][SQL][TESTS] Check casting timestamps before the epoch to Byte/Short/Int/Long types
### What changes were proposed in this pull request?
Added tests to check casting timestamps before 1970-01-01 00:00:00Z to ByteType, ShortType, IntegerType and LongType in ANSI and non-ANSI modes.
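
A hypothetical illustration of the class of check being added, assuming an active `spark` session; the timestamp literal is made up:

```scala
// A pre-1970 timestamp casts to a negative number of seconds since the epoch.
spark.sql("SELECT CAST(TIMESTAMP '1961-04-12 00:00:00' AS LONG) AS secs").show()
// With spark.sql.ansi.enabled=true, a cast that overflows the target type (e.g. BYTE)
// is expected to raise an error instead of silently truncating.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST(TIMESTAMP '1961-04-12 00:00:00' AS BYTE) AS b").show()
```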

### Why are the changes needed?
To improve test coverage and prevent errors while modifying the CAST expression code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ ./build/sbt "test:testOnly *CastSuite"
```

Closes #28531 from MaxGekk/test-cast-timestamp-to-byte.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-15 04:24:58 +00:00
Dongjoon Hyun bbb62c5405
Revert "[SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener"
This reverts commit 6994c64efd.
2020-05-14 12:01:03 -07:00
sunke.03 ddbce4edee [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
### What changes were proposed in this pull request?

This PR tries to fix a bug in `org.apache.spark.sql.hive.execution.ScriptTransformationExec`. This bug appears in our online cluster. `ScriptTransformationExec` should throw an exception when the user's Python script contains a parse error, but the current implementation may miss this failure case.

### Why are the changes needed?

When the user's Python script contains a parse error, there is no output, so `scriptOutputReader.next(scriptOutputWritable) <= 0` matches and we use `checkFailureAndPropagate()` to check the `proc`. But the `proc` may still be alive and `writerThread.exception` is not defined, so `checkFailureAndPropagate` cannot detect this failure case. In the end, the Spark SQL job runs successfully and returns no result, when in fact it should fail and show the exception properly.

For example, the erroneous Python script is shown below.
```python
# encoding: utf8
import unknow_module
import sys

for line in sys.stdin:
    print line
```
The bug can be reproduced by running the following code in our cluster.
```scala
spark.range(100*100).toDF("index").createOrReplaceTempView("test")
spark.sql("select TRANSFORM(index) USING 'python error_python.py' as new_index from test").collect.foreach(println)
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing UT

Closes #27724 from slamke/transformation.

Authored-by: sunke.03 <sunke.03@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-14 13:55:24 +00:00
Karuppayya Rajendran 72601460ad
[SPARK-31692][SQL] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory
### What changes were proposed in this pull request?
Pass Hadoop confs specified via Spark confs to the URLStreamHandlerFactory.

### Why are the changes needed?

**BEFORE**
```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5793cd84

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@22846025

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem@5a930c03
```

**AFTER**
```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5c24a636

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream@2ba8f528

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = LocalFS

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass
res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem
```
The type of the FileSystem object created (see the last statement in each snippet above) differs between the two cases, which should not have happened.
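
A hedged sketch of the general mechanism involved, copying `spark.hadoop.*` entries from the SparkConf into the Hadoop `Configuration` that gets handed to the URL stream handler factory; this is illustrative, not the PR's actual code path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

def toHadoopConf(sparkConf: SparkConf): Configuration = {
  val hadoopConf = new Configuration()
  // Strip the spark.hadoop. prefix, e.g. spark.hadoop.fs.file.impl becomes fs.file.impl.
  sparkConf.getAll.foreach { case (k, v) =>
    if (k.startsWith("spark.hadoop.")) {
      hadoopConf.set(k.stripPrefix("spark.hadoop."), v)
    }
  }
  hadoopConf
}
```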

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested locally.
Added Unit test

Closes #28516 from karuppayya/SPARK-31692.

Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-13 23:18:38 -07:00
Wenchen Fan fd2d55c991 [SPARK-31405][SQL] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files
### What changes were proposed in this pull request?

When reading/writing datetime values from before the rebase switch day from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly choose whether to rebase or not.
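
A hypothetical illustration of the opt-in, assuming an active `spark` session; the config key below is the one documented for Spark 3.0.0, so treat its exact name at the time of this commit as an assumption:

```scala
// Explicitly choose CORRECTED (or LEGACY) rebase mode instead of the default, which
// raises an error for ancient datetime values.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
spark.sql("SELECT DATE '1001-01-01' AS d").write.mode("overwrite").parquet("/tmp/ancient_dates")
```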

### Why are the changes needed?

Rebasing and not rebasing have different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception as it only affects ancient datetime values.

### Does this PR introduce _any_ user-facing change?

Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message to ask setting a config.

### How was this patch tested?

updated tests

Closes #28477 from cloud-fan/rebase.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-14 12:32:40 +09:00
Ali Smesseim 6994c64efd [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
### What changes were proposed in this pull request?

The update methods in HiveThriftServer2Listener now check whether the given operation/session ID actually exists in `sessionList` and `executionList` respectively. This prevents NullPointerExceptions when the operation or session ID is unknown; instead, a warning is written to the log.
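
A hedged sketch of the defensive pattern; class, field and method names are illustrative, not the listener's real API:

```scala
import scala.collection.mutable

class SafeThriftServerListener {
  // Stand-in for the listener's bookkeeping; the real class keeps richer info objects.
  private val sessionList = mutable.Map.empty[String, String]

  def onSessionClosed(sessionId: String): Unit = sessionList.get(sessionId) match {
    case Some(_) => sessionList.remove(sessionId)
    case None =>
      // Unknown ID: warn and ignore instead of dereferencing a missing entry.
      Console.err.println(s"WARN: unknown session id $sessionId, ignoring event")
  }
}
```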

Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed.

### Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this interferes with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown when closing an operation, none of the following operations are closed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer.

Authored-by: Ali Smesseim <ali.smesseim@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-12 09:14:34 -07:00
Max Gekk a3fafddf39 [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
### What changes were proposed in this pull request?
Generates `java.time.LocalDate`/`java.time.Instant` for DateType/TimestampType in `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`.
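
A minimal sketch of the idea; the method names and value ranges are illustrative, not `RandomDataGenerator`'s actual code:

```scala
import java.time.{Instant, LocalDate}
import scala.util.Random

// When the Java 8 API flag is on, produce java.time values directly instead of
// java.sql.Date / java.sql.Timestamp.
def randomLocalDate(rand: Random): LocalDate =
  LocalDate.ofEpochDay(rand.nextInt(200 * 365).toLong - 100 * 365)

def randomInstant(rand: Random): Instant =
  Instant.ofEpochSecond(rand.nextInt().toLong, rand.nextInt(1000000000).toLong)
```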

### Why are the changes needed?
To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`.

Closes #28502 from MaxGekk/random-java8-datetime.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 14:05:31 +00:00
Javier Fuentes 178ca961fe [SPARK-31102][SQL] Spark-sql fails to parse when contains comment
### What changes were proposed in this pull request?

This PR introduces a change that sets the insideComment flag back to false on a newline, fixing the issue introduced by SPARK-30049.

### Why are the changes needed?

Previously on SPARK-30049 a comment containing an unclosed quote produced the following issue:
```
spark-sql> SELECT 1 -- someone's comment here
         > ;
Error in query:
extraneous input ';' expecting <EOF>(line 2, pos 0)

== SQL ==
SELECT 1 -- someone's comment here
;
^^^
```

This was caused by the splitSemiColon method having no flag for ignoring quotes inside comment sections. SPARK-30049 added that flag and fixed the issue, but introduced the following problem:
```
spark-sql> select
         >   1,
         >   -- two
         >   2;
Error in query:
mismatched input '<EOF>' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', ...}(line 3, pos 2)
== SQL ==
select
  1,
--^^^
```
This issue is caused by a missing reset of the insideComment flag on a newline.
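
A simplified sketch of the scanner logic and the missing reset, restricted to single quotes and line comments; this is illustrative and not the actual `splitSemiColon` implementation:

```scala
def splitStatements(text: String): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var insideComment = false
  var insideQuote = false
  var i = 0
  while (i < text.length) {
    val c = text(i)
    if (insideComment) {
      if (c == '\n') insideComment = false // the fix: a line comment ends at the newline
      current.append(c)
    } else if (c == '\'') {
      insideQuote = !insideQuote // quotes toggle only outside comments
      current.append(c)
    } else if (!insideQuote && c == '-' && i + 1 < text.length && text(i + 1) == '-') {
      insideComment = true // quotes inside the comment are ignored from here on
      current.append(c)
    } else if (!insideQuote && c == ';') {
      parts += current.toString.trim // statement boundary
      current.clear()
    } else {
      current.append(c)
    }
    i += 1
  }
  if (current.toString.trim.nonEmpty) parts += current.toString.trim
  parts.toSeq
}
```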

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

- For the previous tests using line continuation (`\`), a line-continuation rule was added to the SqlBase.g4 file to bring that functionality into the SQL context.
- A new test for inline comments was added.

Closes #27920 from javierivanov/SPARK-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 13:46:24 +00:00