Commit graph

7587 commits

Author SHA1 Message Date
Maxim Gekk 9b55722161 [SPARK-27031][SQL] Avoid double formatting in timestampToString
## What changes were proposed in this pull request?

Removed unnecessary conversion of microseconds in `DateTimeUtils.timestampToString` to `java.sql.Timestamp` which aims to output fraction of seconds by casting it to string. This was replaced by special `TimestampFormatter` which appends the fraction formatter to `DateTimeFormatterBuilder`: `appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)`. The former one means trailing zeros in second's fraction should be truncated while formatting.

## How was this patch tested?

By existing test suites like `CastSuite`, `DateTimeUtilsSuite`, `JDBCSuite`, and by new test in `TimestampFormatterSuite`.

Closes #23936 from MaxGekk/timestamp-to-string.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-06 08:26:59 -06:00
Liang-Chi Hsieh 83857496e5 [SPARK-27043][SQL] Add ORC nested schema pruning benchmarks
## What changes were proposed in this pull request?

We have benchmark of nested schema pruning, but only for Parquet. This adds similar benchmark for ORC. This is used with nested schema pruning of ORC.

## How was this patch tested?

Added test.

Closes #23955 from viirya/orc-nested-schema-pruning-benchmark.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-05 11:12:57 -08:00
Takeshi Yamamuro 4490fd0ff0 [SPARK-27001][SQL][FOLLOW-UP] Drop Serializable in WalkedTypePath
## What changes were proposed in this pull request?
This pr tried to drop `Serializable` in `WalkedTypePath`.

## How was this patch tested?
Pass Jenkins.

Closes #23973 from maropu/SPARK-27001-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-05 23:05:50 +08:00
Yuming Wang 940626b724 [SPARK-15095][FOLLOW-UP][SQL] Remove HiveSessionHook related code from ThriftServer
## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/12881 removed `HiveSessionHook`. But there are still some code related to `HiveSessionHook`.
This PR removes all `HiveSessionHook` related code.

## How was this patch tested?

manual tests

Closes #23957 from wangyum/SPARK-15095.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-05 07:42:25 -06:00
Anton Okolnychyi 0c23a39384 [SPARK-26205][SQL] Optimize InSet Expression for bytes, shorts, ints, dates
## What changes were proposed in this pull request?

This PR optimizes `InSet` expressions for byte, short, integer, date types. It is a follow-up on PR #21442 from dbtsai.

`In` expressions are compiled into a sequence of if-else statements, which results in O\(n\) time complexity. `InSet` is an optimized version of `In`, which is supposed to improve the performance if all values are literals and the number of elements is big enough. However, `InSet` actually worsens the performance in many cases due to various reasons.

The main idea of this PR is to use Java `switch` statements to significantly improve the performance of `InSet` expressions for bytes, shorts, ints, dates. All `switch` statements are compiled into `tableswitch` and `lookupswitch` bytecode instructions. We will have O\(1\) time complexity if our case values are compact and `tableswitch` can be used. Otherwise, `lookupswitch` will give us O\(log n\).

Locally, I tried Spark `OpenHashSet` and primitive collections from `fastutils` in order to solve the boxing issue in `InSet`. Both options significantly decreased the memory consumption and `fastutils` improved the time compared to `HashSet` from Scala. However, the switch-based approach was still more than two times faster even on 500+ non-compact elements.

I also noticed that applying the switch-based approach on less than 10 elements gives a relatively minor improvement compared to the if-else approach. Therefore, I placed the switch-based logic into `InSet` and added a new config to track when it is applied. Even if we migrate to primitive collections at some point, the switch logic will be still faster unless the number of elements is really big. Another option is to have a separate `InSwitch` expression. However, this would mean we need to modify other places (e.g., `DataSourceStrategy`).

See [here](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10) and [here](https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch) for more information.

This PR does not cover long values as Java `switch` statements cannot be used on them. However, we can have a follow-up PR with an approach similar to binary search.

## How was this patch tested?

There are new tests that verify the logic of the proposed optimization.

The performance was evaluated using existing benchmarks. This PR was also tested on an EC2 instance (OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 4.14.77-70.59.amzn1.x86_64, Intel(R) Xeon(R) CPU E5-2686 v4  2.30GHz).

## Notes

- [This link](http://hg.openjdk.java.net/jdk8/jdk8/langtools/file/30db5e0aaf83/src/share/classes/com/sun/tools/javac/jvm/Gen.java#l1153) contains source code that decides between `tableswitch` and `lookupswitch`. The logic was re-used in the benchmarks. See the `isLookupSwitch` method.

Closes #23171 from aokolnychyi/spark-26205.

Lead-authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-04 15:40:04 -08:00
Sean Owen 0deebd3820 [SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8
## What changes were proposed in this pull request?

Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8 as they use Hadoop's implementation underneath. I think these are all the places that this needs a mention in the user-facing docs.

## How was this patch tested?

Doc tests.

Closes #23962 from srowen/SPARK-26016.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-05 08:03:39 +09:00
Yuming Wang e64c110d21 [MINOR][BUILD] Remove useless add-source
## What changes were proposed in this pull request?

The source directory(`sql/hive-thriftserver/v${hive.version.short}/src/main/scala`) removed from  [SPARK-6909](https://issues.apache.org/jira/browse/SPARK-6909) and [SPARK-7850](https://issues.apache.org/jira/browse/SPARK-7850). We should also remove the `add-source`.
It seems that removed this `add-source` makes it easier to import `src/gen` source code in IDEA:
![image](https://user-images.githubusercontent.com/5399861/53715396-a967b380-3e8c-11e9-8aa1-c59d819b4c06.png)
![image](https://user-images.githubusercontent.com/5399861/53715402-acfb3a80-3e8c-11e9-8aa9-a716931160c6.png)

## How was this patch tested?

manual tests

Closes #23949 from wangyum/SPARK-7850.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-04 08:43:17 -06:00
Gengliang Wang 68fa601d62 [SPARK-27040][SQL] Avoid using unnecessary JoinRow in FileFormat
## What changes were proposed in this pull request?

When reading files with empty partition columns, we can avoid using JoinRow.

## How was this patch tested?

Existing unit tests.

Closes #23953 from gengliangwang/avoidJoinRow.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-04 22:26:11 +08:00
Takeshi Yamamuro 68fbbbea4e [SPARK-26965][SQL] Makes ElementAt nullability more precise for array cases
## What changes were proposed in this pull request?
In master, `ElementAt` nullable is always true;
be1cadf16d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (L1977)

But, If input is an array and foldable, we could make its nullability more precise.
This fix is based on  SPARK-26637(#23566).

## How was this patch tested?
Added tests in `CollectionExpressionsSuite`.

Closes #23867 from maropu/SPARK-26965.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-04 21:27:18 +08:00
Dilip Biswal ad4823c99d [SPARK-19712][SQL] Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
## What changes were proposed in this pull request?
This PR adds support for pushing down LeftSemi and LeftAnti joins below operators such as Project, Aggregate, Window, Union etc.  This is the initial piece of work that will be needed for
the subsequent work of moving the subquery rewrites to the beginning of optimization phase.

The larger  PR is [here](https://github.com/apache/spark/pull/23211) . This PR addresses the comment at [link](https://github.com/apache/spark/pull/23211#issuecomment-445705922).
## How was this patch tested?
Added a new test suite LeftSemiAntiJoinPushDownSuite.

Closes #23750 from dilipbiswal/SPARK-19712-pushleftsemi.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-04 19:09:24 +08:00
Wenchen Fan 382d5a82b0 [SPARK-26956][SS] remove streaming output mode from data source v2 APIs
## What changes were proposed in this pull request?

Similar to `SaveMode`, we should remove streaming `OutputMode` from data source v2 API, and use operations that has clear semantic.

The changes are:
1. append mode: create `StreamingWrite` directly. By default, the `WriteBuilder` will create `Write` to append data.
2. complete mode: call `SupportsTruncate#truncate`. Complete mode means truncating all the old data and appending new data of the current epoch. `SupportsTruncate` has exactly the same semantic.
3. update mode: fail. The current streaming framework can't propagate the update keys, so v2 sinks are not able to implement update mode. In the future we can introduce a `SupportsUpdate` trait.

The behavior changes:
1. all the v2 sinks(foreach, console, memory, kafka, noop) don't support update mode. The fact is, previously all the v2 sinks implement the update mode wrong. None of them can really support it.
2. kafka sink doesn't support complete mode. The fact is, the kafka sink can only append data.

## How was this patch tested?

existing tests

Closes #23859 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-03-03 22:20:31 -08:00
Peter Toth 82820f8e25 [SPARK-26893][SQL] Allow partition pruning with subquery filters on file source
## What changes were proposed in this pull request?

This PR introduces leveraging of subquery filters for partition pruning in file source.

Subquery expressions are not allowed to be used for partition pruning in `FileSourceStrategy` now, instead a `FilterExec` is added around the `FileSourceScanExec` to do the job.
This PR optimizes the process by allowing partition pruning subquery expressions as partition filters.

## How was this patch tested?

Added new UT and run existing UTs especially SPARK-25482 and SPARK-24085 related ones.

Closes #23802 from peter-toth/SPARK-26893.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-04 13:38:22 +08:00
Sean Owen b76f262fc8 [SPARK-27032][TEST] De-flake org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision
## What changes were proposed in this pull request?

Reduce work in HDFSMetadataLogSuite test to possibly de-flake it.

## How was this patch tested?

Existing tests

Closes #23937 from srowen/SPARK-27032.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-04 13:36:41 +09:00
Jungtaek Lim (HeartSaVioR) 34f606678a [SPARK-27001][SQL] Refactor "serializerFor" method between ScalaReflection and JavaTypeInference
## What changes were proposed in this pull request?

This patch proposes refactoring `serializerFor` method between `ScalaReflection` and `JavaTypeInference`, being consistent with what we refactored for `deserializerFor` in #23854.

This patch also extracts the logic on recording walk type path since the logic is duplicated across `serializerFor` and `deserializerFor` with `ScalaReflection` and `JavaTypeInference`.

## How was this patch tested?

Existing tests.

Closes #23908 from HeartSaVioR/SPARK-27001.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-04 10:45:48 +08:00
Dilip Biswal 04ad559ab6 [SPARK-27016][SQL][BUILD] Treat all antlr warnings as errors while generating parser from the sql grammar file.
## What changes were proposed in this pull request?
Use the maven plugin option `treatWarningsAsErrors` to make sure the warnings are treated as errors while generating the parser file. In the absence of it, we may inadvertently introducing problems while making grammar changes.  Please refer to [PR-23897](https://github.com/apache/spark/pull/23897) to know more about the context.
## How was this patch tested?
We can use two ways to build Spark 1) sbt 2) Maven
This PR, we made a change to configure the maven antlr plugin to include a parameter that makes antlr4 report error on warning. However, when spark is built using sbt, we use the sbt antlr plugin which does not allow us to pass this additional compilation flag.  More info on sbt-antlr plugin can be found at [link](https://github.com/ihji/sbt-antlr4/blob/master/src/main/scala/com/simplytyped/Antlr4Plugin.scala)
In summary, this fix only applicable when we use maven to build.

Closes #23925 from dilipbiswal/antlr_fix.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-03 10:02:25 -06:00
Dilip Biswal 4a486d6716 [SPARK-26982][SQL] Enhance describe framework to describe the output of a query.
## What changes were proposed in this pull request?
Currently we can use `df.printSchema` to discover the schema information for a query. We should have a way to describe the output schema of a query using SQL interface.

Example:

DESCRIBE SELECT * FROM desc_table
DESCRIBE QUERY SELECT * FROM desc_table
```SQL

spark-sql> create table desc_table (c1 int comment 'c1-comment', c2 decimal comment 'c2-comment', c3 string);

spark-sql> desc select * from desc_table;
c1	int	        c1-comment
c2	decimal(10,0)	c2-comment
c3	string	        NULL

```
## How was this patch tested?
Added a new test under SQLQueryTestSuite and SparkSqlParserSuite

Closes #23883 from dilipbiswal/dkb_describe_query.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-02 11:21:23 +08:00
Dilip Biswal 5fd62ca65a [SPARK-26215][SQL][FOLLOW-UP][MINOR] Fix the warning from ANTR4
## What changes were proposed in this pull request?
I see the following new warning from ANTR4 after SPARK-26215 after it added `SCHEMA` keyword in the reserved/unreserved list. This is a minor PR to cleanup the warning.

```
WARNING] warning(125): org/apache/spark/sql/catalyst/parser/SqlBase.g4:784:90: implicit definition of token SCHEMA in parser
[WARNING] .../apache/spark/org/apache/spark/sql/catalyst/parser/SqlBase.g4 [784:90]: implicit definition of token SCHEMA in parser
```
## How was this patch tested?
Manually built catalyst after the fix to verify

Closes #23897 from dilipbiswal/minor_parser_token.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-01 12:34:15 -08:00
liuxian 02bbe977ab [MINOR] Remove unnecessary gets when getting a value from map.
## What changes were proposed in this pull request?

Redundant `get`  when getting a value from `Map` given a key.

## How was this patch tested?

N/A

Closes #23901 from 10110346/removegetfrommap.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 11:48:07 -06:00
Maxim Gekk 8e5f9995ca [SPARK-27008][SQL] Support java.time.LocalDate as an external type of DateType
## What changes were proposed in this pull request?

In the PR, I propose to add new Catalyst type converter for `DateType`. It should be able to convert `java.time.LocalDate` to/from `DateType`.

Main motivations for the changes:
- Smoothly support Java 8 time API
- Avoid inconsistency of calendars used inside of Spark 3.0 (Proleptic Gregorian calendar) and `java.sql.Date` (hybrid calendar - Julian + Gregorian).
- Make conversion independent from current system timezone.

By default, Spark converts values of `DateType` to `java.sql.Date` instances but the SQL config `spark.sql.datetime.java8API.enabled` can change the behavior. If it is set to `true`, Spark uses `java.time.LocalDate` as external type for `DateType`.

## How was this patch tested?

Added new testes to `CatalystTypeConvertersSuite` to check conversion of `DateType` to/from `java.time.LocalDate`, `JavaUDFSuite`/ `UDFSuite` to test usage of `LocalDate` type in Scala/Java UDFs.

Closes #23913 from MaxGekk/date-localdate.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-01 11:04:28 +08:00
Gabor Somogyi c4bbfd177b [SPARK-24063][SS] Add maximum epoch queue threshold for ContinuousExecution
## What changes were proposed in this pull request?

Continuous processing is waiting on epochs which are not yet complete (for example one partition is not making progress) and stores pending items in queues. These queues are unbounded and can consume up all the memory easily. In this PR I've added `spark.sql.streaming.continuous.epochBacklogQueueSize` configuration possibility to make them bounded. If the related threshold reached then the query will stop with `IllegalStateException`.

## How was this patch tested?

Existing + additional unit tests.

Closes #23156 from gaborgsomogyi/SPARK-24063.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-27 09:52:43 -08:00
liuxian 7912dbb88f [MINOR] Simplify boolean expression
## What changes were proposed in this pull request?

Comparing whether Boolean expression is equal to true is redundant
For example:
The datatype of `a` is boolean.
Before:
if (a == true)
After:
if (a)

## How was this patch tested?
N/A

Closes #23884 from 10110346/simplifyboolean.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-27 08:38:00 -06:00
Maxim Gekk b0450d07bd [SPARK-26902][SQL] Support java.time.Instant as an external type of TimestampType
## What changes were proposed in this pull request?

In the PR, I propose to add new Catalyst type converter for `TimestampType`. It should be able to convert `java.time.Instant` to/from `TimestampType`.

Main motivations for the changes:
- Smoothly support Java 8 time API
- Avoid inconsistency of calendars used inside of Spark 3.0 (Proleptic Gregorian calendar) and `java.sql.Timestamp` (hybrid calendar - Julian + Gregorian).
- Make conversion independent from current system timezone.

By default, Spark converts values of `TimestampType` to `java.sql.Timestamp` instances but the SQL config `spark.sql.catalyst.timestampType` can change the behavior. It accepts two values `Timestamp` (default) and `Instant`. If the former one is set, Spark returns `java.time.Instant` instances for timestamp values.

## How was this patch tested?

Added new testes to `CatalystTypeConvertersSuite` to check conversion of `TimestampType` to/from `java.time.Instant`.

Closes #23811 from MaxGekk/timestamp-instant.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 21:05:19 +08:00
Gengliang Wang 95e55720d4 [SPARK-26990][SQL] FileIndex: use user specified field names if possible
## What changes were proposed in this pull request?

WIth the following file structure:
```
/tmp/data
└── a=5
```

In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```

While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```

We can see that the partition column name `a` is different from `A` as user specifed. This PR is to fix the case and make it more user-friendly.

## How was this patch tested?

Unit test

Closes #23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 14:38:35 +08:00
Jungtaek Lim (HeartSaVioR) dea18ee85b [SPARK-22000][SQL] Address missing Upcast in JavaTypeInference.deserializerFor
## What changes were proposed in this pull request?

Spark expects the type of column and the type of matching field is same when deserializing to Object, but Spark hasn't actually restrict it (at least for Java bean encoder) and some users just do it and experience undefined behavior (in SPARK-22000, Spark throws compilation failure on generated code because it calls `.toString()` against primitive type.

It doesn't produce error in Scala side because `ScalaReflection.deserializerFor` properly inject Upcast if necessary. This patch proposes applying same thing to `JavaTypeInference.deserializerFor` as well.

Credit to srowen, maropu, and cloud-fan since they provided various approaches to solve this.

## How was this patch tested?

Added UT which query is slightly modified based on sample code in attachment on JIRA issue.

Closes #23854 from HeartSaVioR/SPARK-22000.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 13:47:20 +08:00
Hyukjin Kwon 88bc481b9e [SPARK-26830][SQL][R] Vectorized R dapply() implementation
## What changes were proposed in this pull request?

This PR targets to add vectorized `dapply()` in R, Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
rdf <- read.csv("500000.csv")
df <- cache(createDataFrame(rdf))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  count(cache(dapply(df, function(rdf) { rdf }, schema(df))))
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 13.42037 mins
```

```
Time difference of 30.64156 secs
```

The performance improvement was around **2627%**.

### Limitations

- For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values.

- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. It needs improvement later.

## How was this patch tested?

Unit tests were added, and manually tested.

Closes #23787 from HyukjinKwon/SPARK-26830-1.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-27 14:29:58 +09:00
Liang-Chi Hsieh 0f2c0b53e8 [SPARK-26837][SQL] Pruning nested fields from object serializers
## What changes were proposed in this pull request?

In SPARK-26619, we make change to prune unnecessary individual serializers when serializing objects. This is extension to SPARK-26619. We can further prune nested fields from object serializers if they are not used.

For example, in following query, we only use one field in a struct column:

```scala
val data = Seq((("a", 1), 1), (("b", 2), 2), (("c", 3), 3))
val df = data.toDS().map(t => (t._1, t._2 + 1)).select("_1._1")
```

So, instead of having a serializer to create a two fields struct, we can prune unnecessary field from it. This is what this PR proposes to do.

In order to make this change conservative and safer, a SQL config is added to control it. It is disabled by default.

TODO: Support to prune nested fields inside MapType's key and value.

## How was this patch tested?

Added tests.

Closes #23740 from viirya/nested-pruning-serializer-2.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 12:45:24 +08:00
Maxim Gekk a2a41b7bf2 [SPARK-26978][CORE][SQL] Avoid magic time constants
## What changes were proposed in this pull request?

In the PR, I propose to refactor existing code related to date/time conversions, and replace constants like `1000` and `1000000` by `DateTimeUtils` constants and transformation functions from `java.util.concurrent.TimeUnit._`.

## How was this patch tested?

The changes are tested by existing test suites.

Closes #23878 from MaxGekk/magic-time-constants.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-26 09:08:12 -06:00
Xianyang Liu bc03c8b3fa [SPARK-26952][SQL] Row count statics should respect the data reported by data source
## What changes were proposed in this pull request?

In data source v2, if the data source scan implemented `SupportsReportStatistics`. `DataSourceV2Relation` should respect the row count reported by the data source.

## How was this patch tested?

New UT test.

Closes #23853 from ConeyLiu/report-row-count.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-26 14:10:54 +08:00
Gengliang Wang 4baa2d4449 [SPARK-26673][FOLLOWUP][SQL] File Source V2: check existence of output path before delete it
## What changes were proposed in this pull request?
This is a followup PR to resolve comment: https://github.com/apache/spark/pull/23601#pullrequestreview-207101115

When Spark writes DataFrame with "overwrite" mode, it deletes the output path before actual writes. To safely handle the case that the output path doesn't exist,  it is suggested to follow the V1 code by checking the existence.

## How was this patch tested?

Apply https://github.com/apache/spark/pull/23836 and run unit tests

Closes #23889 from gengliangwang/checkFileBeforeOverwrite.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-02-25 16:20:06 -08:00
Douglas R Colkitt faa61980c4 [SPARK-26935][SQL] Skip DataFrameReader's CSV first line scan when not used
Prior to this patch, all DataFrameReader.csv() calls would collect the first
line from the CSV input iterator. This is done to allow schema inference from the
header row.

However when schema is already specified this is a wasteful operation. It results
in an unncessary compute step on the first partition. This can be expensive if
the CSV itself is expensive to generate (e.g. it's the product of a long-running
external pipe()).

This patch short-circuits the first-line collection in DataFrameReader.csv() when
schema is specified. Thereby improving CSV read performance in certain cases.

## What changes were proposed in this pull request?

Short-circuiting DataFrameReader.csv() first-line read when schema is user-specified.

## How was this patch tested?

Compiled and tested against several CSV datasets.

Closes #23830 from Mister-Meeseeks/master.

Authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-23 14:00:57 -06:00
Maxim Gekk 75c48ac36d [SPARK-26908][SQL] Fix DateTimeUtils.toMillis and millisToDays
## What changes were proposed in this pull request?

The `DateTimeUtils.toMillis` can produce inaccurate result for some negative values (timestamps before epoch). The error can be around 1ms. In the PR, I propose to use `Math.floorDiv` in casting microseconds to milliseconds, and milliseconds to days since epoch.

## How was this patch tested?

Added new test to `DateTimeUtilsSuite`, and tested by `CastSuite` as well.

Closes #23815 from MaxGekk/micros-to-millis.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-23 11:35:11 -06:00
Maxim Gekk d0f2fd05e1 [SPARK-26903][SQL] Remove the TimeZone cache
## What changes were proposed in this pull request?

In the PR, I propose to convert time zone string to `TimeZone` by converting it to `ZoneId` which uses `ZoneOffset` internally. The `ZoneOffset` class of JDK 8 has a cache already: http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/time/ZoneOffset.java#l205 . In this way, there is no need to support cache of time zones in Spark.

The PR removes `computedTimeZones` from `DateTimeUtils`, and uses `ZoneId.of` to convert time zone id string to `ZoneId` and to `TimeZone` at the end.

## How was this patch tested?

The changes were tested by

Closes #23812 from MaxGekk/timezone-cache.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-23 09:44:22 -06:00
Takeshi Yamamuro 967e4cb011 [SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard
## What changes were proposed in this pull request?
This pr targeted to define reserved/non-reserved keywords for Spark SQL based on the ANSI SQL standards and the other database-like systems (e.g., PostgreSQL). We assume that they basically follow the ANSI SQL-2011 standard, but it is slightly different between each other. Therefore, this pr documented all the keywords in `docs/sql-reserved-and-non-reserved-key-words.md`.

NOTE: This pr only added a small set of keywords as reserved ones and these keywords are reserved in all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011, and SQL-2016) and PostgreSQL. This is because there is room to discuss which keyword should be reserved or not, .e.g., interval units (day, hour, minute, second, ...) are reserved in the ANSI SQL standards though, they are not reserved in PostgreSQL. Therefore, we need more researches about the other database-like systems (e.g., Oracle Databases, DB2, SQL server) in follow-up activities.

References:
 - The reserved/non-reserved SQL keywords in the ANSI SQL standards: https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf
 - SQL Key Words in PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.

Closes #23259 from maropu/SPARK-26215-WIP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-02-23 08:38:47 +09:00
Sean Owen 95bb01282c [SPARK-26851][SQL][FOLLOWUP] Fix cachedColumnBuffers field for Scala 2.11 build
## What changes were proposed in this pull request?

Per https://github.com/apache/spark/pull/23768/files#r259083019 the last change to this line here caused the 2.11 build to fail. It's worked around by making `_cachedColumnBuffers` a field, as it was never set by callers to anything other than its default of null.

## How was this patch tested?

Existing tests.

Closes #23864 from srowen/SPARK-26851.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-02-22 15:22:52 +09:00
nandorKollar 066379783a [SPARK-26930][SQL] Tests in ParquetFilterSuite don't verify filter class
## What changes were proposed in this pull request?

Add assert to verify predicate class in ParquetFilterSuite

## How was this patch tested?

Ran ParquetFilterSuite, tests passed

Closes #23855 from nandorKollar/SPARK-26930.

Lead-authored-by: nandorKollar <nandorKollar@users.noreply.github.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Nandor Kollar <nkollar@cloudera.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-22 14:07:55 +08:00
Dongjoon Hyun ffef3d4074 [SPARK-26950][SQL][TEST] Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
## What changes were proposed in this pull request?

Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exists more NaN values with different binary presentations.

```scala
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)
```

Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference between `UnsafeRow` binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/

## How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes #23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-22 12:25:26 +08:00
gengjiaan f9776e3892 [MINOR][SQL] Fix typo in exception about set table properties.
## What changes were proposed in this pull request?

The function of the method named verifyTableProperties is

`If the given table properties contains datasource properties, throw an exception. We will do this check when create or alter a table, i.e. when we try to write table metadata to Hive metastore.`

But the message of AnalysisException in verifyTableProperties contains one typo and one unsuited word.
So I change the exception from

`Cannot persistent ${table.qualifiedName} into hive metastore`

to

`Cannot persist ${table.qualifiedName} into Hive metastore`

## How was this patch tested?

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #23574 from beliefer/incorrect-analysis-exception.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-21 22:13:47 -06:00
Dongjoon Hyun 6bd995b101 [SPARK-26958][SQL][TEST] Add NestedSchemaPruningBenchmark
## What changes were proposed in this pull request?

This adds `NestedSchemaPruningBenchmark` to show the nested schema pruning performance clearly and to verify new PR's performance benefit and to prevent the future performance degradation.

## How was this patch tested?

Manually run the benchmark.

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.NestedSchemaPruningBenchmark"
```

Closes #23862 from dongjoon-hyun/SPARK-NESTED-SCHEMA-PRUNING-BM.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-02-21 23:39:36 +00:00
Dave DeCaprio 17d0cfcaa4 [SPARK-26917][SQL] Cache lock recache by condition
## What changes were proposed in this pull request?

Related to SPARK-26617 and SPARK-26548.  There was a new location we found where we were still seeing the locks.  We traced it to the recacheByCondition function.  In this PR I have changed that function so that the writeLock is not held while the condition is being evaluated.

cloud-fan & gatorsmile This is a further tweak to the other cache PRs we have done (which have helped us tremendously).

## How was this patch tested?

Has been tested on a live system where the blocking was causing major issues and it is working well.
CacheManager has no explicit unit test but is used in many places internally as part of the SharedState.

Closes #23833 from DaveDeCaprio/cache-lock-recacheByCondition.

Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu>
Co-authored-by: David DeCaprio <daved@alum.mit.edu>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-21 09:04:50 -06:00
Shixiong Zhu 77b99af573
[SPARK-26824][SS] Fix the checkpoint location and _spark_metadata when it contains special chars
## What changes were proposed in this pull request?

When a user specifies a checkpoint location or a file sink output using a path containing special chars that need to be escaped in a path, the streaming query will store checkpoint and file sink metadata in a wrong place. In this PR, I uploaded a checkpoint that was generated by the following codes using Spark 2.4.0 to show this issue:

```
implicit val s = spark.sqlContext
val input = org.apache.spark.sql.execution.streaming.MemoryStream[Int]
input.addData(1, 2, 3)
val q = input.toDF.writeStream.format("parquet").option("checkpointLocation", ".../chk %#chk").start(".../output %#output")
q.stop()
```
Here is the structure of the directory:
```
sql/core/src/test/resources/structured-streaming/escaped-path-2.4.0
├── chk%252520%252525%252523chk
│   ├── commits
│   │   └── 0
│   ├── metadata
│   └── offsets
│       └── 0
├── output %#output
│   └── part-00000-97f675a2-bb82-4201-8245-05f3dae4c372-c000.snappy.parquet
└── output%20%25%23output
    └── _spark_metadata
        └── 0
```

In this checkpoint, the user specified checkpoint location is `.../chk %#chk` but the real path to store the checkpoint is `.../chk%252520%252525%252523chk` (this is generated by escaping the original path three times). The user specified output path is `.../output %#output` but the path to store `_spark_metadata` is `.../output%20%25%23output/_spark_metadata` (this is generated by escaping the original path once). The data files are still in the correct path (such as `.../output %#output/part-00000-97f675a2-bb82-4201-8245-05f3dae4c372-c000.snappy.parquet`).

This checkpoint will be used in unit tests in this PR.

The fix is just simply removing improper `Path.toUri` calls to fix the issue.

However, as the user may not read the release note and is not aware of this checkpoint location change, if they upgrade Spark without moving checkpoint to the new location, their query will just start from the scratch. In order to not surprise the users, this PR also adds a check to **detect the impacted paths and throws an error** to include the migration guide. This check can be turned off by an internal sql conf `spark.sql.streaming.checkpoint.escapedPathCheck.enabled`. Here are examples of errors that will be reported:

- Streaming checkpoint error:
```
Error: we detected a possible problem with the location of your checkpoint and you
likely need to move it before restarting this query.

Earlier version of Spark incorrectly escaped paths when writing out checkpoints for
structured streaming. While this was corrected in Spark 3.0, it appears that your
query was started using an earlier version that incorrectly handled the checkpoint
path.

Correct Checkpoint Directory: /.../chk %#chk
Incorrect Checkpoint Directory: /.../chk%252520%252525%252523chk

Please move the data from the incorrect directory to the correct one, delete the
incorrect directory, and then restart this query. If you believe you are receiving
this message in error, you can disable it with the SQL conf
spark.sql.streaming.checkpoint.escapedPathCheck.enabled.
```

- File sink error (`_spark_metadata`):
```
Error: we detected a possible problem with the location of your "_spark_metadata"
directory and you likely need to move it before restarting this query.

Earlier version of Spark incorrectly escaped paths when writing out the
"_spark_metadata" directory for structured streaming. While this was corrected in
Spark 3.0, it appears that your query was started using an earlier version that
incorrectly handled the "_spark_metadata" path.

Correct "_spark_metadata" Directory: /.../output %#output/_spark_metadata
Incorrect "_spark_metadata" Directory: /.../output%20%25%23output/_spark_metadata

Please move the data from the incorrect directory to the correct one, delete the
incorrect directory, and then restart this query. If you believe you are receiving
this message in error, you can disable it with the SQL conf
spark.sql.streaming.checkpoint.escapedPathCheck.enabled.
```

## How was this patch tested?

The new unit tests.

Closes #23733 from zsxwing/path-fix.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-02-20 15:44:20 -08:00
Maxim Gekk 331ac60f28 [SPARK-26900][SQL] Simplify truncation to quarter of year
## What changes were proposed in this pull request?

In the PR, I propose to simplify timestamp truncation to quarter of year by using *java.time* API directly. The `LocalDate` instance can be truncation to quarter timestamp via adjusting by chrono field `IsoFields.DAY_OF_QUARTER`.

## How was this patch tested?

This was checked by existing test suite - `DateTimeUtilsSuite`.

Closes #23808 from MaxGekk/date-quarter-of-year.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-20 08:55:08 -06:00
Ivan Vergiliev 096552ae4d [SPARK-26859][SQL] Fix field writer index bug in non-vectorized ORC deserializer
## What changes were proposed in this pull request?

This happens in a schema evolution use case only when a user specifies the schema manually and use non-vectorized ORC deserializer code path.

There is a bug in `OrcDeserializer.scala` that results in `null`s being set at the wrong column position, and for state from previous records to remain uncleared in next records. There are more details for when exactly the bug gets triggered and what the outcome is in the [JIRA issue](https://jira.apache.org/jira/browse/SPARK-26859).

The high-level summary is that this bug results in severe data correctness issues, but fortunately the set of conditions to expose the bug are complicated and make the surface area somewhat small.

This change fixes the problem and adds a respective test.

## How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes #23766 from IvanVergiliev/fix-orc-deserializer.

Lead-authored-by: Ivan Vergiliev <ivan.vergiliev@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-20 21:49:38 +08:00
Hyukjin Kwon 3c15d8b71c [SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame
## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for conversion from Spark DataFrame to R DataFrame.
Like PySpark side, it falls back to non-optimization code path when it's unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
df <- cache(createDataFrame(read.csv("500000.csv")))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  collect(df)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 221.32014 secs
```

```
Time difference of 15.51145 secs
```

The performance improvement was around **1426%**.

### Limitations:

- For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values. In this case, we decide to fall back to non-optimization code path.

- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. It needs improvement later.

## How was this patch tested?

Existing tests related with Arrow optimization cover this change. Also, manually tested.

Closes #23760 from HyukjinKwon/SPARK-26762.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-20 11:35:17 +08:00
Hyukjin Kwon ab850c02f7 [SPARK-26901][SQL][R] Adds child's output into references to avoid column-pruning for vectorized gapply()
## What changes were proposed in this pull request?

Currently, looks column pruning is done to vectorized `gapply()`. Given R native function could use all referred fields so it shouldn't be pruned. To avoid this, it adds child's output into `references` like `OutputConsumer`.

```
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
df <- createDataFrame(mtcars)
explain(count(groupBy(gapply(df,
                             "gear",
                             function(key, group) {
                               data.frame(gear = key[[1]], disp = mean(group$disp))
                             },
                             structType("gear double, disp double")))), TRUE)
```

**Before:**

```
== Optimized Logical Plan ==
Aggregate [count(1) AS count#41L]
+- Project
   +- FlatMapGroupsInRWithArrow [...]
      +- Project [gear#9]
         +- LogicalRDD [mpg#0, cyl#1, disp#2, hp#3, drat#4, wt#5, qsec#6, vs#7, am#8, gear#9, carb#10], false

== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)], output=[count#41L])
+- Exchange SinglePartition
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#44L])
      +- *(3) Project
         +- FlatMapGroupsInRWithArrow [...]
            +- *(2) Sort [gear#9 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(gear#9, 200)
                  +- *(1) Project [gear#9]
                     +- *(1) Scan ExistingRDD arrow[mpg#0,cyl#1,disp#2,hp#3,drat#4,wt#5,qsec#6,vs#7,am#8,gear#9,carb#10]
```

**After:**

```
== Optimized Logical Plan ==
Aggregate [count(1) AS count#91L]
+- Project
   +- FlatMapGroupsInRWithArrow [...]
      +- LogicalRDD [mpg#0, cyl#1, disp#2, hp#3, drat#4, wt#5, qsec#6, vs#7, am#8, gear#9, carb#10], false

== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)], output=[count#91L])
+- Exchange SinglePartition
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#94L])
      +- *(3) Project
         +- FlatMapGroupsInRWithArrow [...]
            +- *(2) Sort [gear#9 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(gear#9, 200)
                  +- *(1) Scan ExistingRDD arrow[mpg#0,cyl#1,disp#2,hp#3,drat#4,wt#5,qsec#6,vs#7,am#8,gear#9,carb#10]
```

Currently, it adds corrupt values for missing columns (via pruned columnar batches to Arrow writers that requires non-pruned columns) such as:

```r
...
  c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 0, 0, 4.17777978645388e-314)
  c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.04669129845114e+219)
  c(3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313,
  c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.47032822920623e-323)
...
```

which should be something like:

```r
...
  c(4, 4, 1, 2, 2, 4, 4, 1, 2, 1, 1, 2)
  c(26, 30.4, 15.8, 19.7, 15)
  c(4, 4, 8, 6, 8)
  c(120.3, 95.1, 351, 145, 301)
...
```

## How was this patch tested?

Manually tested, and unit tests were added.

The test code is basiaclly:

```r
df <- createDataFrame(mtcars)
count(gapply(df,
             c("gear"),
             function(key, group) {
                stopifnot(all(group$hp > 50))
                group
             },
             schema(df)))
```

`mtcars`'s hp is all more then 50.

```r
> mtcars$hp > 50
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE
```

However, due to corrpt value, (like 0 or 7.xxxxx), werid values were found. So, it's currently being failed as below in the master:

```
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 82 in stage 1.0 failed 1 times, most recent failure: Lost task 82.0 in stage 1.0 (TID 198, localhost, executor driver): org.apache.spark.SparkException: R worker exited unexpectedly (crashed)
 Error in computeFunc(key, inputData) : all(group$hp > 50) is not TRUE
Error in computeFunc(key, inputData) : all(group$hp > 50) is not TRUE
Error in computeFunc(key, inputData) : all(group$hp > 50) is not TRUE
```

I also compared the total length while I am here. Regular `gapply` without Arrow has some holes .. so I had to compare the results with R data frame.

Closes #23810 from HyukjinKwon/SPARK-26901.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-20 10:24:40 +08:00
yucai 743b73daf7 [SPARK-26909][FOLLOWUP][SQL] use unsafeRow.hashCode() as hash value in HashAggregate
## What changes were proposed in this pull request?

This is a followup PR for #21149.

New way uses unsafeRow.hashCode() as hash value in HashAggregate.
The unsafe row has [null bit set] etc., so the hash should be different from shuffle hash, and then we don't need a special seed.

## How was this patch tested?

UTs.

Closes #23821 from yucai/unsafe_hash.

Authored-by: yucai <yyu1@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-19 13:01:10 +08:00
Jungtaek Lim (HeartSaVioR) 865c88f9c7 [MINOR][DOC] Add note regarding proper usage of QueryExecution.toRdd
## What changes were proposed in this pull request?

This proposes adding a note on `QueryExecution.toRdd` regarding Spark's internal optimization callers would need to indicate.

## How was this patch tested?

This patch is a documentation change.

Closes #23822 from HeartSaVioR/MINOR-doc-add-note-query-execution-to-rdd.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-19 09:42:21 +08:00
Wenchen Fan f85ed9a3e5 [SPARK-26785][SQL] data source v2 API refactor: streaming write
## What changes were proposed in this pull request?

Continue the API refactor for streaming write, according to the [doc](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing).

The major changes:
1. rename `StreamingWriteSupport` to `StreamingWrite`
2. add `WriteBuilder.buildForStreaming`
3. update existing sinks, to move the creation of `StreamingWrite` to `Table`

## How was this patch tested?

existing tests

Closes #23702 from cloud-fan/stream-write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-02-18 16:17:24 -08:00
Hyukjin Kwon a0e81fcfe8 [SPARK-26744][SPARK-26744][SQL][HOTFOX] Disable schema validation tests for FileDataSourceV2 (partially revert )
## What changes were proposed in this pull request?

This PR partially revert SPARK-26744.

60caa92dea and 4dce45a599 were merged at similar time range independently. So the test failures were not caught.

- 60caa92dea happened to add a schema reading logic in writing path for overwrite mode as well.

- 4dce45a599 added some tests with overwrite modes with migrated ORC v2.

And the tests looks starting to fail.

I guess the discussion won't be short (see https://github.com/apache/spark/pull/23606#discussion_r257675083) and this PR proposes to disable the tests added at 4dce45a599 to unblock other PRs for now.

## How was this patch tested?

Existing tests.

Closes #23828 from HyukjinKwon/SPARK-26744.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-18 21:13:00 +08:00
Yuming Wang 7f53116f77 [SPARK-24570][SQL] Implement Spark own GetTablesOperation to fix SQL client tools cannot show tables
## What changes were proposed in this pull request?

For SQL client tools([DBeaver](https://dbeaver.io/))'s Navigator use [`GetTablesOperation`](a744457076/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/GetTablesOperation.java) to obtain table names.

We should use [`metadataHive`](95d172da2b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala (L52-L53)), but it use [`executionHive`](24f5bbd770/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala (L93-L95)).

This PR implement Spark own `GetTablesOperation` to use `metadataHive`.

## How was this patch tested?

unit test and manual tests

![image](https://user-images.githubusercontent.com/5399861/47430696-acf77980-d7cc-11e8-824d-f28d78f60a00.png)
![image](https://user-images.githubusercontent.com/5399861/47440576-09649400-d7e1-11e8-97a8-a96f73f70361.png)

Closes #22794 from wangyum/SPARK-24570.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-02-17 23:35:45 -08:00
Ryan Blue 60caa92dea [SPARK-26666][SQL] Support DSv2 overwrite and dynamic partition overwrite.
## What changes were proposed in this pull request?

This adds two logical plans that implement the ReplaceData operation from the [logical plans SPIP](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d). These two plans will be used to implement Spark's `INSERT OVERWRITE` behavior for v2.

Specific changes:
* Add `SupportsTruncate`, `SupportsOverwrite`, and `SupportsDynamicOverwrite` to DSv2 write API
* Add `OverwriteByExpression` and `OverwritePartitionsDynamic` plans (logical and physical)
* Add new plans to DSv2 write validation rule `ResolveOutputRelation`
* Refactor `WriteToDataSourceV2Exec` into trait used by all DSv2 write exec nodes

## How was this patch tested?

* The v2 analysis suite has been updated to validate the new overwrite plans
* The analysis suite for `OverwriteByExpression` checks that the delete expression is resolved using the table's columns
* Existing tests validate that overwrite exec plan works
* Updated existing v2 test because schema is used to validate overwrite

Closes #23606 from rdblue/SPARK-26666-add-overwrite.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-18 13:16:28 +08:00