Commit graph

5943 commits

Author SHA1 Message Date
Holden Karau 42050c3f4f [SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator
### What changes were proposed in this pull request?

This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro bench mark in the examples directory, we may wish to keep this or not.

### Why are the changes needed?

In https://issues.apache.org/jira/browse/SPARK-23961 / 5e79ae3b40 we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking.

### Does this PR introduce any user-facing change?

A new param is added to toLocalIterator

### How was this patch tested?

New unit test inside of `test_rdd.py` checks the time that the elements are evaluated at. Another test that the results remain the same are added to `test_dataframe.py`.

I also ran a micro benchmark in the examples directory `prefetch.py` which shows an improvement of ~40% in this specific use case.

>
> 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> Running timers:
>
> [Stage 32:>                                                         (0 + 1) / 1]
> Results:
>
> Prefetch time:
>
> 100.228110831
>
>
> Regular time:
>
> 188.341721614
>
>
>

Closes #25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2019-09-20 09:59:31 -07:00
Burak Yavuz eb7ee6834d [SPARK-29062][SQL] Add V1_BATCH_WRITE to the TableCapabilityChecks
### What changes were proposed in this pull request?

Currently the checks in the Analyzer require that V2 Tables have BATCH_WRITE defined for all tables that have V1 Write fallbacks. This is confusing as these tables may not have the V2 writer interface implemented yet. This PR adds this table capability to these checks.

In addition, this allows V2 tables to leverage the V1 APIs for DataFrameWriter.save if they do extend the V1_BATCH_WRITE capability. This way, these tables can continue to receive partitioning information and also perform checks for the existence of tables, and support all SaveModes.

### Why are the changes needed?

Partitioned saves through DataFrame.write are otherwise broken for V2 tables that support the V1
write API.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

V1WriteFallbackSuite

Closes #25767 from brkyvz/bwcheck.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-20 22:04:32 +08:00
Takeshi Yamamuro ec8a1a8e88 [SPARK-29122][SQL] Propagate all the SQL conf to executors in SQLQueryTestSuite
### What changes were proposed in this pull request?

This pr is to propagate all the SQL configurations to executors in `SQLQueryTestSuite`. When the propagation enabled in the tests, a potential bug below becomes apparent;
```
CREATE TABLE num_data (id int, val decimal(38,10)) USING parquet;
....
 select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4): QueryOutput(select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4),struct<>,java.lang.IllegalArgumentException
[info]   requirement failed: MutableProjection cannot use UnsafeRow for output data types: decimal(38,0)) (SQLQueryTestSuite.scala:380)
```
The root culprit is that `InterpretedMutableProjection` has incorrect validation in the interpreter mode: `validExprs.forall { case (e, _) => UnsafeRow.isFixedLength(e.dataType) }`. This validation should be the same with the condition (`isMutable`) in `HashAggregate.supportsAggregate`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L1126

### Why are the changes needed?

Bug fixes.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added tests in `AggregationQuerySuite`

Closes #25831 from maropu/SPARK-29122.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-20 21:41:09 +09:00
Jungtaek Lim (HeartSaVioR) 5e92301723 [SPARK-29161][CORE][SQL][STREAMING] Unify default wait time for waitUntilEmpty
### What changes were proposed in this pull request?

This is a follow-up of the [review comment](https://github.com/apache/spark/pull/25706#discussion_r321923311).

This patch unifies the default wait time to be 10 seconds as it would fit most of UTs (as they have smaller timeouts) and doesn't bring additional latency since it will return if the condition is met.

This patch doesn't touch the one which waits 100000 milliseconds (100 seconds), to not break anything unintentionally, though I'd rather questionable that we really need to wait for 100 seconds.

### Why are the changes needed?

It simplifies the test code and get rid of various heuristic values on timeout.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

CI build will test the patch, as it would be the best environment to test the patch (builds are running there).

Closes #25837 from HeartSaVioR/MINOR-unify-default-wait-time-for-wait-until-empty.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-19 23:11:54 -07:00
Dongjoon Hyun 5b478416f8 [SPARK-28208][SQL][FOLLOWUP] Use tryWithResource pattern
### What changes were proposed in this pull request?

This PR aims to use `tryWithResource` for ORC file.

### Why are the changes needed?

This is a follow-up to address https://github.com/apache/spark/pull/25006#discussion_r298788206 .

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #25842 from dongjoon-hyun/SPARK-28208.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-19 15:33:12 -07:00
Ryan Blue 2c775f418f [SPARK-28612][SQL] Add DataFrameWriterV2 API
## What changes were proposed in this pull request?

This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API:

* Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans.
* Only creates v2 logical plans so the behavior is always consistent.
* Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`.

Here are a few example uses of the new API:

```scala
df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()
```

## How was this patch tested?

Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans.

Closes #25681 from rdblue/SPARK-28612-add-data-frame-writer-v2.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-09-19 13:32:09 -07:00
Gengliang Wang b917a6593d [SPARK-28989][SQL] Add a SQLConf spark.sql.ansi.enabled
### What changes were proposed in this pull request?
Currently, there are new configurations for compatibility with ANSI SQL:

* `spark.sql.parser.ansi.enabled`
* `spark.sql.decimalOperations.nullOnOverflow`
* `spark.sql.failOnIntegralTypeOverflow`
This PR is to add new configuration `spark.sql.ansi.enabled` and remove the 3 options above. When the configuration is true, Spark tries to conform to the ANSI SQL specification. It will be disabled by default.

### Why are the changes needed?

Make it simple and straightforward.

### Does this PR introduce any user-facing change?

The new features for ANSI compatibility will be set via one configuration `spark.sql.ansi.enabled`.

### How was this patch tested?

Existing unit tests.

Closes #25693 from gengliangwang/ansiEnabled.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-18 22:30:28 -07:00
Maxim Gekk a6a663c437 [SPARK-29141][SQL][TEST] Use SqlBasedBenchmark in SQL benchmarks
### What changes were proposed in this pull request?

Refactored SQL-related benchmark and made them depend on `SqlBasedBenchmark`. In particular, creation of Spark session are moved into `override def getSparkSession: SparkSession`.

### Why are the changes needed?

This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor & extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniformly.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

By running the modified benchmarks.

Closes #25828 from MaxGekk/sql-benchmarks-refactoring.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-18 17:52:23 -07:00
bartosz25 b4b2e958ce [MINOR][SS][DOCS] Adapt multiple watermark policy comment to the reality
### What changes were proposed in this pull request?

Previous comment was true for Apache Spark 2.3.0. The 2.4.0 release brought multiple watermark policy and therefore stating that the 'min' is always chosen is misleading.

This PR updates the comments about multiple watermark policy. They aren't true anymore since in case of multiple watermarks, we can configure which one will be applied to the query. This change was brought with Apache Spark 2.4.0 release.

### Why are the changes needed?

It introduces some confusion about the real execution of the commented code.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The tests weren't added because the change is only about the documentation level. I affirm that the contribution is my original work and that I license the work to the project under the project's open source license.

Closes #25832 from bartosz25/fix_comments_multiple_watermark_policy.

Authored-by: bartosz25 <bartkonieczny@yahoo.fr>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-18 10:51:11 -07:00
Owen O'Malley dfb0a8bb04 [SPARK-28208][BUILD][SQL] Upgrade to ORC 1.5.6 including closing the ORC readers
## What changes were proposed in this pull request?

It upgrades ORC from 1.5.5 to 1.5.6 and adds closes the ORC readers when they aren't used to
create RecordReaders.

## How was this patch tested?

The changed unit tests were run.

Closes #25006 from omalley/spark-28208.

Lead-authored-by: Owen O'Malley <omalley@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-18 09:32:43 -07:00
sandeep katta 376e17c082 [SPARK-29101][SQL] Fix count API for csv file when DROPMALFORMED mode is selected
### What changes were proposed in this pull request?
#DataSet
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx

This PR aims to fix the below
```
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4
```

This is caused by the issue [SPARK-24645](https://issues.apache.org/jira/browse/SPARK-24645).
SPARK-24645 issue can also be solved by [SPARK-25387](https://issues.apache.org/jira/browse/SPARK-25387)

### Why are the changes needed?

SPARK-24645 caused this regression, so reverted the code as it can also be solved by SPARK-25387

### Does this PR introduce any user-facing change?
No,

### How was this patch tested?
Added UT, and also tested the bug SPARK-24645

**SPARK-24645 regression**
![image](https://user-images.githubusercontent.com/35216143/65067957-4c08ff00-d9a5-11e9-8d43-a4a23a61e8b8.png)

Closes #25820 from sandeep-katta/SPARK-29101.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-18 23:33:13 +09:00
Maxim Gekk c2734ab1fc [SPARK-29012][SQL] Support special timestamp values
### What changes were proposed in this pull request?

Supported special string values for `TIMESTAMP` type. They are simply notational shorthands that will be converted to ordinary timestamp values when read. The following string values are supported:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)`
- `today [zoneId]` - midnight today.
- `yesterday [zoneId]` -midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time.

For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```

### Why are the changes needed?

To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html)

### Does this PR introduce any user-facing change?

Previously, the parser fails on the special values with the error:
```sql
spark-sql> select timestamp 'today';
Error in query:
Cannot parse the TIMESTAMP value: today(line 1, pos 7)
```
After the changes, the special values are converted to appropriate dates:
```sql
spark-sql> select timestamp 'today';
2019-09-06 00:00:00
```

### How was this patch tested?
- Added tests to `TimestampFormatterSuite` to check parsing special values from regular strings.
- Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String`
- Uncommented tests in `timestamp.sql`

Closes #25716 from MaxGekk/timestamp-special-values.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-18 23:30:59 +09:00
Gengliang Wang 3da2786dc6 [SPARK-29096][SQL] The exact math method should be called only when there is a corresponding function in Math
### What changes were proposed in this pull request?

1. After https://github.com/apache/spark/pull/21599, if the option "spark.sql.failOnIntegralTypeOverflow" is enabled, all the Binary Arithmetic operator will used the exact version function.
However, only `Add`/`Substract`/`Multiply` has a corresponding exact function in java.lang.Math . When the option "spark.sql.failOnIntegralTypeOverflow" is enabled, a runtime exception "BinaryArithmetics must override either exactMathMethod or genCode" is thrown if the other Binary Arithmetic operators are used, such as "Divide", "Remainder".
The exact math method should be called only when there is a corresponding function in `java.lang.Math`
2. Revise the log output of casting to `Int`/`Short`
3. Enable `spark.sql.failOnIntegralTypeOverflow` for pgSQL tests in `SQLQueryTestSuite`.

### Why are the changes needed?

1. Fix the bugs of https://github.com/apache/spark/pull/21599
2. The test case of pgSQL intends to check the overflow of integer/long type. We should enable `spark.sql.failOnIntegralTypeOverflow`.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #25804 from gengliangwang/enableIntegerOverflowInSQLTest.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-18 16:59:17 +08:00
turbofei eef5e6d348 [SPARK-29113][DOC] Fix some annotation errors and remove meaningless annotations in project
### What changes were proposed in this pull request?

In this PR, I fix some annotation errors and remove meaningless annotations in project.
### Why are the changes needed?
There are some annotation errors and meaningless annotations in project.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Verified manually.

Closes #25809 from turboFei/SPARK-29113.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-18 13:12:18 +09:00
Chris Martin 05988b256e [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
### What changes were proposed in this pull request?

Adds a new cogroup Pandas UDF.  This allows two grouped dataframes to be cogrouped together and apply a (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame UDF to each cogroup.

**Example usage**

```
from pyspark.sql.functions import pandas_udf, PandasUDFType
df1 = spark.createDataFrame(
   [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
   ("time", "id", "v1"))

df2 = spark.createDataFrame(
   [(20000101, 1, "x"), (20000101, 2, "y")],
    ("time", "id", "v2"))

pandas_udf("time int, id int, v1 double, v2 string", PandasUDFType.COGROUPED_MAP)
   def asof_join(l, r):
      return pd.merge_asof(l, r, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show()

```

        +--------+---+---+---+
        |    time| id| v1| v2|
        +--------+---+---+---+
        |20000101|  1|1.0|  x|
        |20000102|  1|3.0|  x|
        |20000101|  2|2.0|  y|
        |20000102|  2|4.0|  y|
        +--------+---+---+---+

### How was this patch tested?

Added unit test test_pandas_udf_cogrouped_map

Closes #24981 from d80tb7/SPARK-27463-poc-arrow-stream.

Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-09-17 17:13:50 -07:00
Maxim Gekk 02db706090 [SPARK-29115][SQL][TEST] Add benchmarks for make_date() and make_timestamp()
### What changes were proposed in this pull request?

Added new benchmarks for `make_date()` and `make_timestamp()` to detect performance issues, and figure out functions speed on foldable arguments.
- `make_date()` is benchmarked on fully foldable arguments.
- `make_timestamp()` is benchmarked on corner case `60.0`, foldable time fields and foldable date.

### Why are the changes needed?

To find out inputs where `make_date()` and `make_timestamp()` have performance problems. This should be useful in the future optimizations of the functions and users apps.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the benchmark and manually checking generated dates/timestamps.

Closes #25813 from MaxGekk/make_datetime-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-17 15:09:16 -07:00
xy_xin 3fc52b5557 [SPARK-28950][SQL] Refine the code of DELETE
### What changes were proposed in this pull request?
This pr refines the code of DELETE, including, 1, make `whereClause` to be optional, in which case DELETE will delete all of the data of a table; 2, add more test cases; 3, some other refines.
This is a following-up of SPARK-28351.

### Why are the changes needed?
An optional where clause in DELETE respects the SQL standard.

### Does this PR introduce any user-facing change?
Yes. But since this is a non-released feature, this change does not have any end-user affects.

### How was this patch tested?
New case is added.

Closes #25652 from xianyinxin/SPARK-28950.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-18 01:14:14 +08:00
Maxim Gekk db996ccad9 [SPARK-29074][SQL] Optimize date_format for foldable fmt
### What changes were proposed in this pull request?

In the PR, I propose to create an instance of `TimestampFormatter` only once at the initialization, and reuse it inside of `nullSafeEval()` and `doGenCode()` in the case when the `fmt` parameter is foldable.

### Why are the changes needed?

The changes improve performance of the `date_format()` function.

Before:
```
format date:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
format date wholestage off                    7180 / 7181          1.4         718.0       1.0X
format date wholestage on                     7051 / 7194          1.4         705.1       1.0X
```

After:
```
format date:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
format date wholestage off                    4787 / 4839          2.1         478.7       1.0X
format date wholestage on                     4736 / 4802          2.1         473.6       1.0X
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

By existing test suites `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #25782 from MaxGekk/date_format-foldable.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-17 16:00:10 +09:00
Takeshi Yamamuro 95073fb62b [SPARK-29008][SQL] Define an individual method for each common subexpression in HashAggregateExec
### What changes were proposed in this pull request?

This pr proposes to define an individual method for each common subexpression in HashAggregateExec. In the current master, the common subexpr elimination code in HashAggregateExec is expanded in a single method; 4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)

The method size can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, in a query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`,

the current master generates;
```
/* 098 */   private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 099 */     // do aggregate
/* 100 */     // common sub-expressions
/* 101 */     int agg_value_6 = -1;
/* 102 */
/* 103 */     agg_value_6 = agg_expr_0_0 + agg_expr_1_0;
/* 104 */
/* 105 */     int agg_value_5 = -1;
/* 106 */
/* 107 */     agg_value_5 = agg_value_6 + agg_expr_2_0;
/* 108 */     boolean agg_isNull_4 = false;
/* 109 */     long agg_value_4 = -1L;
/* 110 */     if (!false) {
/* 111 */       agg_value_4 = (long) agg_value_5;
/* 112 */     }
/* 113 */     int agg_value_10 = -1;
/* 114 */
/* 115 */     agg_value_10 = agg_expr_0_0 + agg_expr_1_0;
/* 116 */     // evaluate aggregate functions and update aggregation buffers
/* 117 */     agg_doAggregate_sum_0(agg_value_10);
/* 118 */     agg_doAggregate_avg_0(agg_value_4, agg_isNull_4);
/* 119 */
/* 120 */   }
```

On the other hand, this pr generates;
```
/* 121 */   private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 122 */     // do aggregate
/* 123 */     // common sub-expressions
/* 124 */     long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0);
/* 125 */     int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0);
/* 126 */     // evaluate aggregate functions and update aggregation buffers
/* 127 */     agg_doAggregate_sum_0(agg_subExprValue_1);
/* 128 */     agg_doAggregate_avg_0(agg_subExprValue_0);
/* 129 */
/* 130 */   }
```

I run some micro benchmarks for this pr;
```
(base) maropu~:$system_profiler SPHardwareDataType
Hardware:
    Hardware Overview:
      Processor Name: Intel Core i5
      Processor Speed: 2 GHz
      Number of Processors: 1
      Total Number of Cores: 2
      L2 Cache (per Core): 256 KB
      L3 Cache: 4 MB
      Memory: 8 GB

(base) maropu~:$java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

(base) maropu~:$ /bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shurtitions=1 -v

val numCols = 40
val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" }
spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t")

val aggExprs = (2 until numCols).map { i =>
  (0 until i).map(d => s"_c$d")
    .mkString("AVG(", " + ", ")")
}

// Drops the time of a first run then pick that of a second run
timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() }

// the master
maxCodeGen: 12957
Elapsed time: 36.309858661s

// this pr
maxCodeGen=4184
Elapsed time: 2.399490285s
```

### Why are the changes needed?

To avoid the too-long-function issue in JVMs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests in `WholeStageCodegenSuite`

Closes #25710 from maropu/SplitSubexpr.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-17 11:09:55 +09:00
Takeshi Yamamuro 6297287dfa [SPARK-29061][SQL] Prints bytecode statistics in debugCodegen
### What changes were proposed in this pull request?

This pr proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in debug prints, `debugCodegen`. Since these metrics are critical for codegen framework developments, I think its worth printing there. This pr intends to enable `debugCodegen` to print these metrics as following;
```
scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
...
```

### Why are the changes needed?

For efficient developments

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested

Closes #25766 from maropu/PrintBytecodeStats.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-16 21:48:07 +08:00
Maxim Gekk 1b7afc0c98 [SPARK-28471][SQL][DOC][FOLLOWUP] Fix year patterns in the comments of date-time expressions
### What changes were proposed in this pull request?

In the PR, I propose to fix comments of date-time expressions, and replace the `yyyy` pattern by `uuuu` when the implementation supposes the former one.

### Why are the changes needed?

To make comments consistent to implementations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

By running Scala Style checker.

Closes #25796 from MaxGekk/year-pattern-uuuu-followup.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 11:02:15 -07:00
Dongjoon Hyun 13b77e52d2 Revert "[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping"
This reverts commit 850833fa17.
2019-09-14 00:09:45 -07:00
WeichenXu 5631a96367 [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection
### What changes were proposed in this pull request?
The `Column.isInCollection()` with a large size collection will generate an expression with large size children expressions. This make analyzer and optimizer take a long time to run.
In this PR, in `isInCollection()` function, directly generate `InSet` expression, avoid generating too many children expressions.

### Why are the changes needed?
`Column.isInCollection()` with a large size collection sometimes become a bottleneck when running sql.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually benchmark it in spark-shell:
```
def testExplainTime(collectionSize: Int) = {
        val df = spark.range(10).withColumn("id2", col("id") + 1)
        val list = Range(0, collectionSize).toList
        val startTime = System.currentTimeMillis()
        df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
        val elapsedTime = System.currentTimeMillis() - startTime
        println(s"cost time: ${elapsedTime}ms")
}
```
Then test on collection size 5, 10, 100, 1000, 10000, test result is:

collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms

Closes #25754 from WeichenXu123/improve_in_collection.

Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-12 17:23:08 -07:00
maryannxue c56a012bc8 [SPARK-29060][SQL] Add tree traversal helper for adaptive spark plans
### What changes were proposed in this pull request?
This PR adds a utility class `AdaptiveSparkPlanHelper` which provides methods related to tree traversal of an `AdaptiveSparkPlanExec` plan. Unlike their counterparts in `TreeNode` or
`QueryPlan`, these methods traverse down leaf nodes of adaptive plans, i.e., `AdaptiveSparkPlanExec` and `QueryStageExec`.

### Why are the changes needed?
This utility class can greatly simplify tree traversal code for adaptive spark plans.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Refined `AdaptiveQueryExecSuite` with the help of the new utility methods.

Closes #25764 from maryannxue/aqe-utils.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-12 21:49:21 +08:00
Maxim Gekk 8e9fafbb21 [SPARK-29065][SQL][TEST] Extend EXTRACT benchmark
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new ones for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR https://github.com/apache/spark/pull/25410 where some of sub-expressions take `DATE` column as the input (`Millennium`, `Year` and etc.) but others require `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude overhead of implicit conversions `DATE` <-> `TIMESTAMP`.

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?
- Regenerated results of `ExtractBenchmark`

Closes #25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-12 21:32:35 +09:00
Wenchen Fan 053dd858d3 [SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes
### What changes were proposed in this pull request?

reorganize the packages of DS v2 interfaces/classes:
1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes
2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes
3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes
4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes

### Why are the changes needed?

Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #25700 from cloud-fan/package.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-12 19:59:34 +08:00
LantaoJin 6768431c97 [SPARK-29045][SQL][TESTS] Drop table to avoid test failure in SQLMetricsSuite
### What changes were proposed in this pull request?

In method `SQLMetricsTestUtils.testMetricsDynamicPartition()`, there is a CREATE TABLE sentence without `withTable` block. It causes test failure if use same table name in other unit tests.

### Why are the changes needed?
To avoid "table already exists" in tests.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exist UT

Closes #25752 from LantaoJin/SPARK-29045.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-11 23:05:03 -07:00
Jungtaek Lim (HeartSaVioR) 850833fa17 [SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping
# What changes were proposed in this pull request?

This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.

Note that it can't be encountered easily as `SparkContext.stop()` blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)

### Why are the changes needed?

The bug brings NPE.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE.

Closes #25753 from HeartSaVioR/SPARK-29046.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 11:16:33 +09:00
dengziming 8f632d7045 [MINOR][DOCS] Fix few typos in the java docs
JIRA :https://issues.apache.org/jira/browse/SPARK-29050
'a hdfs' change into  'an hdfs'
'an unique' change into 'a unique'
'an url' change into 'a url'
'a error' change into 'an error'

Closes #25756 from dengziming/feature_fix_typos.

Authored-by: dengziming <dengziming@growingio.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 09:30:03 +09:00
Wenchen Fan eec728a0d4 [SPARK-29057][SQL] remove InsertIntoTable
### What changes were proposed in this pull request?

Remove `InsertIntoTable` and replace it's usage by `InsertIntoStatement`

### Why are the changes needed?

`InsertIntoTable` and `InsertIntoStatement` are almost identical (except some namings). It doesn't make sense to keep 2 identical plans. After the removal of `InsertIntoTable`, the analysis process becomes:
1. parser creates `InsertIntoStatement`
2. v2 rule `ResolveInsertInto` converts `InsertIntoStatement` to v2 commands.
3. v1 rules like `DataSourceAnalysis` and `HiveAnalysis` convert `InsertIntoStatement` to v1 commands.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #25763 from cloud-fan/remove.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 09:24:36 +09:00
Terry Kim bf43541c92 [SPARK-28856][SQL] Implement SHOW DATABASES for Data Source V2 Tables
### What changes were proposed in this pull request?
Implement the SHOW DATABASES logical and physical plans for data source v2 tables.

### Why are the changes needed?
To support `SHOW DATABASES` SQL commands for v2 tables.

### Does this PR introduce any user-facing change?
`spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set:
```
+---------------+
|      namespace|
+---------------+
|            ns1|
|      ns1.ns1_1|
|ns1.ns1_1.ns1_2|
+---------------+
```

### How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.

Closes #25601 from imback82/show_databases.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-10 21:23:57 +08:00
Marco Gaido ca6f693ef1 [SPARK-28939][SQL][FOLLOWUP] Avoid useless Properties
### What changes were proposed in this pull request?

Removes useless `Properties` created according to hvanhovell 's suggestion.

### Why are the changes needed?

Avoid useless code.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

existing UTs

Closes #25742 from mgaido91/SPARK-28939_followup.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-10 20:47:55 +09:00
Dongjoon Hyun 580c6266fb [SPARK-28939][SQL][FOLLOWUP] Fix JDK11 compilation due to ambiguous reference
### What changes were proposed in this pull request?

This PR aims to recover the JDK11 compilation with a workaround.
For now, the master branch is broken like the following due to a [Scala bug](https://github.com/scala/bug/issues/10418) which is fixed in `2.13.0-RC2`.
```
[ERROR] [Error] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala:42: ambiguous reference to overloaded definition,
both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
match argument types (java.util.Map[String,String])
```

- https://github.com/apache/spark/actions (JDK11 build monitoring)

### Why are the changes needed?

This workaround recovers JDK11 compilation.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual build with JDK11 because this is JDK11 compilation fix.
- Jenkins builds with JDK8 and tests with JDK11.
- GitHub action will verify this after merging.

Closes #25738 from dongjoon-hyun/SPARK-28939.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-09 20:30:49 -07:00
Wenchen Fan c2d8ee9c54 [SPARK-28878][SQL][FOLLOWUP] Remove extra project for DSv2 streaming scan
### What changes were proposed in this pull request?

Remove the project node if the streaming scan is columnar

### Why are the changes needed?

This is a followup of https://github.com/apache/spark/pull/25586. Batch and streaming share the same DS v2 read API so both can support columnar reads. We should apply #25586 to streaming scan as well.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #25727 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-10 11:01:57 +08:00
gengjiaan aafce7ebff [SPARK-28412][SQL] ANSI SQL: OVERLAY function support byte array
## What changes were proposed in this pull request?

This is a ANSI SQL and feature id is `T312`

```
<binary overlay function> ::=
OVERLAY <left paren> <binary value expression> PLACING <binary value expression>
FROM <start position> [ FOR <string length> ] <right paren>
```

This PR related to https://github.com/apache/spark/pull/24918 and support treat byte array.

ref: https://www.postgresql.org/docs/11/functions-binarystring.html

## How was this patch tested?

new UT.
There are some show of the PR on my production environment.
```
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
Spark_SQL
Time taken: 0.285 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7);
Spark CORE
Time taken: 0.202 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0);
Spark ANSI SQL
Time taken: 0.165 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4);
Structured SQL
Time taken: 0.141 s
```

Closes #25172 from beliefer/ansi-overlay-byte-array.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-10 08:16:18 +09:00
Sean Owen 6378d4bc06 [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3
### What changes were proposed in this pull request?

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc

Notes:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.

### Why are the changes needed?

Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.

### Does this PR introduce any user-facing change?

Yes, in that deprecated items are removed from some public APIs.

### How was this patch tested?

Existing tests.

Closes #25684 from srowen/SPARK-28980.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-09 10:19:40 -05:00
Marco Gaido 3d6b33a49a [SPARK-28939][SQL] Propagate SQLConf for plans executed by toRdd
### What changes were proposed in this pull request?

The PR proposes to create a custom `RDD` which enables to propagate `SQLConf` also in cases not tracked by SQL execution, as it happens when a `Dataset` is converted to and RDD either using `.rdd` or `.queryExecution.toRdd` and then the returned RDD is used to invoke actions on it.

In this way, SQL configs are effective also in these cases, while earlier they were ignored.

### Why are the changes needed?

Without this patch, all the times `.rdd` or `.queryExecution.toRdd` are used, all the SQL configs set are ignored. An example of a reproducer can be:
```
  withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") {
    val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
    df.createOrReplaceTempView("spark64kb")
    val data = spark.sql("select * from spark64kb limit 10")
    // Subexpression elimination is used here, despite it should have been disabled
    data.describe()
  }
```

### Does this PR introduce any user-facing change?

When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the execute. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy.

### How was this patch tested?

added UT

Closes #25643 from mgaido91/SPARK-28939.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-09 21:20:34 +08:00
Wenchen Fan abec6d7763 [SPARK-28341][SQL] create a public API for V2SessionCatalog
## What changes were proposed in this pull request?

The `V2SessionCatalog` has 2 functionalities:
1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`.
2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog).

To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage.

This PR does 2 things:
1. refine the document of the config `spark.sql.catalog.session`.
2. add a public abstract class `CatalogExtension` for users to write implementations.

TODOs for followup PRs:
1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one.
2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names.

## How was this patch tested?

existing tests

Closes #25104 from cloud-fan/session-catalog.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-09 21:14:37 +08:00
turbofei d4eca7c99d [SPARK-29000][SQL] Decimal precision overflow when don't allow precision loss
### What changes were proposed in this pull request?

When we set spark.sql.decimalOperations.allowPrecisionLoss to false.

For the sql below, the result will overflow and return null.

Case a:

`select case when 1=2 then 1 else 1.000000000000000000000001 end * 1`
Similar  with the division operation.

This sql below will lost precision.

Case b:

`select case when 1=2 then 1 else 1.000000000000000000000001 end / 1`

Let us check the code of TypeCoercion.scala.

 a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L864-L875).

For binaryOperator, if the two operands have differnt datatype, rule ImplicitTypeCasts will find a  common type and cast both operands to common type.

So, for these cases menthioned,  their left operand is Decimal(34, 24) and right operand is Literal.

Their common type is Decimal(34,24), and Literal(1) will be casted to Decimal(34,24).

Then both operands are decimal type and they will be processed by decimalAndDecimal method of DecimalPrecision class.

Let's check the relative code.

a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (L123-L153)

When we don't allow precision loss, the result type of multiply operation in case a is Decimal(38, 38), and that of division operation in case b is Decimal(38, 20).

Then the multi operation in case a will overflow and division operation in case b will lost precision.

In this PR, we skip to handle the  binaryOperator if DecimalType operands are involved and rule `DecimalPrecision` will handle it.

### Why are the changes needed?

Data will corrupt without this change.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #25701 from turboFei/SPARK-29000.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-09 13:50:17 +08:00
Yuming Wang a75467432e [SPARK-28000][SQL][TEST] Port comments.sql
## What changes were proposed in this pull request?

This PR is to port comments.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/comments.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/expected/comments.out

When porting the test cases, found one PostgreSQL specific features that do not exist in Spark SQL:
[SPARK-28880](https://issues.apache.org/jira/browse/SPARK-28880): ANSI SQL: Bracketed comments

## How was this patch tested?

N/A

Closes #25588 from wangyum/SPARK-28000.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-08 10:32:08 +09:00
Takeshi Yamamuro ff5fa5873e [SPARK-21870][SQL][FOLLOW-UP] Clean up string template formats for generated code in HashAggregateExec
### What changes were proposed in this pull request?

This pr cleans up string template formats for generated code in HashAggregateExec. This changes comes from rednaxelafx comment: https://github.com/apache/spark/pull/20965#discussion_r316418729

### Why are the changes needed?

To improve code-readability.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #25714 from maropu/SPARK-21870-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-07 07:16:36 +09:00
maryannxue b2f06608b7 [SPARK-29002][SQL] Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions
### What changes were proposed in this pull request?
This PR aims to avoid AQE regressions by avoiding changing a sort merge join to a broadcast hash join when the expected build plan has a high ratio of empty partitions, in which case sort merge join can actually perform faster. This PR achieves this by adding an internal join hint in order to let the planner know which side has this high ratio of empty partitions and it should avoid planning it as a build plan of a BHJ. Still, it won't affect the other side if the other side qualifies for a build plan of a BHJ.

### Why are the changes needed?
It is a performance improvement for AQE.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #25703 from maryannxue/aqe-demote-bhj.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-06 12:46:54 -07:00
Maxim Gekk 67b4329fb0 [SPARK-28690][SQL] Add date_part function for timestamps/dates
## What changes were proposed in this pull request?

In the PR, I propose new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`:
```
date_part('field', source)
```
and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).

The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are:
- `millennium` - the current millennium for given date (or a timestamp implicitly casted to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
- `century` - the current millennium for given date (or timestamp). The first century starts at 0001-01-01 AD.
- `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10.
- isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
- `year`, `month`, `day`, `hour`, `minute`, `second`
- `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
- `quarter` - the quarter of the year (1 - 4)
- `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
- `dow` - the day of the week as Sunday (0) to Saturday (6)
- `isodow` - the day of the week as Monday (1) to Sunday (7)
- `doy` - the day of the year (1 - 365/366)
- `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
- `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
- `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision.

Here are examples:
```sql
spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224
```

I changed implementation of `extract` to re-use `date_part()` internally.

## How was this patch tested?

Added `date_part.sql` and regenerated results of `extract.sql`.

Closes #25410 from MaxGekk/date_part.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-06 23:36:00 +09:00
Takeshi Yamamuro cb0cddffe9 [SPARK-21870][SQL] Split aggregation code into small functions
## What changes were proposed in this pull request?
This pr proposed to split aggregation code into small functions in `HashAggregateExec`. In #18810, we got performance regression if JVMs didn't compile too long functions. I checked and I found the codegen of `HashAggregateExec` frequently goes over the limit when a query has too many aggregate functions (e.g., q66 in TPCDS).

The current master places all the generated aggregation code in a single function. In this pr, I modified the code to assign an individual function for each aggregate function (e.g., `SUM`
 and `AVG`). For example, in a query
`SELECT SUM(a), AVG(a) FROM VALUES(1) t(a)`, the proposed code defines two functions
for `SUM(a)` and `AVG(a)` as follows;

- generated  code with this pr (https://gist.github.com/maropu/812990012bc967a78364be0fa793f559):
```
/* 173 */   private void agg_doConsume_0(InternalRow inputadapter_row_0, long agg_expr_0_0, boolean agg_exprIsNull_0_0, double agg_expr_1_0, boolean agg_exprIsNull_1_0, long agg_expr_2_0, boolean agg_exprIsNull_2_0) throws java.io.IOException {
/* 174 */     // do aggregate
/* 175 */     // common sub-expressions
/* 176 */
/* 177 */     // evaluate aggregate functions and update aggregation buffers
/* 178 */     agg_doAggregate_sum_0(agg_exprIsNull_0_0, agg_expr_0_0);
/* 179 */     agg_doAggregate_avg_0(agg_expr_1_0, agg_exprIsNull_1_0, agg_exprIsNull_2_0, agg_expr_2_0);
/* 180 */
/* 181 */   }
...
/* 071 */   private void agg_doAggregate_avg_0(double agg_expr_1_0, boolean agg_exprIsNull_1_0, boolean agg_exprIsNull_2_0, long agg_expr_2_0) throws java.io.IOException {
/* 072 */     // do aggregate for avg
/* 073 */     // evaluate aggregate function
/* 074 */     boolean agg_isNull_19 = true;
/* 075 */     double agg_value_19 = -1.0;
...
/* 114 */   private void agg_doAggregate_sum_0(boolean agg_exprIsNull_0_0, long agg_expr_0_0) throws java.io.IOException {
/* 115 */     // do aggregate for sum
/* 116 */     // evaluate aggregate function
/* 117 */     agg_agg_isNull_11_0 = true;
/* 118 */     long agg_value_11 = -1L;
```

- generated code in the current master (https://gist.github.com/maropu/e9d772af2c98d8991a6a5f0af7841760)
```
/* 059 */   private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0) throws java.io.IOException {
/* 060 */     // do aggregate
/* 061 */     // common sub-expressions
/* 062 */     boolean agg_isNull_4 = false;
/* 063 */     long agg_value_4 = -1L;
/* 064 */     if (!false) {
/* 065 */       agg_value_4 = (long) agg_expr_0_0;
/* 066 */     }
/* 067 */     // evaluate aggregate function
/* 068 */     agg_agg_isNull_7_0 = true;
/* 069 */     long agg_value_7 = -1L;
/* 070 */     do {
/* 071 */       if (!agg_bufIsNull_0) {
/* 072 */         agg_agg_isNull_7_0 = false;
/* 073 */         agg_value_7 = agg_bufValue_0;
/* 074 */         continue;
/* 075 */       }
/* 076 */
/* 077 */       boolean agg_isNull_9 = false;
/* 078 */       long agg_value_9 = -1L;
/* 079 */       if (!false) {
/* 080 */         agg_value_9 = (long) 0;
/* 081 */       }
/* 082 */       if (!agg_isNull_9) {
/* 083 */         agg_agg_isNull_7_0 = false;
/* 084 */         agg_value_7 = agg_value_9;
/* 085 */         continue;
/* 086 */       }
/* 087 */
/* 088 */     } while (false);
/* 089 */
/* 090 */     long agg_value_6 = -1L;
/* 091 */
/* 092 */     agg_value_6 = agg_value_7 + agg_value_4;
/* 093 */     boolean agg_isNull_11 = true;
/* 094 */     double agg_value_11 = -1.0;
/* 095 */
/* 096 */     if (!agg_bufIsNull_1) {
/* 097 */       agg_agg_isNull_13_0 = true;
/* 098 */       double agg_value_13 = -1.0;
/* 099 */       do {
/* 100 */         boolean agg_isNull_14 = agg_isNull_4;
/* 101 */         double agg_value_14 = -1.0;
/* 102 */         if (!agg_isNull_4) {
/* 103 */           agg_value_14 = (double) agg_value_4;
/* 104 */         }
/* 105 */         if (!agg_isNull_14) {
/* 106 */           agg_agg_isNull_13_0 = false;
/* 107 */           agg_value_13 = agg_value_14;
/* 108 */           continue;
/* 109 */         }
/* 110 */
/* 111 */         boolean agg_isNull_15 = false;
/* 112 */         double agg_value_15 = -1.0;
/* 113 */         if (!false) {
/* 114 */           agg_value_15 = (double) 0;
/* 115 */         }
/* 116 */         if (!agg_isNull_15) {
/* 117 */           agg_agg_isNull_13_0 = false;
/* 118 */           agg_value_13 = agg_value_15;
/* 119 */           continue;
/* 120 */         }
/* 121 */
/* 122 */       } while (false);
/* 123 */
/* 124 */       agg_isNull_11 = false; // resultCode could change nullability.
/* 125 */
/* 126 */       agg_value_11 = agg_bufValue_1 + agg_value_13;
/* 127 */
/* 128 */     }
/* 129 */     boolean agg_isNull_17 = false;
/* 130 */     long agg_value_17 = -1L;
/* 131 */     if (!false && agg_isNull_4) {
/* 132 */       agg_isNull_17 = agg_bufIsNull_2;
/* 133 */       agg_value_17 = agg_bufValue_2;
/* 134 */     } else {
/* 135 */       boolean agg_isNull_20 = true;
/* 136 */       long agg_value_20 = -1L;
/* 137 */
/* 138 */       if (!agg_bufIsNull_2) {
/* 139 */         agg_isNull_20 = false; // resultCode could change nullability.
/* 140 */
/* 141 */         agg_value_20 = agg_bufValue_2 + 1L;
/* 142 */
/* 143 */       }
/* 144 */       agg_isNull_17 = agg_isNull_20;
/* 145 */       agg_value_17 = agg_value_20;
/* 146 */     }
/* 147 */     // update aggregation buffer
/* 148 */     agg_bufIsNull_0 = false;
/* 149 */     agg_bufValue_0 = agg_value_6;
/* 150 */
/* 151 */     agg_bufIsNull_1 = agg_isNull_11;
/* 152 */     agg_bufValue_1 = agg_value_11;
/* 153 */
/* 154 */     agg_bufIsNull_2 = agg_isNull_17;
/* 155 */     agg_bufValue_2 = agg_value_17;
/* 156 */
/* 157 */   }
```
You can check the previous discussion in https://github.com/apache/spark/pull/19082

## How was this patch tested?
Existing tests

Closes #20965 from maropu/SPARK-21870-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-06 11:45:14 +08:00
Mukul Murthy 3929d16604 [SPARK-26046][SS] Add StreamingQueryManager.listListeners()
### What changes were proposed in this pull request?

Add a listListeners() method to StreamingQueryManager that lists all StreamingQueryListeners that have been added to that manager.

### Why are the changes needed?

While it's best practice to keep handles on all listeners added, it's still nice to have an API to be able to list what listeners have been added to a StreamingQueryManager.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified existing unit tests to use the new API instead of using reflection.

Closes #25518 from mukulmurthy/26046-listener.

Authored-by: Mukul Murthy <mukul.murthy@gmail.com>
Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>
2019-09-05 14:27:54 -07:00
Wenchen Fan c81fd0cd61 [SPARK-28974][SQL] centralize the Data Source V2 table capability checks
### What changes were proposed in this pull request?

merge the `V2WriteSupportCheck` and `V2StreamingScanSupportCheck` to one rule: `TableCapabilityCheck`.

### Why are the changes needed?

It's a little confusing to have 2 rules to check DS v2 table capability, while one rule says it checks write and another rule says it checks streaming scan. We can clearly tell it from the rule names that the batch scan check is missing.

It's better to have a centralized place for this check, with a name that clearly says it checks table capability.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #25679 from cloud-fan/dsv2-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-05 20:22:29 +08:00
HyukjinKwon 103d50b3f6 [SPARK-28272][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base
### What changes were proposed in this pull request?

This PR proposes to port `pgSQL/aggregates_part3.sql` into UDF test base.

<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
index f102383cb4d..eff33f280cf 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
 -3,7 +3,7

 -- !query 0
-select max(min(unique1)) from tenk1
+select udf(max(min(unique1))) from tenk1
 -- !query 0 schema
 struct<>
 -- !query 0 output
 -12,11 +12,11  It is not allowed to use an aggregate function in the argument of another aggreg

 -- !query 1
-select (select count(*)
-        from (values (1)) t0(inner_c))
+select udf((select udf(count(*))
+        from (values (1)) t0(inner_c))) as col
 from (values (2),(3)) t1(outer_c)
 -- !query 1 schema
-struct<scalarsubquery():bigint>
+struct<col:bigint>
 -- !query 1 output
 1
 1
```

</p>
</details>

### Why are the changes needed?

To improve test coverage in UDFs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested via:

```bash
 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part3.sql"
```

as guided in https://issues.apache.org/jira/browse/SPARK-27921

Closes #25676 from HyukjinKwon/SPARK-28272.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-05 18:35:21 +09:00
HyukjinKwon be04c97262 [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base
### What changes were proposed in this pull request?

This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base.

<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>

```diff
```

</p>
</details>

### Why are the changes needed?

To improve test coverage in UDFs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested via:

```bash
 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql"
```

as guided in https://issues.apache.org/jira/browse/SPARK-27921

Closes #25677 from HyukjinKwon/SPARK-28971.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-05 18:34:44 +09:00
Sean Owen 36559b6525 [SPARK-28977][DOCS][SQL] Fix DataFrameReader.json docs to doc that partition column can be numeric, date or timestamp type
### What changes were proposed in this pull request?

`DataFrameReader.json()` accepts a partition column that is of numeric, date or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too.

### Why are the changes needed?

scaladoc is incorrect.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #25687 from srowen/SPARK-28977.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-05 18:32:45 +09:00
WeichenXu f8bc91f749 [SPARK-28782][SQL] Generator support in aggregate expressions
### What changes were proposed in this pull request?

Support generator in aggregate expressions.

In this PR, I check the aggregate logical plan, if its aggregateExpressions include generator, then convert this agg plan into "normal agg plan + generator plan + projection plan". I.e:
```
aggregate(with generator)
 |--child_plan
```
===>
```
project
  |--generator(resolved)
         |--aggregate
               |--child_plan
```

### Why are the changes needed?

We should support sql like:
```
select explode(array(min(a), max(a))) from t
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Unit test added.

Closes #25512 from WeichenXu123/explode_bug.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-05 16:17:49 +08:00