Commit graph

25353 commits

Author SHA1 Message Date
Maxim Gekk eecef75350 [SPARK-29355][SQL] Support timestamps subtraction
### What changes were proposed in this pull request?

Added new expression `TimestampDiff` for timestamp subtractions. It accepts 2 timestamp expressions and returns another one of the `CalendarIntervalType`. While creating an instance of `CalendarInterval`, it initializes only the microsecond field by difference of the given timestamps in microseconds, and set the `months` field to zero. Also I added an rule for conversion `Subtract` to `TimestampDiff`, and enabled already ported test queries in `postgreSQL/timestamp.sql`.

### Why are the changes needed?
To maintain feature parity with PostgreSQL which allows to get timestamp difference:
```sql
# select timestamp'today' - timestamp'yesterday';
 ?column?
----------
 1 day
(1 row)
```

### Does this PR introduce any user-facing change?
Yes, previously users got the following error from timestamp subtraction:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
Error in query: cannot resolve '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' due to data type mismatch: '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' requires (numeric or interval) type, not timestamp; line 1 pos 7;
'Project [unresolvedalias((1570136400000000 - 1570050000000000), None)]
+- OneRowRelation
```
after the changes they should get an interval:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
interval 1 days
```

### How was this patch tested?
- Added tests for `TimestampDiff` to `DateExpressionsSuite`
- By new test in `TypeCoercionSuite`.
- Enabled tests in `postgreSQL/timestamp.sql`.

Closes #26022 from MaxGekk/timestamp-diff.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-04 09:39:19 -07:00
Wenchen Fan 275e044ba8 [SPARK-29039][SQL] centralize the catalog and table lookup logic
### What changes were proposed in this pull request?

Currently we deal with different `ParsedStatement` in many places and write duplicated catalog/table lookup logic. In general the lookup logic is
1. try look up the catalog by name. If no such catalog, and default catalog is not set, convert `ParsedStatement` to v1 command like `ShowDatabasesCommand`. Otherwise, convert `ParsedStatement` to v2 command like `ShowNamespaces`.
2. try look up the table by name. If no such table, fail. If the table is a `V1Table`, convert `ParsedStatement` to v1 command like `CreateTable`. Otherwise, convert `ParsedStatement` to v2 command like `CreateV2Table`.

However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places.

This PR centralizes the catalog/table lookup logic by 3 rules.
1. `ResolveCatalogs` (in catalyst). This rule resolves v2 catalog from the multipart identifier in SQL statements, and convert the statement to v2 command if the resolved catalog is not session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), put an `UnresolvedV2Table` in the command.
2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`.
3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is session catalog. For commands that don't need to resolve the table, this rule converts the statement to v1 command directly. Otherwise, it converts the statement to v1 command if the resolved table is v1 table, and convert to v2 command if the resolved table is v2 table. Hopefully we can remove this rule eventually when v1 fallback is not needed anymore.

### Why are the changes needed?

Reduce duplicated code and make the catalog/table lookup logic consistent.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #25747 from cloud-fan/lookup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-04 16:21:13 +08:00
Yuanjian Li 93289b54f5 [SPARK-29203][TESTS][MINOR][FOLLOW UP] Add access modifier for sparkConf in SQLQueryTestSuite
### What changes were proposed in this pull request?
Add access modifier `protected` for `sparkConf` in SQLQueryTestSuite, because in the parent trait SharedSparkSession, it is protected.

### Why are the changes needed?
Code consistency.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #26019 from xuanyuanking/SPARK-29203.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-04 16:54:47 +09:00
Gengliang Wang 91747bd91b [SPARK-29326][SQL] ANSI store assignment policy: throw exception on casting failure
### What changes were proposed in this pull request?

1. With ANSI store assignment policy,  an exception is thrown on casting failure
2. Introduce a new expression `AnsiCast` for the ANSI store assignment policy, so that the store assignment policy configuration won't affect the general `Cast`.

### Why are the changes needed?

As per ANSI SQL standard, ANSI store assignment policy should throw an exception on insertion failure, such as inserting out-of-range value to a numeric field.

### Does this PR introduce any user-facing change?

With ANSI store assignment policy,  an exception is thrown on casting failure

### How was this patch tested?

Unit test

Closes #25997 from gengliangwang/newCast.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-04 15:53:38 +08:00
DB Tsai 8b71e545d5 [SPARK-29351][CORE] Avoid Full Synchronization in ShuffleMapStage
### What changes were proposed in this pull request?

This PR use read/write locks instead of `synchronized`.

### Why are the changes needed?

In one of our production streaming jobs that has more than 1k executors, and each has 20 cores, Spark spends significant portion of time (30s) in sending out the `ShuffeStatus`. We find there are two issues.

1. In driver's message loop, it's calling `serializedMapStatus` which is in sync block. When the job scales really big, it can cause the contention.
2. When the job is big, the `MapStatus` is huge as well, the serialization time and compression time is slow.

This PR aims to address the first problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Test with existing test cases.

Closes #26017 from dbtsai/readWriteLock.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-10-04 03:14:10 +00:00
HyukjinKwon 0f48aafab8 [SPARK-29339][R] Support Arrow 0.14 in vectoried dapply and gapply (test it in AppVeyor build)
### What changes were proposed in this pull request?

This PR proposes:

1. Use `is.data.frame` to check if it is a DataFrame.
2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI.

### Why are the changes needed?

1. To support SparkR with Arrow 0.14
2. To check if there's any regression and if it works correctly.

### Does this PR introduce any user-facing change?

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

**Before:**

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

**After:**

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

### How was this patch tested?

AppVeyor

Closes #25993 from HyukjinKwon/arrow-r-appveyor.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-04 08:56:45 +09:00
maryannxue 8fabbab299 [SPARK-29350] Fix BroadcastExchange reuse in Dynamic Partition Pruning
### What changes were proposed in this pull request?
Dynamic partition pruning filters are added as an in-subquery containing a `BroadcastExchangeExec` in case of a broadcast hash join. This PR makes the `ReuseExchange` rule visit in-subquery nodes, to ensure the new `BroadcastExchangeExec` added by dynamic partition pruning can be reused.

### Why are the changes needed?
This initial dynamic partition pruning PR did not enable this reuse, which means a broadcast exchange would be executed twice, in the main query and in the DPP filter.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added broadcast exchange reuse check in `DynamicPartitionPruningSuite`

Closes #26015 from maryannxue/exchange-reuse.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-10-03 16:11:32 -07:00
zhouyongjin aedf090ab7 [SPARK-25468][WEBUI][FOLLOWUP] Current page index keep style with dataTable in the spark UI
### What changes were proposed in this pull request?
Current page index keep style with dataTable  in the spark UI.
https://issues.apache.org/jira/browse/SPARK-25468

move
.paginate_button.active > a {
    color: #999999;
    text-decoration: underline;
}

add
.paginate_button.active {
  border: 1px solid #979797 !important;
  background: white linear-gradient(to bottom, #fff 0%, #dcdcdc 100%);
}

### How was this patch tested?
![image](https://user-images.githubusercontent.com/8166352/65872683-b5303f80-e3b3-11e9-9785-9e1d15b0b3cf.png)
![image](https://user-images.githubusercontent.com/8166352/65872692-ba8d8a00-e3b3-11e9-8cf6-db6f905d3387.png)

Closes #25976 from YongjinZhou/master.

Authored-by: zhouyongjin <zhouyongjin@inspur.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-03 14:03:21 -05:00
Nik Vanderhoof 6f687691ef [SPARK-28962][SPARK-27297][SQL] Add overload for filter with index to functions object
### What changes were proposed in this pull request?
Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`.

### Why are the changes needed?
See: SPARK-28962 and SPARK-27297. Specifically ueshin pointing out the discrepency here: https://github.com/apache/spark/pull/24232#issuecomment-533288653

### Does this PR introduce any user-facing change?

### How was this patch tested?
Updated the these test suites:

`test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite`
and
`org.apache.spark.sql.DataFrameFunctionsSuite`

Closes #26007 from nvander1/add_index_overload_for_filter.

Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2019-10-03 11:12:14 -07:00
Gabor Somogyi 6b5e0e2469 [SPARK-29054][SS] Invalidate Kafka consumer when new delegation token available
### What changes were proposed in this pull request?
Kafka consumers are cached. If delegation token is used and the token is expired, then exception is thrown. Such case new consumer is created in a Task retry with the latest delegation token. This can be enhanced by detecting the existence of a new delegation token. In this PR I'm detecting whether the token in the consumer is the same as the latest stored in the `UGI` (`targetServersRegex` must match not to create a consumer with another cluster's token).

### Why are the changes needed?
It would be good to avoid Task retry to pick up the latest delegation token.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing + new unit tests.
Additionally executed the following code snippet to measure `ensureConsumerHasLatestToken` time consumption:
```
    val startTimeNs = System.nanoTime()
    for (i <- 0 until 10000) {
      consumer.ensureConsumerHasLatestToken()
    }
    logInfo(s"It took ${TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTimeNs)} ms" +
      " to call ensureConsumerHasLatestToken 10000 times")
```

And here are the results:
```
19/09/11 14:58:22 INFO KafkaDataConsumerSuite: It took 1058 ms to call ensureConsumerHasLatestToken 10000 times
...
19/09/11 14:58:23 INFO KafkaDataConsumerSuite: It took 780 ms to call ensureConsumerHasLatestToken 10000 times
...
19/09/11 15:12:11 INFO KafkaDataConsumerSuite: It took 1032 ms to call ensureConsumerHasLatestToken 10000 times
...
19/09/11 15:12:11 INFO KafkaDataConsumerSuite: It took 679 ms to call ensureConsumerHasLatestToken 10000 times
```

Closes #25760 from gaborgsomogyi/SPARK-29054.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-03 09:34:31 -07:00
Dongjoon Hyun 4e0e4e51c4 [MINOR][TESTS] Rename JSONBenchmark to JsonBenchmark
### What changes were proposed in this pull request?

This PR renames `object JSONBenchmark` to `object JsonBenchmark` and the benchmark result file `JSONBenchmark-results.txt` to `JsonBenchmark-results.txt`.

### Why are the changes needed?

Since the file name doesn't match with `object JSONBenchmark`, it makes a confusion when we run the benchmark. In addition, this makes the automation difficult.
```
$ find . -name JsonBenchmark.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
```
```
$ build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JsonBenchmark"
[info] Running org.apache.spark.sql.execution.datasources.json.JsonBenchmark
[error] Error: Could not find or load main class org.apache.spark.sql.execution.datasources.json.JsonBenchmark
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is just renaming.

Closes #26008 from dongjoon-hyun/SPARK-RENAME-JSON.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 09:02:06 -07:00
Dongjoon Hyun 854a0f752e [SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1)
### What changes were proposed in this pull request?

This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the result. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered as a rough comparison.

**A. EXPECTED CASES(JDK11 is faster in general)**
- [x] BloomFilterBenchmark (JDK11 is faster except one case)
- [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC)
- [x] CSVBenchmark (JDK11 is faster except five cases)
- [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`)
- [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type)
- [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases)
- [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS)
- [x] HashedRelationMetricsBenchmark (JDK11 is faster)
- [x] JSONBenchmark (JDK11 is much faster except eight cases)
- [x] JoinBenchmark (JDK11 is faster except five cases)
- [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases)
- [x] PrimitiveArrayBenchmark (JDK11 is faster)
- [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case)
- [x] UDFBenchmark (N/A, values are too small)
- [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case)
- [x] WideTableBenchmark (JDK11 is faster except two cases)

**B. CASES WE NEED TO INVESTIGATE MORE LATER**
- [x] AggregateBenchmark (JDK11 is slower in general)
- [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`)
- [x] DataSourceReadBenchmark (JDK11 is slower in general)
- [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`)
- [x] MakeDateTimeBenchmark (JDK11 is slower except two cases)
- [x] MiscBenchmark (JDK11 is slower except ten cases)
- [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower)
- [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases)
- [x] RangeBenchmark (JDK11 is slower except one case)

`FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because it took long timer.

### Why are the changes needed?

According to the result, there are some difference between JDK8/JDK11.
This will be a baseline for the future improvement and comparison. Also, as a reproducible  environment, the following environment is used.
- Instance: `r3.xlarge`
- OS: `CentOS Linux release 7.5.1804 (Core)`
- JDK:
  - `OpenJDK Runtime Environment (build 1.8.0_222-b10)`
  - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)`

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a test-only PR. We need to run benchmark.

Closes #26003 from dongjoon-hyun/SPARK-29320.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 08:58:25 -07:00
Sean Owen 7aca0dd658 [SPARK-29296][BUILD][CORE] Remove use of .par to make 2.13 support easier; add scala-2.13 profile to enable pulling in par collections library separately, for the future
### What changes were proposed in this pull request?

Scala 2.13 removes the parallel collections classes to a separate library, so first, this establishes a `scala-2.13` profile to bring it back, for future use.

However the library enables use of `.par` implicit conversions via a new class that is not in 2.12, which makes cross-building hard. This implements a suggested workaround from https://github.com/scala/scala-parallel-collections/issues/22 to avoid `.par` entirely.

### Why are the changes needed?

To compile for 2.13 and later to work with 2.13.

### Does this PR introduce any user-facing change?

Should not, no.

### How was this patch tested?

Existing tests.

Closes #25980 from srowen/SPARK-29296.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-03 08:56:08 -05:00
Liang-Chi Hsieh 2bc3fff13b [SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0
### What changes were proposed in this pull request?

This patch upgrades cloudpickle to 1.0.0 version.

Main changes:

1. cleanup unused functions: 936f16fac8
2. Fix relative imports inside function body: 31ecdd6f57
3. Write kw only arguments to pickle: 6cb4718528

### Why are the changes needed?

We should include new bug fix like 6cb4718528, because users might use such python function in PySpark.

```python
>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[Stage 0:>                                                        (0 + 12) / 12]19/10/03 00:42:24 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 598, in main
    process()
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 590, in process
    serializer.dump_stream(out_iter, outfile)
  File "/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 513, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
TypeError: f() missing 1 required keyword-only argument: 'b'
```

After:

```python
>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[2, 3, 4]
```

### Does this PR introduce any user-facing change?

Yes. This fixes two bugs when pickling Python functions.

### How was this patch tested?

Existing tests.

Closes #26009 from viirya/upgrade-cloudpickle.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-03 19:20:51 +09:00
zero323 858bf76e35 [SPARK-29142][PYTHON][ML][FOLLOWUP][DOC] Replace incorrect :py:attr: applications
### What changes were proposed in this pull request?

This PR replaces some references with correct ones (`:py:class:`).

### Why are the changes needed?

Newly added mixins from the original PR incorrectly reference classes with `:py:attr:`.  While these classes are marked as internal, and not rendered in the standard documentation, it still makes sense to use correct roles.

### Does this PR introduce any user-facing change?

No. The changed part is not a part of generated PySpark documents.

### How was this patch tested?

Since this PR is a kind of typo fix, manually checking the patch.
We can build document for compilation test although there is no UI change.

Closes #26004 from zero323/SPARK-29142-FOLLOWUP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 02:45:44 -07:00
s71955 ee66890f30 [SPARK-28084][SQL] Resolving the partition column name based on the resolver in sql load command
### What changes were proposed in this pull request?

LOAD DATA command resolves the partition column name as case sensitive manner,
where as in insert commandthe partition column name will be resolved using
the SQLConf resolver where the names will be resolved based on `spark.sql.caseSensitive` property. Same logic can be applied for resolving the partition column names in LOAD COMMAND.

### Why are the changes needed?

It's to handle the partition column name correctly according to the configuration.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UT and manual testing.

Closes #24903 from sujith71955/master_paritionColName.

Lead-authored-by: s71955 <sujithchacko.2010@gmail.com>
Co-authored-by: sujith71955 <sujithchacko.2010@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-03 01:11:48 -07:00
HyukjinKwon 40485f4656 [SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan
### What changes were proposed in this pull request?

This PR proposes to avoid abstract classes introduced at https://github.com/apache/spark/pull/24965 but instead uses trait and object.

- `abstract class BaseArrowPythonRunner` -> `trait PythonArrowOutput` to allow mix-in

    **Before:**

    ```
    BasePythonRunner
    ├── BaseArrowPythonRunner
    │   ├── ArrowPythonRunner
    │   └── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
    ```

    **After:**

    ```
    └── BasePythonRunner
        ├── ArrowPythonRunner
        ├── CoGroupedArrowPythonRunner
        ├── PythonRunner
        └── PythonUDFRunner
    ```
- `abstract class BasePandasGroupExec ` -> `object PandasGroupUtils` to decouple

    **Before:**

    ```
    └── BasePandasGroupExec
        ├── FlatMapGroupsInPandasExec
        └── FlatMapCoGroupsInPandasExec
    ```

    **After:**

    ```
    ├── FlatMapGroupsInPandasExec
    └── FlatMapCoGroupsInPandasExec
    ```

### Why are the changes needed?

The problem is that R code path is being matched with Python side:

**Python:**

```
└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
```

**R:**

```
└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner
```

I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. Internal implementation is also similar intentionally.

`BasePandasGroupExec` case is similar as well. R (with Arrow optimization, in particular) has some duplicated codes with Pandas UDFs.

`FlatMapGroupsInRWithArrowExec` <> `FlatMapGroupsInPandasExec`
`MapPartitionsInRWithArrowExec` <> `ArrowEvalPythonExec`

In order to prepare deduplication here as well, it might better avoid changing hierarchy alone in Python side.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Locally tested existing tests. Jenkins tests should verify this too.

Closes #25989 from HyukjinKwon/SPARK-29317.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-03 16:42:37 +09:00
angerszhu 178a1f3558 [SPARK-29305][BUILD] Update LICENSE and NOTICE for Hadoop 3.2
### What changes were proposed in this pull request?
This PR update LICENSE and NOTICE for Hadoop 3.2. Hadoop 3.2 newly added jars:

```
com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5
com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5
com.fasterxml.woodstox:woodstox-core:5.0.3
com.github.stephenc.jcip:jcip-annotations:1.0-1
com.google.re2j:re2j:1.1
com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7
com.nimbusds:nimbus-jose-jwt:4.41.1
dnsjava:dnsjava:2.1.7
net.minidev:accessors-smart:1.2
net.minidev:json-smart:2.3
org.apache.commons:commons-configuration2:2.1.1
org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1
org.apache.hadoop:hadoop-hdfs-client:3.2.0
org.apache.kerby:kerb-admin:1.0.1
org.apache.kerby:kerb-client:1.0.1
org.apache.kerby:kerb-common:1.0.1
org.apache.kerby:kerb-core:1.0.1
org.apache.kerby:kerb-crypto:1.0.1
org.apache.kerby:kerb-identity:1.0.1
org.apache.kerby:kerb-server:1.0.1
org.apache.kerby:kerb-simplekdc:1.0.1
org.apache.kerby:kerb-util:1.0.1
org.apache.kerby:kerby-asn1:1.0.1
org.apache.kerby:kerby-config:1.0.1
org.apache.kerby:kerby-pkix:1.0.1
org.apache.kerby:kerby-util:1.0.1
org.apache.kerby:kerby-xdr:1.0.1
org.apache.kerby:token-provider:1.0.1
org.codehaus.woodstox:stax2-api:3.1.4
org.ehcache:ehcache:3.3.1
```
### Why are the changes needed?
We will distribute a binary release based on Hadoop 3.2 / Hive 2.3 in future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #25978 from AngersZhuuuu/SPARK-29035.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-03 01:02:41 -05:00
Jungtaek Lim (HeartSaVioR) e44d191bf7 [SPARK-29322][CORE] Enable closeFrameOnFlush on ZstdOutputStream for event log file
### What changes were proposed in this pull request?

This patch proposes to enable `closeFrameOnFlush` for ZstdOutputStream specific to event logger, so that continuous input stream of zstd is not stuck when reading "inprogress" event log file.

The issue seems to be introduced from [SPARK-26283](https://issues.apache.org/jira/browse/SPARK-26283) which addressed some bug via reading event log file with enabling continuous mode, but it changed the behavior of input stream to read open frame, which seem to wait for frame to be closed. Enabling `closeFrameOnFlush` would close frame whenever flush is called, so input stream could read the frame sooner.

As a pair of `compressedContinuousInputStream`, this patch adds `compressedContinuousOutputStream` which will be only used for event logging.

### Why are the changes needed?

Without this patch, the reader thread in SHS is stuck on reading inprogress event log file compressed with zstd until the application becomes finished.

### Does this PR introduce any user-facing change?

It might bring some overhead on each flush when writing zstd compressed event log, so some sort of performance hit could be introduced. I've restricted the case to only event logging.

### How was this patch tested?

Manually tested, via setting Spark configuration as below:

```
spark.eventLog.enabled                     true
spark.eventLog.compress                  true
spark.eventLog.compression.codec zstd
```

and start Spark application. While the application is running, load the application in SHS webpage.

Before this patch, it may succeed to replay the event log, but high likely it will be stuck and loading page will be also stuck. After this patch, SHS can properly reads the inprogress event log file.

Closes #25996 from HeartSaVioR/SPARK-29322.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-02 20:48:38 -07:00
Henry D 51d6ba7490 [SPARK-28962][SQL] Provide index argument to filter lambda functions
### What changes were proposed in this pull request?

Lambda functions to array `filter` can now take as input the index as well as the element. This behavior matches array `transform`.

### Why are the changes needed?
See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays.

### Does this PR introduce any user-facing change?
Previously filter lambdas had to look like
`filter(arr, el -> whatever)`

Now, lambdas can take an index argument as well
`filter(array, (el, idx) -> whatever)`

### How was this patch tested?
I added unit tests to `HigherOrderFunctionsSuite`.

Closes #25666 from henrydavidge/filter-idx.

Authored-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2019-10-02 13:03:06 -07:00
Nik Vanderhoof 730a17823f [SPARK-27297][SQL] Add higher order functions to scala API
## What changes were proposed in this pull request?

There is currently no existing Scala API equivalent for the higher order functions introduced in Spark 2.4.0.
 * transform
 * aggregate
 * filter
 * exists
 * forall
 * zip_with
 * map_zip_with
 * map_filter
 * transform_values
 * transform_keys

Equivalent column based functions should be added to the Scala API for org.apache.spark.sql.functions with the following signatures:

 
```scala
def transform(column: Column, f: Column => Column): Column = ???

def transform(column: Column, f: (Column, Column) => Column): Column = ???

def exists(column: Column, f: Column => Column): Column = ???

def filter(column: Column, f: Column => Column): Column = ???

def aggregate(
expr: Column,
zero: Column,
merge: (Column, Column) => Column,
finish: Column => Column): Column = ???

def aggregate(
expr: Column,
zero: Column,
merge: (Column, Column) => Column): Column = ???

def zip_with(
left: Column,
right: Column,
f: (Column, Column) => Column): Column = ???

def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ???

def transform_values(expr: Column, f: (Column, Column) => Column): Column = ???

def map_filter(expr: Column, f: (Column, Column) => Column): Column = ???

def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column = ???
```

## How was this patch tested?

I've mimicked the existing tests for the higher order functions in `org.apache.spark.sql.DataFrameFunctionsSuite` that use `expr` to test the higher order functions.

As an example of an existing test:
```scala
  test("map_zip_with function - map of primitive types") {
    val df = Seq(
      (Map(8 -> 6L, 3 -> 5L, 6 -> 2L), Map[Integer, Integer]((6, 4), (8, 2), (3, 2))),
      (Map(10 -> 6L, 8 -> 3L), Map[Integer, Integer]((8, 4), (4, null))),
      (Map.empty[Int, Long], Map[Integer, Integer]((5, 1))),
      (Map(5 -> 1L), null)
    ).toDF("m1", "m2")

    checkAnswer(df.selectExpr("map_zip_with(m1, m2, (k, v1, v2) -> k == v1 + v2)"),
      Seq(
        Row(Map(8 -> true, 3 -> false, 6 -> true)),
        Row(Map(10 -> null, 8 -> false, 4 -> null)),
        Row(Map(5 -> null)),
        Row(null)))
}
```

I've added this test that performs the same logic, but with the new column based API I've added.
```scala
    checkAnswer(df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2)),
      Seq(
        Row(Map(8 -> true, 3 -> false, 6 -> true)),
        Row(Map(10 -> null, 8 -> false, 4 -> null)),
        Row(Map(5 -> null)),
        Row(null)))
```

Closes #24232 from nvander1/feature/add_higher_order_functions_to_scala_api.

Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Co-authored-by: Nik <nikolasrvanderhoof@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2019-10-02 12:53:39 -07:00
Dongjoon Hyun 9a84fae216 [SPARK-29332][BUILD] Update zstd-jni to 1.4.3-1
### What changes were proposed in this pull request?

This PR aims to update zstd-jni library to 1.4.3-1.

### Why are the changes needed?

This will bring the latest bug fixes in zstd itself. This is independent from another on-going Spark fix.
- https://github.com/facebook/zstd/releases/tag/v1.4.3

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #26002 from dongjoon-hyun/SPARK-29332.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-02 11:37:02 -07:00
huangweiyi 85dafabeb4 [SPARK-29273][CORE] Save peakExecutionMemory value when writing task end to event log
in TaskMetrics, there is a exposed metrics peakExecutionMemory, but the value never been set. when a task is finished, it generate a SparkListenerTaskEnd event info, incuding the metrics value. actually the peakExecutionMemory is stored in the Accumulables which is a member of TaskInfo.

so when parse the SparkListenerTaskEnd event, we can get the `internal.metrics.peakExecutionMemory` value from the parsed taskInfo object, and set back to the TaskMetricsInfo with supply a setPeakExecutionMemory method.

Closes #25949 from 012huang/fix-peakExecutionMemory-metrics-value.

Lead-authored-by: huangweiyi <huangweiyi_2006@qq.com>
Co-authored-by: hwy <huangweiyi_2006@qq.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-02 08:45:12 -07:00
Terry Kim f2ead4d0b5 [SPARK-28970][SQL] Implement USE CATALOG/NAMESPACE for Data Source V2
### What changes were proposed in this pull request?
This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#)

It also exposes `currentCatalog` in `CatalogManager`.

Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog).

### Why are the changes needed?
There is currently no mechanism to change current catalog/namespace thru SQL commands.

### Does this PR introduce any user-facing change?
Yes, you can perform the following:
```scala
// Sets the current catalog to 'testcat'
spark.sql("USE CATALOG testcat")

// Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'.
spark.sql("USE ns1.ns2 IN testcat")

// Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace.
spark.sql("SHOW NAMESPACES")
```

### How was this patch tested?
Added new unit tests.

Closes #25771 from imback82/use_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-02 21:55:21 +08:00
Maxim Gekk 3b1674cb1f [SPARK-29313][SQL] Fix failure on writing to noop in benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks:
- JsonBenchmark
- CSVBenchmark
- UDFBenchmark
- MakeDateTimeBenchmark
- ExtractBenchmark
- DateTimeBenchmark
- NestedSchemaPruningBenchmark

### Why are the changes needed?
Otherwise writing to `noop` fails with:
```
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.;
[error] 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284)
```
most likely due to https://github.com/apache/spark/pull/25876

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
I generated results of `ExtractBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```

Closes #25988 from MaxGekk/noop-overwrite-mode.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-01 21:04:56 -07:00
Josh Rosen c6938eab57 [SPARK-29310][CORE][TESTS] TestMemoryManager should implement getExecutionMemoryUsageForTask()
### What changes were proposed in this pull request?

This PR updates `TestMemoryManager`, a class used only in unit tests, to override the `getExecutionMemoryUsageForTask()` and `releaseAllExecutionMemoryForTask()` methods. I also `synchronized` its state-accessing methods (to make the class thread-safe) and added some additional assertions to guard against freeing memory memory than has been allocated.

### Why are the changes needed?

Spark uses a `TestMemoryManager` class to mock out memory manager functionality in tests, allowing test authors to exercise control over certain behaviors (e.g. to simulate OOMs).

Several of Spark's test suites have memory-leak detection to ensure that all allocated memory is cleaned up at the end of each test case; this helps to guard against bugs that could cause production memory leaks. For example, see `testWithMemoryLeakDetection` in `UnsafeFixedWidthAggregationMapSuite`.

Unfortunately, however, this leak-detection logic is broken for tests which use TestMemoryManager because it does not override the `getExecutionMemoryUsageForTask()` method that is used by the leak-detection checks.

This PR fixes that problem, thereby strengthening our existing tests.

I spotted this problem while reviewing #25953: I tried introducing a change to remove a `freePage()` call (purposely inducing a memory leak) but no tests failed.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new `TestMemoryManagerSuite`, with tests covering `TestMemoryManager` itself.

Ran a subset of existing tests on my laptop and uncovered a bug in one test's `free()` calls, plus missing cleanup calls in another test suite; both of these issues are fixed in this PR.

Closes #25985 from JoshRosen/SPARK-29310-testmemorymanager-getExecutionMemoryUsageForTask.

Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: joshrosen-stripe <48632449+joshrosen-stripe@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-02 11:07:19 +08:00
Sean Owen 2ec3265ae7 [MINOR][BUILD] Decode output of commands during merge script as UTF-8 consistently
### What changes were proposed in this pull request?

In the PR merge script, decode the raw output of subprocess commands like `git` using UTF-8 encoding, consistently.

### Why are the changes needed?

The merge script occasionally fails if run with Python 2 and the output of a command like `git` contains non-ASCII characters. I think this most usually happens when a user name, for example, contains Chinese characters.

This is because the output is decoded according to `sys.getdefaultencoding()`, which is ASCII in Python 2. It's UTF-8 in Python 3, by default.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The change caused a merge that failed before to succeed.

Closes #25991 from srowen/MergePRUTF8.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-02 11:28:55 +09:00
Maxim Gekk e13880128d [SPARK-29311][SQL] Return seconds with fraction from date_part() and extract
### What changes were proposed in this pull request?

Added new expression `SecondWithFraction` which produces the `seconds` part of timestamps/dates with fractional part containing microseconds. This expression is used only in the `DatePart` expression. As the result, `date_part()` and `extract` return seconds and microseconds as the fractional part of the seconds part when `field` is `SECOND` (or synonyms).

### Why are the changes needed?

The `date_part()` and `extract` were added to maintain feature parity with PostgreSQL which has different behavior for the `SECOND` value of the `field` parameter. The fix is needed to behave in the same way. Here is PostgreSQL's output:
```sql
# SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
 date_part
-----------
  1.000001
(1 row)
```

### Does this PR introduce any user-facing change?
Yes, type of `date_part('SECOND', ...)` is changed from `INT` to `DECIMAL(8, 6)`.
Before:
```sql
spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001');
1
```
After:
```sql
spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001');
1.000001
```

### How was this patch tested?
- Added new tests to `DateExpressionSuite` for the `SecondWithFraction` expression
- Regenerated results of `date_part.sql`, `extract.sql` and `timestamp.sql`
- Updated results of `ExtractBenchmark`

Closes #25986 from MaxGekk/extract-seconds-from-timestamp.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-02 11:16:31 +09:00
Liang-Chi Hsieh 0cd436be66 [SPARK-29244][CORE] Prevent freed page in BytesToBytesMap free again
### What changes were proposed in this pull request?

When BytesToBytesMap cannot allocate a page, allocated page was freed by TaskMemoryManager. In this case, we should not keep it in longArray field of BytesToBytesMap. Otherwise it might be freed again in task completion listener of UnsafeFixedWidthAggregationMap and cause confusing error.

Note that because we catch Throwable when invoking completion listeners, this error should not affect other listeners, except for the current listener. In the completion listener of UnsafeFixedWidthAggregationMap, it only performs BytesToBytesMap.free().

BytesToBytesMap.free() does two things: freeing allocated pages and deleting spilled files. When it tries to free a freed page, this error hits and skips remaining pages (Executor.cleanUpAllAllocatedMemory will guard memory leak) and spilled files.

### Why are the changes needed?

By chance, it is possibly that we free an already freed page in BytesToBytesMap. Because we have some guards when freeing a page, a confusing error would be hit:

```
16:07:33.550 ERROR org.apache.spark.TaskContextImpl: Error in TaskCompletionListener
java.lang.AssertionError: Called freePage() on a memory block that has already been freed
        at org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332)
        at org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:129)
        at org.apache.spark.memory.MemoryConsumer.freeArray(MemoryConsumer.java:107)
        at org.apache.spark.unsafe.map.BytesToBytesMap.free(BytesToBytesMap.java:806)
        at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.free(UnsafeFixedWidthAggregationMap.java:226)
        at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.lambda$new$0(UnsafeFixedWidthAggregationMap.java:112)
        at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:119)
        at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:119)
        at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:132)
        at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:130)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:130)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:119)
        at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMapSuite.$anonfun$new$19(UnsafeFixedWidthAggregationMapSuite.scala:425)
        at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMapSuite.$anonfun$testWithMemoryLeakDetection$1(UnsafeFixedWidthAggregationMapSuite.scala:87)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
        at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
        at org.scalatest.Transformer.apply(Transformer.scala:22)
        at org.scalatest.Transformer.apply(Transformer.scala:20)
        at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
        at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
        at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
        at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
        at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
        at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
        at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
        at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
        at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
        at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
        at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
        at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
        at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
        at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
        at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
        at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
        at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
        at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
        at org.scalatest.Suite.run(Suite.scala:1147)
        at org.scalatest.Suite.run$(Suite.scala:1129)
        at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
        at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
        at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
        at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
        at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
        at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
        at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
        at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
        at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
        at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
        at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
        at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
        at sbt.ForkMain$Run$2.call(ForkMain.java:296)
        at sbt.ForkMain$Run$2.call(ForkMain.java:286)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[info] - SPARK-29244 *** FAILED *** (16 milliseconds)
[info]   org.apache.spark.util.TaskCompletionListenerException: Called freePage() on a memory block that has already been freed
[info]
[info] Previous exception in task: Unable to acquire 4096 bytes of memory, got 0
[info]  org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
[info]  org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
[info]  org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:790)
[info]  org.apache.spark.unsafe.map.BytesToBytesMap.reset(BytesToBytesMap.java:893)
[info]  org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:170)
[info]  org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:249)
[info]  org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMapSuite.$anonfun$new$19(UnsafeFixedWidthAggregationMapSuite.scala:421)
[info]  org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMapSuite.$anonfun$testWithMemoryLeakDetection$1(UnsafeFixedWidthAggregationMapSuite.scala:87)
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added unit test.

Closes #25953 from viirya/SPARK-29244.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-01 11:50:31 -07:00
Jungtaek Lim (HeartSaVioR) a4601cb44e [SPARK-29055][CORE] Update driver/executors' storage memory when block is removed from BlockManager
### What changes were proposed in this pull request?

This patch proposes to fix the issue that storage memory is not decreasing even block is removed in BlockManager. Originally the issue is found while removed broadcast doesn't reflect the storage memory on driver/executors.

AppStatusListener expects the value of memory in events on block update as "delta" so that it adjusts driver/executors' storage memory based on delta, but when removing block BlockManager reports the delta as 0, so the storage memory is not decreased. `BlockManager.dropFromMemory` deals with this correctly, so some of path of freeing memory has been updated correctly.

### Why are the changes needed?

The storage memory in metrics in AppStatusListener is now out of sync which lets end users easy to confuse as memory leak is happening.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Modified UTs. Also manually tested via running simple query repeatedly and observe executor page of Spark UI to see the value of storage memory is decreasing as well.

Please refer the description of [SPARK-29055](https://issues.apache.org/jira/browse/SPARK-29055) to get simple reproducer.

Closes #25973 from HeartSaVioR/SPARK-29055.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-01 09:41:51 -07:00
angerszhu 0cf2f48dfe [SPARK-29022][SQL] Fix SparkSQLCLI can not add jars by AddJarCommand
### What changes were proposed in this pull request?
For issue mentioned in [SPARK-29022](https://issues.apache.org/jira/browse/SPARK-29022)
Spark SQL CLI can't use class as serde class in jars add by SQL `ADD JAR`.

When we create table with `serde` class contains by jar added by SQL 'ADD JAR'.
We can create table with `serde` class construct success since we call `HiveClientImpl.createTable` under `withHiveState` method, it will add `clientLoader.classLoader` to `HiveClientImpl.state.getConf.classLoader`.

Jars added by SQL `ADD JAR` will be add to
1. `sparkSession.sharedState.jarClassLoader`.
2. 'HiveClientLoader.clientLoader.classLoader'

In Current spark-sql MODE,  `HiveClientImpl.state`  will use CliSessionState created when initialize
SparkSQLCliDriver,  When we select data from table, it will check `serde` class, when call method `HiveTableScanExec#addColumnMetadataToConf()` to check for table desc serde class.

```
val deserializer = tableDesc.getDeserializerClass.getConstructor().newInstance()
    deserializer.initialize(hiveConf, tableDesc.getProperties)
```

`getDeserializer` will  use CliSessionState's hiveConf's classLoader in `Spark SQL CLI` mode.
But when we call `ADD JAR` in spark, the jar won't be added to `Classloader of  CliSessionState' conf `, then `ClassNotFound` error happen.

So we reset `CliSessionState conf's classLoader ` to `sharedState.jarClassLoader` when `sharedState.jarClassLoader` has added jar passed by `HIVEAUXJARS`
Then when we use  `ADD JAR `  to add jar,  jar path will be added to CliSessionState's conf's ClassLoader

### Why are the changes needed?
Fix bug

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
ADD UT

Closes #25729 from AngersZhuuuu/SPARK-29015.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-01 10:09:29 -05:00
Dongjoon Hyun bd031c2173 [SPARK-29307][BUILD][TESTS] Remove scalatest deprecation warnings
### What changes were proposed in this pull request?

This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`

### Why are the changes needed?

According to the Jenkins logs, there are 118 warnings about this.
```
 grep "is deprecated" ~/consoleText | grep scalatest | wc -l
     118
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

After Jenkins passes, we need to check the Jenkins log.

Closes #25982 from dongjoon-hyun/SPARK-29307.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 21:00:11 -07:00
Jeff Evans d841b33ba3 [SPARK-25153][SQL] Improve error messages for columns with dots/periods
### What changes were proposed in this pull request?

Check schema fields to see if they contain the exact column name, add to error message in DataSet#resolve

Add test for extra error message piece

Adds an additional check in `DataSet#resolve`, in the else clause (i.e. column not resolved), that appends a suffix to the error message for the `AnalysisException` if that column name is literally found in the schema fields, to suggest to the user that it might need to be quoted via backticks.

### Why are the changes needed?

Forgetting to quote such column names is a common occurrence for new Spark users.

### Does this PR introduce any user-facing change?

No (other than the extra suffix on the error message).

### How was this patch tested?

`test` was run for `core` in `sbt`, and passed.

Closes #25807 from jeff303/SPARK-25153.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2019-09-30 18:34:44 -07:00
Dongjoon Hyun a0b3d7a323 [SPARK-29300][TESTS] Compare catalyst and avro module benchmark in JDK8/11
### What changes were proposed in this pull request?

This PR regenerate the benchmark results in `catalyst` and `avro` module in order to compare JDK8/JDK11 result.

### Why are the changes needed?

This PR aims to verify that there is no regression on JDK11.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a test-only update. We need to run the benchmark manually.

Closes #25972 from dongjoon-hyun/SPARK-29300.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 17:59:43 -07:00
Sean Owen e1ea806b30 [SPARK-29291][CORE][SQL][STREAMING][MLLIB] Change procedure-like declaration to function + Unit for 2.13
### What changes were proposed in this pull request?

Scala 2.13 emits a deprecation warning for procedure-like declarations:

```
def foo() {
 ...
```

This is equivalent to the following, so should be changed to avoid a warning:

```
def foo(): Unit = {
  ...
```

### Why are the changes needed?

It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files.

Unfortunately, that makes this quite a big change.

### Does this PR introduce any user-facing change?

No behavior change at all.

### How was this patch tested?

Existing tests.

Closes #25968 from srowen/SPARK-29291.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 10:03:23 -07:00
Chris Martin 76791b89f5 [SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF
Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon.

Specifically:

- Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated.
- Added pydoc, including example, for the use case whereby the user supplies a cogrouping function including a key.
- Added the boilerplate for doctests to cogroup.py.  Note that cogroup.py only contains the apply() function which has doctests disabled as per the  other Pandas Udfs.
- Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package.
- Some minor  formatting tweaks.

This was tested by running the appropriate unit tests.  I'm unsure as to how to check that my change will cause the documentation to be generated correctly, but it someone can describe how I can do this I'd be happy to check.

Closes #25939 from d80tb7/SPARK-27463-fixes.

Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-30 22:25:35 +09:00
Jungtaek Lim (HeartSaVioR) 39eb79ac4b [SPARK-28074][SS] Log warn message on possible correctness issue for multiple stateful operations in single query
## What changes were proposed in this pull request?

Please refer [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) to see rationalization of this patch.

This patch adds the functionality to detect the possible correct issue on multiple stateful operations in single streaming query and logs warning message to inform end users.

This patch also documents some notes to inform caveats when using multiple stateful operations in single query, and provide one known alternative.

## How was this patch tested?

Added new UTs in UnsupportedOperationsSuite to test various combination of stateful operators on streaming query.

Closes #24890 from HeartSaVioR/SPARK-28074.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-30 08:18:23 -05:00
gengjiaan 1018390542 [SPARK-29252][BUILD] Upgrade zookeeper to 3.4.14 and fix vulnerabilities
### What changes were proposed in this pull request?
The current code uses org.apache.zookeeper:zookeeper:jar:3.4.6 and it will cause a security vulnerabilities. We could get some security info from https://www.tenable.com/cve/CVE-2019-0201

This reference remind to upgrate the version of `zookeeper` to 3.4.14 or later.

### Why are the changes needed?
This PR fix the security vulnerabilities.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Exists UT.

Closes #25933 from beliefer/upgrade-zookeeper.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-30 08:16:32 -05:00
Sean Owen 28b8383a6c [SPARK-29289][BUILD] Update scalatest, scalacheck, scopt, clapper, scala-parser-combinators for 2.13
### What changes were proposed in this pull request?

Update scalatest, scalacheck, scopt, clapper, scala-parser-combinators to latest maintenance release that is also cross-published for Scala 2.13.

### Why are the changes needed?

To build in the future for Scala 2.13

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests

Closes #25967 from srowen/SPARK-29289.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-30 08:13:57 -05:00
Dongjoon Hyun 989b0c773f [SPARK-29297][TESTS] Compare core/mllib module benchmarks in JDK8/11
### What changes were proposed in this pull request?

This PR regenerate the benchmark results in `core` and `mllib` module in order to compare JDK8/JDK11 result.

### Why are the changes needed?

According to the result, For `PropertiesCloneBenchmark` and `UDTSerializationBenchmark`, JDK11 is slightly faster. In general, there is no regression in JDK11.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a test-only PR. Manually run the benchmark.

Closes #25969 from dongjoon-hyun/SPARK-29297.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-29 21:43:58 -07:00
Liang-Chi Hsieh dd92e15301 [SPARK-29186][SQL] AliasIdentifier should be converted to Json in prettyJson
### What changes were proposed in this pull request?

This patch adds AliasIdentifier to the list of classes that should be converted to Json in TreeNode.shouldConvertToJson.

### Why are the changes needed?

When asking prettyJson of an analyzed query plan which contains SubqueryAlias. The field of name of SubqueryAlias is "null", like:

```
[ {
   "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
   "num-children" : 1,
   "name" : null,
   "child" : 0
 }, {
   "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
 ...
```

Seems the alias name was in the Json before SPARK-19602.

It is fixed by this patch:

```
[ {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
  "num-children" : 1,
  "name" : {
    "product-class" : "org.apache.spark.sql.catalyst.AliasIdentifier",
    "identifier" : "t1"
  },
  "child" : 0
}, {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```

### Does this PR introduce any user-facing change?

Yes. This patch changes null value of name field of SubqueryAlias in Json string to the alias identifier.

### How was this patch tested?

Added unit test.

Closes #25959 from viirya/SPARK-29186.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-09-29 20:00:13 -07:00
Unknown 3ea9d6825b [SPARK-29019][WEBUI] Improve tooltip JDBC/ODBC Server tab
### What changes were proposed in this pull request?
Some of the columns of JDBC/ODBC server tab in Web UI are hard to understand.
We have documented it at SPARK-28373 but I think it is better to have some tooltips in the SQL statistics table to explain the columns
![image](https://user-images.githubusercontent.com/12819544/64489775-38e48980-d257-11e9-868a-5f5f6a0f1e46.png)
The columns with new tooltips are finish time, close time, execution time and duration
![image](https://user-images.githubusercontent.com/12819544/64489858-1141f100-d258-11e9-9e4e-fae3299da465.png)
Improvements in UIUtils can be used in other tables in WebUI to add tooltips

### Why are the changes needed?
It is interesting to improve the undestanding of the WebUI

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit tests are added and manual test.

Closes #25723 from planga82/feature/SPARK-29019_tooltipjdbcServer.

Lead-authored-by: Unknown <soypab@gmail.com>
Co-authored-by: Pablo <soypab@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-29 18:34:24 -05:00
Dongjoon Hyun df28671800 [SPARK-29282][TESTS] Use the same VM configurations for test/benchmark
### What changes were proposed in this pull request?

This PR aims to specify the JDK8 default configurations `-XX:+UseParallelGC -XX:-UseDynamicNumberOfGCThreads` explicitly. As we see in this PR [here](https://github.com/apache/spark/pull/25966/files#diff-12b89b7ee67c63c2254b749c8f8d0694R10), this will make the comparison between JDK8 and JDK11 easier by removing a misleading regression.

**NOTE THAT THESE JVM CONFS ARE ONLY FOR BENCHMARK COMPARISON, NOT FOR A PRODUCTION**

### Why are the changes needed?

There exists many JVM-level changes between JDK8 and JDK11. For example, the followings are notable changes and it turns out that especially (1) and (2) shows a misleading regression in our micro-benchmark environment because our microbenchmark uses small VM memory.
1. [JEP 248: Make G1 the Default Garbage Collector](https://bugs.openjdk.java.net/browse/JDK-8073273) **JDK9+**
2. [Enable UseDynamicNumberOfGCThreads by default](https://bugs.openjdk.java.net/browse/JDK-8198547) **JDK11+**
3. [Change default value of HeapSizePerGCThread](https://bugs.openjdk.java.net/browse/JDK-8200417) **JDK11+**

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a test-only JVM configuration change. Manually, run the benchmark.

Closes #25966 from dongjoon-hyun/SPARK-29282.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-29 15:11:46 -07:00
angerszhu 1d4b2f010b [SPARK-29247][SQL] Redact sensitive information in when construct HiveClientHive.state
### What changes were proposed in this pull request?

HiveClientImpl may be log sensitive information. e.g. url, secret and token:
```scala
      logDebug(
        s"""
           |Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
           |$k=${if (k.toLowerCase(Locale.ROOT).contains("password")) "xxx" else v}
         """.stripMargin)
```
So redact it.  Use SQLConf.get.redactOptions.

I add a new overloading function to fit this situation for one by one kv pair situation.

### Why are the changes needed?
Redact sensitive information when construct HiveClientImpl

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
MT

Run command
` /sbin/start-thriftserver.sh`

In log we can get
```
19/09/28 08:27:02 main DEBUG HiveClientImpl:
Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
hive.druid.metadata.password=*********(redacted)
```

Closes #25954 from AngersZhuuuu/SPARK-29247.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-29 14:30:32 -07:00
Yuming Wang 31700116d2 [SPARK-28476][SQL] Support ALTER DATABASE SET LOCATION
### What changes were proposed in this pull request?
Support the syntax of `ALTER (DATABASE|SCHEMA) database_name SET LOCATION` path. Please note that only Hive 3.x metastore support this syntax.

Ref:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://issues.apache.org/jira/browse/HIVE-8472

### Why are the changes needed?
Support more syntax.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #25883 from wangyum/SPARK-28476.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-29 11:31:49 -07:00
TomokoKomiyama 67d5b9b157 [SPARK-29172][SQL] Fix some exception issue of explain commands
### What changes were proposed in this pull request?
Added try exception

### Why are the changes needed?
The behaviors of run commands during exception handling are different depends on explain command. I think it should be unified.
[ >spark.sql("explain cost select * from hoge").show(false) ]
![cost](https://user-images.githubusercontent.com/55128575/65225389-09a80500-db00-11e9-9246-0f1a3a881595.png)

[ >spark.sql("explain extended select * from hoge").show(false) ]
![extemded](https://user-images.githubusercontent.com/55128575/65225430-188eb780-db00-11e9-99bf-ff550b2ffd12.png)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
tested manually

Closes #25848 from TomokoKomiyama/fix-explain.

Authored-by: TomokoKomiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-29 10:41:57 -05:00
Yuming Wang 8167714cab [SPARK-27831][FOLLOW-UP][SQL][TEST] Should not use maven to add Hive test jars
### What changes were proposed in this pull request?

This PR moves Hive test jars(`hive-contrib-*.jar` and `hive-hcatalog-core-*.jar`) from maven dependency to local file.

### Why are the changes needed?
`--jars` can't be tested since `hive-contrib-*.jar` and `hive-hcatalog-core-*.jar` are already in classpath.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test

Closes #25690 from wangyum/SPARK-27831-revert.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-28 16:55:49 -07:00
Maxim Gekk 2409320d8f [SPARK-29237][SQL][FOLLOWUP] Ignore SET commands in expression examples while checking the _FUNC_ pattern
### What changes were proposed in this pull request?

The `SET` commands do not contain the `_FUNC_` pattern a priori. In the PR, I propose filter out such commands in the `using _FUNC_ instead of function names in examples` test.

### Why are the changes needed?
After the merge of https://github.com/apache/spark/pull/25942, examples will require particular settings. Currently, the whole expression example has to be ignored which is so much. It makes sense to ignore only `SET` commands in expression examples.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

By running the `using _FUNC_ instead of function names in examples` test.

Closes #25958 from MaxGekk/dont-check-_FUNC_-in-set.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-29 08:51:47 +09:00
Jungtaek Lim (HeartSaVioR) 94946e4836 [SPARK-29281][SQL] Correct example of Like/RLike to test the origin intention correctly
### What changes were proposed in this pull request?

This patch fixes examples of Like/RLike to test its origin intention correctly. The example doesn't consider the default value of spark.sql.parser.escapedStringLiterals: it's false by default.

Please take a look at current example of Like:

d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)

If spark.sql.parser.escapedStringLiterals=false, then it should fail as there's `\U` in pattern (spark.sql.parser.escapedStringLiterals=false by default) but it doesn't fail.

```
The escape character is '\'. If an escape character precedes a special symbol or another
escape character, the following character is matched literally. It is invalid to escape
any other character.
```

For the query

```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%';
```

SQL parser removes single `\` (not sure that is intended) so the expressions of Like are constructed as following (I've printed out expression of left and right for Like/RLike):

> LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%`

which are no longer having origin intention (see left).

Below query tests the origin intention:

```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';
```

> LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%`

Note that `\\\\` is needed in pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent normal character of `\`.

Same for RLIKE:

```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*';
```

> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`

which is OK, but

```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
```

> RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.*`

which no longer haves origin intention.

Below query tests the origin intention:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*';
```

> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`

### Why are the changes needed?

Because the example doesn't test the origin intention. Spark is now running automated tests from these examples, so now it's not only documentation issue but also test issue.

### Does this PR introduce any user-facing change?

No, as it only corrects documentation.

### How was this patch tested?

Added debug log (like above) and ran queries from `spark-sql`.

Closes #25957 from HeartSaVioR/SPARK-29281.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-29 03:05:49 +09:00
Maxim Gekk ece4213176 [SPARK-21914][FOLLOWUP][TEST-HADOOP3.2][TEST-JAVA11] Clone SparkSession per each function example
### What changes were proposed in this pull request?
In the PR, I propose to clone Spark session per-each expression example. Examples can modify SQL settings, and can influence on each other if they run in the same Spark session in parallel.

### Why are the changes needed?
This should fix test failures like [this](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/478/testReport/junit/org.apache.spark.sql/SQLQuerySuite/check_outputs_of_expression_examples/) checking of the `Like` example:
```
org.apache.spark.sql.AnalysisException: the pattern '\%SystemDrive\%\Users%' is invalid, the escape character is not allowed to precede 'U';
      at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:48)
      at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:57)
      at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:108)
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `check outputs of expression examples` in `org.apache.spark.sql.SQLQuerySuite`

Closes #25956 from MaxGekk/fix-expr-examples-checks.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-29 02:57:55 +09:00