Commit graph

26519 commits

Author SHA1 Message Date
Tathagata Das 481e5211d2
[SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits
This PR solves two bugs related to streaming limits

**Bug 1 (SPARK-30658)**: Limit before a streaming aggregate (i.e. `df.limit(5).groupBy().count()`) in complete mode was not being planned as a stateful streaming limit. The planner rule planned a logical limit with a stateful streaming limit plan only if the query is in append mode. As a result, instead of allowing max 5 rows across batches, the planned streaming query was allowing 5 rows in every batch thus producing incorrect results.

**Solution**: Change the planner rule to plan the logical limit with a streaming limit plan even when the query is in complete mode if the logical limit has no stateful operator before it.

**Bug 2 (SPARK-30657)**: `LocalLimitExec` does not consume the iterator of the child plan. So if there is a limit after a stateful operator like streaming dedup in append mode (e.g. `df.dropDuplicates().limit(5)`), the state changes of streaming duplicate may not be committed (most stateful ops commit state changes only after the generated iterator is fully consumed).

**Solution**: Change the planner rule to always use a new `StreamingLocalLimitExec` which always fully consumes the iterator. This is the safest thing to do. However, this will introduce a performance regression as consuming the iterator is extra work. To minimize this performance impact, add an additional post-planner optimization rule to replace `StreamingLocalLimitExec` with `LocalLimitExec` when there is no stateful operator before the limit that could be affected by it.

No

Updated incorrect unit tests and added new ones

Closes #27373 from tdas/SPARK-30657.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-01-31 09:27:34 -08:00
yi.wu 5ccbb38a71 [SPARK-29938][SQL][FOLLOW-UP] Improve AlterTableAddPartitionCommand
All credit to Ngone51, Closes #27293.
### What changes were proposed in this pull request?
This PR improves `AlterTableAddPartitionCommand` by:
1. adds an internal config for partitions batch size to avoid hard code
2. reuse `InMemoryFileIndex.bulkListLeafFiles` to perform parallel file listing to improve code reuse

### Why are the changes needed?
Improve code quality.

### Does this PR introduce any user-facing change?
Yes. We renamed `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `spark.sql.parallelFileListingInCommands.enabled` as a side effect of this change.

### How was this patch tested?
Pass Jenkins.

Closes #27413 from xuanyuanking/SPARK-29938.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-01 01:03:00 +08:00
zebingl@fb.com 21bc0474bb [SPARK-30511][SPARK-28403][CORE] Don't treat failed/killed speculative tasks as pending in ExecutorAllocationManager
### What changes were proposed in this pull request?

Currently, when speculative tasks fail/get killed, they are still considered as pending and count towards the calculation of number of needed executors. To be more accurate: `stageAttemptToNumSpeculativeTasks(stageAttempt)` is incremented on onSpeculativeTaskSubmitted, but never decremented.  `stageAttemptToNumSpeculativeTasks -= stageAttempt` is performed on stage completion. **This means Spark is marking ended speculative tasks as pending, which leads to Spark to hold more executors that it actually needs!**

This PR fixes this issue by updating `stageAttemptToSpeculativeTaskIndices` and  `stageAttemptToNumSpeculativeTasks` on speculative tasks completion.  This PR also addresses some other minor issues: scheduler behavior after receiving an intentionally killed task event; try to address [SPARK-28403](https://issues.apache.org/jira/browse/SPARK-28403).

### Why are the changes needed?

This has caused resource wastage in our production with speculation enabled. With aggressive speculation, we found data skewed jobs can hold hundreds of idle executors with less than 10 tasks running.

An easy repro of the issue (`--conf spark.speculation=true --conf spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in cluster mode):
```
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
    Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
    Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
```
You will see when running the last task, we would be hold 38 executors (see below), which is exactly (152 + 3) / 4 = 38.
![image](https://user-images.githubusercontent.com/9404831/72469112-9a7fac00-3793-11ea-8f50-74d0ab7325a4.png)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added a comprehensive unit test.

Test with the above repro shows that we are holding 2 executors at the end
![image](https://user-images.githubusercontent.com/9404831/72469177-bbe09800-3793-11ea-850f-4a2c67142899.png)

Closes #27223 from linzebing/speculation_fix.

Authored-by: zebingl@fb.com <zebingl@fb.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-01-31 08:49:34 -06:00
Thomas Graves 3d2b8d8b13 [SPARK-30638][CORE] Add resources allocated to PluginContext
### What changes were proposed in this pull request?

Add the allocated resources to parameters to the PluginContext so that any plugins in driver or executor could use this information to initialize devices or use this information in a useful manner.

### Why are the changes needed?

To allow users to initialize/track devices once at the executor level before each task runs to use them.

### Does this PR introduce any user-facing change?

Yes to the people using the Executor/Driver plugin interface.

### How was this patch tested?

Unit tests and manually by writing a plugin that initialized GPU's using this interface.

Closes #27367 from tgravescs/pluginWithResources.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-01-31 08:25:32 -06:00
HyukjinKwon 6f4703e22e [SPARK-30690][DOCS][BUILD] Add CalendarInterval into API documentation
### What changes were proposed in this pull request?

We should also expose it in documentation as we marked it as unstable API as of SPARK-30547
Note that, seems Javadoc -> Scaladoc doesn't work but this PR does not target to fix.

### Why are the changes needed?

To show the documentation of API.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually built the docs via `jykill serve` under `docs` directory:

![Screen Shot 2020-01-31 at 4 04 15 PM](https://user-images.githubusercontent.com/6477701/73519315-12143300-4444-11ea-9260-070c9f672dde.png)

Closes #27412 from HyukjinKwon/SPARK-30547.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-31 22:50:01 +09:00
Burak Yavuz 290a528bff [SPARK-30615][SQL] Introduce Analyzer rule for V2 AlterTable column change resolution
### What changes were proposed in this pull request?

Adds an Analyzer rule to normalize the column names used in V2 AlterTable table changes. We need to handle all ColumnChange operations. We add an extra match statement for future proofing new changes that may be added. This prevents downstream consumers (e.g. catalogs) to deal about case sensitivity or check that columns exist, etc.

We also fix the behavior for ALTER TABLE CHANGE COLUMN (Hive style syntax) for adding comments to complex data types. Currently, the data type needs to be provided as part of the Hive style syntax. This assumes that the data type as changed when it may have not and the user only wants to add a comment, which fails in CheckAnalysis.

### Why are the changes needed?

Currently we do not handle case sensitivity correctly for ALTER TABLE ALTER COLUMN operations.

### Does this PR introduce any user-facing change?

No, fixes a bug.

### How was this patch tested?

Introduced v2CommandsCaseSensitivitySuite and added a test around HiveStyle Change columns to PlanResolutionSuite

Closes #27350 from brkyvz/normalizeAlter.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-31 16:41:10 +08:00
Huaxin Gao 6fac411076 [SPARK-29093][ML][PYSPARK][FOLLOW-UP] Remove duplicate setter
### What changes were proposed in this pull request?
remove duplicate setter in ```BucketedRandomProjectionLSH```

### Why are the changes needed?
Remove the duplicate ```setInputCol/setOutputCol``` in ```BucketedRandomProjectionLSH``` because these two setter are already in super class ```LSH```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually checked.

Closes #27397 from huaxingao/spark-29093.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 23:36:39 -08:00
herman a5c7090ffa [SPARK-30671][SQL] emptyDataFrame should use a LocalRelation
### What changes were proposed in this pull request?
This PR makes `SparkSession.emptyDataFrame` use an empty local relation instead of an empty RDD. This allows to optimizer to recognize this as an empty relation, and creates the opportunity to do some more aggressive optimizations.

### Why are the changes needed?
It allows us to optimize empty dataframes better.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a test case to `DataFrameSuite`.

Closes #27400 from hvanhovell/SPARK-30671.

Authored-by: herman <herman@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-31 16:14:07 +09:00
Dongjoon Hyun 05be81d69e [SPARK-30192][SQL][FOLLOWUP] Rename SINGLETON to INSTANCE
### What changes were proposed in this pull request?

This PR renames a variable `SINGLETON` to `INSTANCE`.

### Why are the changes needed?

This is a minor change for consistency with the other parts.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the existing tests.

Closes #27409 from dongjoon-hyun/SPARK-30192.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 22:51:51 -08:00
Burak Yavuz 1cd19ad92d [SPARK-30669][SS] Introduce AdmissionControl APIs for StructuredStreaming
### What changes were proposed in this pull request?

We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. A ReadLimit defines how much data should be read in the next micro-batch. `SupportsAdmissionControl` specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch or impose its own limit if the Trigger is Trigger.Once() for example.

We then use this interface in FileStreamSource, KafkaSource, and KafkaMicroBatchStream.

### Why are the changes needed?

Sources currently have no information around execution semantics such as whether the stream is being executed in Trigger.Once() mode. This interface will pass this information into the sources as part of planning. With a trigger like Trigger.Once(), the semantics are to process all the data available to the datasource in a single micro-batch. However, this semantic can be broken when data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate limit the amount of data read for that micro-batch without this interface.

### Does this PR introduce any user-facing change?

DataSource developers can extend this interface for their streaming sources to add admission control into their system and correctly support Trigger.Once().

### How was this patch tested?

Existing tests, as this API is mostly internal

Closes #27380 from brkyvz/rateLimit.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-01-30 22:02:48 -08:00
sandeep katta 5f3ec6250f [SPARK-30362][CORE] Update InputMetrics in DataSourceRDD
### What changes were proposed in this pull request?
Incase of DS v2 InputMetrics are not updated

**Before Fix**
![inputMetrics](https://user-images.githubusercontent.com/35216143/71501010-c216df00-288d-11ea-8522-fdd50b13eae1.png)

**After Fix** we can see that `Input Size / Records` is updated in the UI
![image](https://user-images.githubusercontent.com/35216143/71501000-b88d7700-288d-11ea-92fe-a727b2b79908.png)

### Why are the changes needed?
InputMetrics like bytesread and recordread should be updated

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added UT and also verified manually

Closes #27021 from sandeep-katta/dsv2inputmetrics.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-31 14:01:32 +08:00
Wenchen Fan 9f42be25eb [SPARK-29665][SQL] refine the TableProvider interface
### What changes were proposed in this pull request?

Instead of having several overloads of `getTable` method in `TableProvider`, it's better to have 2 methods explicitly: `inferSchema` and `inferPartitioning`. With a single `getTable` method that takes everything: schema, partitioning and properties.

This PR also adds a `supportsExternalMetadata` method in `TableProvider`, to indicate if the source support external table metadata. If this flag is false:
1. spark.read.schema... is disallowed and fails
2. when we support creating v2 tables in session catalog,  spark only keeps table properties in the catalog.

### Why are the changes needed?

API improvement.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26868 from cloud-fan/provider2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-31 13:37:43 +08:00
Jungtaek Lim (HeartSaVioR) ca3a64bffb [SPARK-30481][CORE][FOLLOWUP] Execute log compaction only when merge application listing is successful
### What changes were proposed in this pull request?

This PR fixes a couple of minor issues on SPARK-30481:

* SHS runs "compaction" regardless of the result of "merge application listing".

If "merge application listing" fails, most likely the application log will have some issue and "compaction" won't work properly then. We can just skip trying compaction when "merge application listing" fails.

* When "compaction" throws exception we don't handle it.

It's expected to swallow exception, but we don't even log the exception for now. It should be logged properly.

### Why are the changes needed?

Described in above section.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #27408 from HeartSaVioR/SPARK-30481-FOLLOWUP-MINOR-FIXES.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 21:04:08 -08:00
Jungtaek Lim (HeartSaVioR) cbb714f67e [SPARK-29438][SS] Use partition ID of StateStoreAwareZipPartitionsRDD for determining partition ID of state store in stream-stream join
### What changes were proposed in this pull request?

Credit to uncleGen for discovering the problem and providing simple reproducer as UT. New UT in this patch is borrowed from #26156 and I'm retaining a commit from #26156 (except unnecessary part on this path) to properly give a credit.

This patch fixes the issue that partition ID could be mis-assigned when the query contains UNION and stream-stream join is placed on the right side. We assume the range of partition IDs as `(0 ~ number of shuffle partitions - 1)` for stateful operators, but when we use stream-stream join on the right side of UNION, the range of partition ID of task goes to `(number of partitions in left side, number of partitions in left side + number of shuffle partitions - 1)`, which `number of partitions in left side` can be changed in some cases (new UT points out the one of the cases).

The root reason of bug is that stream-stream join picks the partition ID from TaskContext, which wouldn't be same as partition ID from source if union is being used. Hopefully we can pick the right partition ID from source in StateStoreAwareZipPartitionsRDD - this patch leverages that partition ID.

### Why are the changes needed?

This patch will fix the broken of assumption of partition range on stateful operator, as well as fix the issue reported in JIRA issue SPARK-29438.

### Does this PR introduce any user-facing change?

Yes, if their query is using UNION and stream-stream join is placed on the right side. They may encounter the problem to read state from checkpoint and may need to discard checkpoint to continue.

### How was this patch tested?

Added UT which fails on current master branch, and passes with this patch.

Closes #26162 from HeartSaVioR/SPARK-29438.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2020-01-30 20:21:43 -08:00
Shixiong Zhu f56ba37d8b
[SPARK-30656][SS] Support the "minPartitions" option in Kafka batch source and streaming source v1
### What changes were proposed in this pull request?

- Add `minPartitions` support for Kafka Streaming V1 source.
- Add `minPartitions` support for Kafka batch V1  and V2 source.
- There is lots of refactoring (moving codes to KafkaOffsetReader) to reuse codes.

### Why are the changes needed?

Right now, the "minPartitions" option only works in Kafka streaming source v2. It would be great that we can support it in batch and streaming source v1 (v1 is the fallback mode when a user hits a regression in v2) as well.

### Does this PR introduce any user-facing change?

Yep. The `minPartitions` options is supported in Kafka batch and streaming queries for both data source V1 and V2.

### How was this patch tested?

New unit tests are added to test "minPartitions".

Closes #27388 from zsxwing/kafka-min-partitions.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-01-30 18:14:50 -08:00
Liang-Chi Hsieh 5916c7d0d0 [SPARK-30673][SQL][TESTS] Test cases in HiveShowCreateTableSuite should create Hive table
### What changes were proposed in this pull request?

This patch makes the test cases in HiveShowCreateTableSuite create Hive table instead of data source table.

### Why are the changes needed?

Because SparkSQL now creates data source table if no provider is specified in SQL command, some test cases in HiveShowCreateTableSuite don't create Hive table, but data source table.

It is confusing and not good for the purpose of this test suite.

### Does this PR introduce any user-facing change?

No, only test case.

### How was this patch tested?

Unit test.

Closes #27393 from viirya/SPARK-30673.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2020-01-30 13:23:58 -08:00
Huaxin Gao f59685acaa [SPARK-30662][ML][PYSPARK] ALS/MLP extend HasBlockSize
### What changes were proposed in this pull request?
Make ALS/MLP extend ```HasBlockSize```

### Why are the changes needed?

Currently, MLP has its own ```blockSize``` param, we should make MLP extend ```HasBlockSize``` since ```HasBlockSize``` was added in ```sharedParams.scala``` recently.

ALS doesn't have ```blockSize``` param now, we can make it extend ```HasBlockSize```, so user can specify the ```blockSize```.

### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize``` and ```ALS.getBlockSize```
```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize```

### How was this patch tested?
Manually tested. Also added doctest.

Closes #27389 from huaxingao/spark-30662.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-30 13:13:10 -06:00
Wenchen Fan e5f572af06 [SPARK-30680][SQL] ResolvedNamespace does not require a namespace catalog
### What changes were proposed in this pull request?

Update `ResolvedNamespace` to accept catalog as `CatalogPlugin` not `SupportsNamespaces`.

This is extracted from https://github.com/apache/spark/pull/27345

### Why are the changes needed?

not all commands that need to resolve namespaces require a namespace catalog. For example, `SHOW TABLE` is implemented by `TableCatalog.listTables`, and is nothing to do with `SupportsNamespace`.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27403 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 10:34:59 -08:00
Wenchen Fan 7503e76af0 [SPARK-30622][SQL] commands should return dummy statistics
### What changes were proposed in this pull request?

override `Command.stats` to return a dummy statistics (Long.Max).

### Why are the changes needed?

Commands are eagerly executed. They will be converted to LocalRelation after the DataFrame is created. That said, the statistics of a command is useless. We should avoid unnecessary statistics calculation of command's children.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes #27344 from cloud-fan/command.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 10:27:35 -08:00
Kazuaki Ishizaki b0db6231fd [SPARK-29020][FOLLOWUP][SQL] Update description of array_sort function
### What changes were proposed in this pull request?

This PR is a follow-up of #25728. #25728 introduces additional arguments to determine sort order. Thus, this function does not sort only in ascending order. However, the description was not updated.
This PR updates the description to follow the latest feature.

### Why are the changes needed?

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests since this PR just updates description text.

Closes #27404 from kiszk/SPARK-29020-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 09:41:32 -08:00
Maxim Gekk a291433ed3 [SPARK-30678][MLLIB][TESTS] Eliminate warnings from deprecated BisectingKMeansModel.computeCost
### What changes were proposed in this pull request?
In the PR, I propose to replace deprecated method `computeCost` of `BisectingKMeansModel` by `summary.trainingCost`.

### Why are the changes needed?
The changes eliminate deprecation warnings:
```
BisectingKMeansSuite.scala:108: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn]     assert(model.computeCost(dataset) < 0.1)
BisectingKMeansSuite.scala:135: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn]     assert(model.computeCost(dataset) == summary.trainingCost)
BisectingKMeansSuite.scala:323: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn]       model.computeCost(dataset)
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `BisectingKMeansSuite` via:
```
./build/sbt "test:testOnly *BisectingKMeansSuite"
```

Closes #27401 from MaxGekk/kmeans-computeCost-warning.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 09:05:14 -08:00
zhengruifeng 073ce12543 [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
### What changes were proposed in this pull request?
1, use blocks instead of vectors
2, use Level-2 BLAS for binary, use Level-3 BLAS for multinomial

### Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (40% ~ 92%)

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
updated testsuites

Closes #27374 from zhengruifeng/blockify_lor.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-30 10:52:07 -06:00
Dongjoon Hyun 561e9b9688 [SPARK-30674][INFRA] Use python3 in dev/lint-python
### What changes were proposed in this pull request?

This PR aims to use `python3` instead of `python` in `dev/lint-python`.

### Why are the changes needed?

Currently, `dev/lint-python` fails at Python 2. And, Python 2 is EOL since January 1st 2020.
```
$ python -V
Python 2.7.17

$ dev/lint-python
starting python compilation test...
Python compilation failed with the following errors:
Compiling ./python/setup.py ...
  File "./python/setup.py", line 27
    file=sys.stderr)
        ^
SyntaxError: invalid syntax
```

### Does this PR introduce any user-facing change?

No. This is a dev environment.

### How was this patch tested?

Jenkins is running this with Python 3 already.
The following is a manual test.

```
$ python -V
Python 3.8.0

$ dev/lint-python
starting python compilation test...
python compilation succeeded.
```

Closes #27394 from dongjoon-hyun/SPARK-30674.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 03:17:29 -08:00
Nicholas Chammas bda0669110 [SPARK-30665][DOCS][BUILD][PYTHON] Eliminate pypandoc dependency
### What changes were proposed in this pull request?

This PR removes any dependencies on pypandoc. It also makes related tweaks to the docs README to clarify the dependency on pandoc (not pypandoc).

### Why are the changes needed?

We are using pypandoc to convert the Spark README from Markdown to ReST for PyPI. PyPI now natively supports Markdown, so we don't need pypandoc anymore. The dependency on pypandoc also sometimes causes issues when installing Python packages that depend on PySpark, as described in #18981.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually:

```sh
python -m venv venv
source venv/bin/activate
pip install -U pip

cd python/
python setup.py sdist
pip install dist/pyspark-3.0.0.dev0.tar.gz
pyspark --version
```

I also built the PySpark and R API docs with `jekyll` and reviewed them locally.

It would be good if a maintainer could also test this by creating a PySpark distribution and uploading it to [Test PyPI](https://test.pypi.org) to confirm the README looks as it should.

Closes #27376 from nchammas/SPARK-30665-pypandoc.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-30 16:40:38 +09:00
angerszhu 246c398d59 [SPARK-30435][DOC] Update doc of Supported Hive Features
### What changes were proposed in this pull request?

add supported hive features

### Why are the changes needed?
update doc

### Does this PR introduce any user-facing change?
Before change UI info:

![image](https://user-images.githubusercontent.com/46485123/72592726-29302c80-393e-11ea-8f4d-76432d4cb658.png)

After this pr:
![image](https://user-images.githubusercontent.com/46485123/72593569-42d27380-3940-11ea-91c7-f2998d476364.png)

![image](https://user-images.githubusercontent.com/46485123/72962218-afd98380-3dee-11ea-82a1-0bf533ebfd9f.png)

### How was this patch tested?
For PR about Spark Doc Web UI, we need to show UI format before and after pr.
We can build our local web server about spark docs with reference `$SPARK_PROJECT/docs/README.md`

You should install python and ruby in your env and also install plugin like below
```sh
$ sudo gem install jekyll jekyll-redirect-from rouge
# Following is needed only for generating API docs
$ sudo pip install sphinx pypandoc mkdocs
$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "rmarkdown"), repos="https://cloud.r-project.org/")'
$ sudo Rscript -e 'devtools::install_version("roxygen2", version = "5.0.1", repos="https://cloud.r-project.org/")'
$ sudo Rscript -e 'devtools::install_version("testthat", version = "1.0.2", repos="https://cloud.r-project.org/")'
```

Then we call  `jekyll serve --watch` after build we see below message
```
~/Documents/project/AngersZhu/spark/sql
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
            Source: /Users/angerszhu/Documents/project/AngersZhu/spark/docs
       Destination: /Users/angerszhu/Documents/project/AngersZhu/spark/docs/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
                    done in 24.717 seconds.
 Auto-regeneration: enabled for '/Users/angerszhu/Documents/project/AngersZhu/spark/docs'
    Server address: http://127.0.0.1:4000
  Server running... press ctrl-c to stop.
```

Visit   http://127.0.0.1:4000 to get your newest change in doc web.

Closes #27106 from AngersZhuuuu/SPARK-30435.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-29 20:55:29 -08:00
Nicholas Chammas c228810edc [SPARK-30672][BUILD] Add numpy to API docs readme
### What changes were proposed in this pull request?

This PR adds `numpy` to the list of things that need to be installed in order to build the API docs. It doesn't add a new dependency; it just documents an existing dependency.

### Why are the changes needed?

You cannot build the PySpark API docs without numpy installed. Otherwise you get this series of errors:

```
$ SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve
Configuration file: .../spark/docs/_config.yml
Moving to python/docs directory and building sphinx.
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v2.3.1
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 0 source files that are out of date
updating environment: 0 added, 2 changed, 0 removed
reading sources... [100%] pyspark.mllib
WARNING: autodoc: failed to import module 'ml' from module 'pyspark'; the following exception was raised:
No module named 'numpy'
WARNING: autodoc: failed to import module 'ml.param' from module 'pyspark'; the following exception was raised:
No module named 'numpy'
...
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually, by building the API docs with and without numpy.

Closes #27390 from nchammas/SPARK-30672-numpy-pyspark-docs.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-30 13:04:53 +09:00
zero323 b1f81f0072 [MINOR][SQL][DOCS] Fix typos in scaladoc strings of higher order functions
### What changes were proposed in this pull request?

Fix following typos:

- tranformation -> transformation
- the boolean -> the Boolean
- signle -> single

### Why are the changes needed?

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Scala linter.

Closes #27382 from zero323/functions-typos.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-29 18:42:18 -06:00
Thomas Graves e5c7f89082 [SPARK-30529][CORE] Improve error messages when Executor dies before registering with driver
…

### What changes were proposed in this pull request?

If the resource discovery goes bad, like it doesn't return enough GPUs,  currently it just throws an exception. This is hard for users to see because you have to go find the executor logs and its not reported back to the driver.  On yarn if you exit explicitly then the driver logs show the error thrown and its much more useful.  On yarn with the explicit exit with non-zero it also goes against failed executor launch attempts and the application will eventually exit. so if its fundamentally a bad configuration or bad discovery script it won't just hang forever.   I also tested on k8s and standalone and the behaviors there don't change, the executor cleanly exit with an error message in the logs. The standalone ui makes it easy to see failed executors.

### Why are the changes needed?

better user experience.

### Does this PR introduce any user-facing change?

no api changes

### How was this patch tested?

ran unit tests and manually tested on yarn, standalone, and k8s.

Closes #27385 from tgravescs/SPARK-30529.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-29 15:37:11 -08:00
uncleGen 7173786153
[SPARK-29543][SS][UI] Structured Streaming Web UI
### What changes were proposed in this pull request?

This PR adds two pages to Web UI for Structured Streaming:
   - "/streamingquery": Streaming Query Page, providing some aggregate information for running/completed streaming queries.
  - "/streamingquery/statistics": Streaming Query Statistics Page, providing detailed information for streaming query, including `Input Rate`, `Process Rate`, `Input Rows`, `Batch Duration` and `Operation Duration`

![Screen Shot 2020-01-29 at 1 38 00 PM](https://user-images.githubusercontent.com/1000778/73399837-cd01cc80-429c-11ea-9d4b-1d200a41b8d5.png)
![Screen Shot 2020-01-29 at 1 39 16 PM](https://user-images.githubusercontent.com/1000778/73399838-cd01cc80-429c-11ea-8185-4e56db6866bd.png)

### Why are the changes needed?

It helps users to better monitor Structured Streaming query.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

- new added and existing UTs
- manual test

Closes #26201 from uncleGen/SPARK-29543.

Lead-authored-by: uncleGen <hustyugm@gmail.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Genmao Yu <hustyugm@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-01-29 13:43:51 -08:00
Chandni Singh 6b47ace27d [SPARK-30512] Added a dedicated boss event loop group
### What changes were proposed in this pull request?
Adding a dedicated boss event loop group to the Netty pipeline in the External Shuffle Service to avoid the delay in channel registration.
```
   EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
      conf.getModuleName() + "-boss");
    EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, conf.serverThreads(),
    conf.getModuleName() + "-server");

    bootstrap = new ServerBootstrap()
      .group(bossGroup, workerGroup)
      .channel(NettyUtils.getServerChannelClass(ioMode))
      .option(ChannelOption.ALLOCATOR, allocator)
```

### Why are the changes needed?
We have been seeing a large number of SASL authentication (RPC requests) timing out with the external shuffle service.
```
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
	at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
	at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
	at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
	at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
	at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
```
The investigation that we have done is described here:
https://github.com/netty/netty/issues/9890

After adding `LoggingHandler` to the netty pipeline, we saw that the registration of the channel was getting delay which is because the worker threads are busy with the existing channels.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
We have tested the patch on our clusters and with a stress testing tool. After this change, we didn't see any SASL requests timing out. Existing unit tests pass.

Closes #27240 from otterc/SPARK-30512.

Authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-01-29 15:02:48 -06:00
Saurabh Chawla d0f635e3bc [SPARK-30582][WEBUI] Spark UI is not showing Aggregated Metrics by Executor in stage page
### What changes were proposed in this pull request?

There are scenarios where Spark History Server is located behind the VPC. So whenever api calls hit to get the executor Summary(allexecutors). There can be delay in getting the response of executor summary and in mean time "stage-page-template.html" is loaded and the response of executor Summary is not added to the stage-page-template.html.

As the result of which Aggregated Metrics by Executor in stage page is showing blank.

This scenario can be easily found in the cases when there is some proxy-server which is responsible for sending the request and response to spark History server.
This can be reproduced in Knox/In-house proxy servers which are used to send and receive response to Spark History Server.

Alternative scenario to test this case, Open the spark UI in developer mode in browser add some breakpoint in stagepage.js, this will add some delay in getting the response and now if we check the spark UI for stage Aggregated Metrics by Executor in stage page is showing blank.

So In-order to fix this there is a need to add the change in stagepage.js . There is a need to add the api call to get the html page(stage-page-template.html) first and after that other api calls to get the data that needs to attached in the stagepage (like executor Summary, stageExecutorSummaryInfoKeys exc)

### Why are the changes needed?
Since stage page is useful for debugging purpose, This helps in understanding how many task ran on the particular executor and information related to shuffle read and write on that executor.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually tested. Testing this in a reproducible way requires a running browser or HTML rendering engine that executes the JavaScript.Open the spark UI in developer mode in browser add some breakpoint in stagepage.js, this will add some delay in getting the response and now if we check the spark UI for stage Aggregated Metrics by Executor in stage page is showing blank.

Before fix

<img width="1529" alt="Screenshot 2020-01-20 at 3 21 55 PM" src="https://user-images.githubusercontent.com/34540906/72716739-bcfd3500-3b98-11ea-8dbe-90a135822f92.png">

After fix

<img width="1540" alt="Screenshot 2020-01-20 at 3 23 12 PM" src="https://user-images.githubusercontent.com/34540906/72716782-d30af580-3b98-11ea-8764-2bde77764604.png">

Closes #27292 from SaurabhChawla100/SPARK-30582.

Authored-by: Saurabh Chawla <saurabhc@qubole.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-29 08:49:45 -06:00
Dilip Biswal 3e203c985c [SPARK-28801][DOC][FOLLOW-UP] Setup links and address other review comments
### What changes were proposed in this pull request?

- Sets up links between related sections.
- Add "Related sections" for each section.
- Change to the left hand side menu to reflect the current status of the doc.
- Other minor cleanups.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Tested using jykyll build --serve

Closes #27371 from dilipbiswal/select_finalization.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-29 08:41:40 -06:00
Takeshi Yamamuro ec1fb6b4e1 [SPARK-30234][SQL][FOLLOWUP] Add .enabled in the suffix of the ADD FILE legacy option
### What changes were proposed in this pull request?

This pr intends to rename `spark.sql.legacy.addDirectory.recursive` into `spark.sql.legacy.addDirectory.recursive.enabled`.

### Why are the changes needed?

For consistent option names.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27372 from maropu/SPARK-30234-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-29 12:23:59 +09:00
zero323 298d0a5102 [SPARK-23435][SPARKR][TESTS] Update testthat to >= 2.0.0
### What changes were proposed in this pull request?

- Update `testthat` to >= 2.0.0
- Replace of `testthat:::run_tests` with `testthat:::test_package_dir`
- Add trivial assertions for tests, without any expectations, to avoid skipping.
- Update related docs.

### Why are the changes needed?

`testthat` version has been frozen by [SPARK-22817](https://issues.apache.org/jira/browse/SPARK-22817) / https://github.com/apache/spark/pull/20003, but 1.0.2 is pretty old, and we shouldn't keep things in this state forever.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

- Existing CI pipeline:
     - Windows build on AppVeyor, R 3.6.2, testthtat 2.3.1
     - Linux build on Jenkins, R 3.1.x, testthat 1.0.2

- Additional builds with thesthat 2.3.1  using [sparkr-build-sandbox](https://github.com/zero323/sparkr-build-sandbox) on c7ed64af9e697b3619779857dd820832176b3be3

   R 3.4.4  (image digest ec9032f8cf98)
   ```
   docker pull zero323/sparkr-build-sandbox:3.4.4
   docker run zero323/sparkr-build-sandbox:3.4.4 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
    ```
    3.5.3 (image digest 0b1759ee4d1d)

    ```
    docker pull zero323/sparkr-build-sandbox:3.5.3
    docker run zero323/sparkr-build-sandbox:3.5.3 zero323 --branch SPARK-23435 --commit
    c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
    ```

   and 3.6.2 (image digest 6594c8ceb72f)
    ```
   docker pull zero323/sparkr-build-sandbox:3.6.2
   docker run zero323/sparkr-build-sandbox:3.6.2 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
   ````

   Corresponding [asciicast](https://asciinema.org/) are available as 10.5281/zenodo.3629431

     [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3629431.svg)](https://doi.org/10.5281/zenodo.3629431)

   (a bit to large to burden asciinema.org, but can run locally via `asciinema play`).

----------------------------

Continued from #27328

Closes #27359 from zero323/SPARK-23435.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-29 10:37:08 +09:00
Jungtaek Lim (HeartSaVioR) a2fe73b83c [SPARK-30481][CORE] Integrate event log compactor into Spark History Server
### What changes were proposed in this pull request?

This patch addresses remaining functionality on event log compaction: integrate compaction into FsHistoryProvider.

This patch is next task of SPARK-30479 (#27164), please refer the description of PR #27085 to see overall rationalization of this patch.

### Why are the changes needed?

One of major goal of SPARK-28594 is to prevent the event logs to become too huge, and SPARK-29779 achieves the goal. We've got another approach in prior, but the old approach required models in both KVStore and live entities to guarantee compatibility, while they're not designed to do so.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT.

Closes #27208 from HeartSaVioR/SPARK-30481.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
2020-01-28 17:16:21 -08:00
Dongjoon Hyun 580c2b7e34 [SPARK-27166][SQL][FOLLOWUP] Refactor to build string once
### What changes were proposed in this pull request?

This is a follow-up for https://github.com/apache/spark/pull/24098 to refactor to build string once according to the [review comment](https://github.com/apache/spark/pull/24098#discussion_r369845234)

### Why are the changes needed?

Previously, we chose the minimal change way.
In this PR, we choose a more robust way than the previous post-step string processing.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The test case is extended with more cases.

Closes #27353 from dongjoon-hyun/SPARK-27166-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-28 12:48:16 -08:00
zhengruifeng 96d27274f5 [SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors
### What changes were proposed in this pull request?
1, stack input vectors to blocks (like ALS/MLP);
2, add new param `blockSize`;
3, add a new class `InstanceBlock`
4, standardize the input outside of optimization procedure;

### Why are the changes needed?
1, reduce RAM to persist traing dataset; (save ~40% in test)
2, use Level-2 BLAS routines; (12% ~ 28% faster, without native BLAS)

### Does this PR introduce any user-facing change?
a new param `blockSize`

### How was this patch tested?
existing and updated testsuites

Closes #27360 from zhengruifeng/blockify_svc.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-01-28 20:55:21 +08:00
Maxim Gekk 8aebc80e0e [SPARK-30625][SQL] Support escape as third parameter of the like function
### What changes were proposed in this pull request?
In the PR, I propose to transform the `Like` expression to `TernaryExpression`, and add third parameter `escape`. So, the `like` function will have feature parity with `LIKE ... ESCAPE` syntax supported by 187f3c1773.

### Why are the changes needed?
The `like` functions can be called with 2 or 3 parameters, and functionally equivalent to `LIKE` and `LIKE ... ESCAPE` SQL expressions.

### Does this PR introduce any user-facing change?
Yes, before `like` fails with the exception:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7
```
After:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
true
```

### How was this patch tested?
- Add new example for the `like` function which is checked by `SQLQuerySuite`
- Run `RegexpExpressionsSuite` and `ExpressionParserSuite`.

Closes #27355 from MaxGekk/like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 11:19:32 -08:00
Patrick Cording c5c580ba0d [SPARK-30633][SQL] Append L to seed when type is LongType
### What changes were proposed in this pull request?

Allow for using longs as seed for xxHash.

### Why are the changes needed?

Codegen fails when passing a seed to xxHash that is > 2^31.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests pass. Should more be added?

Closes #27354 from patrickcording/fix_xxhash_seed_bug.

Authored-by: Patrick Cording <patrick.cording@datarobot.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 10:32:15 -08:00
Jungtaek Lim (HeartSaVioR) 0436b3d3f8 [SPARK-30653][INFRA][SQL] EOL character enforcement for java/scala/xml/py/R files
### What changes were proposed in this pull request?

This patch converts CR/LF into LF in 3 source files, which most files are only using LF. This patch also add rules to enforce EOL as LF for all java, scala, xml, py, R files.

### Why are the changes needed?

The majority of source code files are using LF and only three files are CR/LF. While using IDE would let us don't bother with the difference, it still has a chance to make unnecessary diff if the file is modified with the editor which doesn't handle it automatically.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

```
grep -IUrl --color "^M" . | grep "\.java\|\.scala\|\.xml\|\.py\|\.R" | grep -v "/target/" | grep -v "/build/" | grep -v "/dist/" | grep -v "dependency-reduced-pom.xml" | grep -v ".pyc"
```

(Please note you'll need to type CTRL+V -> CTRL+M in bash shell to get `^M` because it's representing CR/LF, not a combination of `^` and `M`.)

Before the patch, the result is:

```
./sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java
./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala
```

and after the patch, the result is None.

And git shows WARNING message if EOL of any of source files in given types are modified to CR/LF, like below:

```
warning: CRLF will be replaced by LF in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala.
The file will have its original line endings in your working directory.
```

Closes #27365 from HeartSaVioR/MINOR-remove-CRLF-in-source-codes.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 10:20:51 -08:00
Dilip Biswal 8a24cf2bfe [SPARK-30588][DOC] Document CLUSTER BY Clause of SELECT statement in SQL Reference
### What changes were proposed in this pull request?
Document CLUSTER BY clause of SELECT statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="972" alt="Screen Shot 2020-01-20 at 2 59 05 PM" src="https://user-images.githubusercontent.com/14225158/72762704-7528de80-3b95-11ea-9d34-8fa0ab63d4c0.png">
<img width="972" alt="Screen Shot 2020-01-20 at 2 59 19 PM" src="https://user-images.githubusercontent.com/14225158/72762710-78bc6580-3b95-11ea-8279-2848d3b9e619.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #27297 from dilipbiswal/sql-ref-select-clusterby.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-27 08:59:48 -06:00
Dilip Biswal 5781e57127 [SPARK-30589][DOC] Document DISTRIBUTE BY Clause of SELECT statement in SQL Reference
### What changes were proposed in this pull request?
Document DISTRIBUTE BY clause of SELECT statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="972" alt="Screen Shot 2020-01-20 at 3 08 24 PM" src="https://user-images.githubusercontent.com/14225158/72763045-c08fbc80-3b96-11ea-8fb6-023cba5eb96a.png">
<img width="972" alt="Screen Shot 2020-01-20 at 3 08 34 PM" src="https://user-images.githubusercontent.com/14225158/72763047-c38aad00-3b96-11ea-80d8-cd3d2d4257c8.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #27298 from dilipbiswal/sql-ref-select-distributeby.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-27 08:58:45 -06:00
Dilip Biswal 7e1b991d12 [SPARK-30581][DOC] Document SORT BY Clause of SELECT statement in SQLReference
### What changes were proposed in this pull request?
Document SORT BY clause of SELECT statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="972" alt="Screen Shot 2020-01-20 at 1 25 57 AM" src="https://user-images.githubusercontent.com/14225158/72714701-00698c00-3b24-11ea-810e-28400e196ae9.png">
<img width="972" alt="Screen Shot 2020-01-20 at 1 26 11 AM" src="https://user-images.githubusercontent.com/14225158/72714706-02cbe600-3b24-11ea-9072-6d5e6f256400.png">
<img width="972" alt="Screen Shot 2020-01-20 at 1 26 28 AM" src="https://user-images.githubusercontent.com/14225158/72714712-07909a00-3b24-11ea-9aed-51b6bb0849f2.png">
<img width="972" alt="Screen Shot 2020-01-20 at 1 26 46 AM" src="https://user-images.githubusercontent.com/14225158/72714722-0a8b8a80-3b24-11ea-9fea-4d2a166e9d92.png">
<img width="972" alt="Screen Shot 2020-01-20 at 1 27 02 AM" src="https://user-images.githubusercontent.com/14225158/72714731-0f503e80-3b24-11ea-9f6d-8223e5d88c65.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #27289 from dilipbiswal/sql-ref-select-sortby.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-27 08:37:42 -06:00
Bryan Cutler 43d9c7e7e5 [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion
### What changes were proposed in this pull request?

Prevent unnecessary copies of data during conversion from Arrow to Pandas.

### Why are the changes needed?

During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/devarrow.apache.org/msg17008.html and ARROW-7596 for discussion.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests

Closes #27358 from BryanCutler/pyspark-pandas-timestamp-copy-fix-SPARK-30640.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-01-26 15:21:06 -08:00
Yuchen Huo d0800fc8e2 [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
### What changes were proposed in this pull request?

Add identifier and catalog information in DataSourceV2Relation so it would be possible to do richer checks in checkAnalysis step.

### Why are the changes needed?

In data source v2, table implementations are all customized so we may not be able to get the resolved identifier from tables them selves. Therefore we encode the table and catalog information in DSV2Relation so no external changes are needed to make sure this information is available.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests in the following suites:
CatalogManagerSuite.scala
CatalogV2UtilSuite.scala
SupportsCatalogOptionsSuite.scala
PlanResolutionSuite.scala

Closes #26957 from yuchenhuo/SPARK-30314.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-01-26 12:59:24 -08:00
Xiao Li 48f647882a [SPARK-30644][SQL][TEST] Remove query index from the golden files of SQLQueryTestSuite
### What changes were proposed in this pull request?

This PR is to remove query index from the golden files of SQLQueryTestSuite

### Why are the changes needed?

Because the SQLQueryTestSuite's golden files have the query index for each query, removal of any query statement [except the last one] will generate many unneeded difference. This will make code review harder. The number of changed lines is misleading.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #27361 from gatorsmile/removeIndexNum.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 23:17:36 -08:00
Xiao Li d69ed9afdf Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp"
This reverts commit 1d20d13149.

Closes #27351 from gatorsmile/revertSPARK25496.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 21:34:12 -08:00
zero323 c3f17fa039 [SPARK-29777][FOLLOW-UP][SPARKR] Remove no longer valid test for recursive calls
### What changes were proposed in this pull request?

Disabling test for cleaning closure of recursive function.

### Why are the changes needed?

As of 9514b822a7 this test is no longer valid, and recursive calls, even simple ones:

```lead
  f <- function(x) {
    if(x > 0) {
      f(x - 1)
    } else {
      x
    }
  }
```

lead to

```
Error: node stack overflow
```

This is issue is silenced when tested with `testthat` 1.x (reason unknown), but cause failures when using `testthat` 2.x (issue can be reproduced outside test context).

Problem is known and tracked by [SPARK-30629](https://issues.apache.org/jira/browse/SPARK-30629)

Therefore, keeping this test active doesn't make sense, as it will lead to continuous test failures, when `testthat` is updated (https://github.com/apache/spark/pull/27359 / SPARK-23435).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

CC falaki

Closes #27363 from zero323/SPARK-29777-FOLLOWUP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 21:16:22 -08:00
zero323 40b1f4d87e [SPARK-30645][SPARKR][TESTS][WINDOWS] Move Unicode test data to external file
### What changes were proposed in this pull request?

Reference data for "collect() support Unicode characters" has been moved to an external file, to make test OS and locale independent.

### Why are the changes needed?

As-is, embedded data is not properly encoded on Windows:

```
library(SparkR)
SparkR::sparkR.session()
Sys.info()
#           sysname           release           version
#         "Windows"      "Server x64"     "build 17763"
#          nodename           machine             login
# "WIN-5BLT6Q610KH"          "x86-64"   "Administrator"
#              user    effective_user
#   "Administrator"   "Administrator"

Sys.getlocale()

# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")

system(paste0("cat ", jsonPath))
# {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
# {"name":"<U+60A8><U+597D>", "age":30}
# {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
# {"name":"Xin chào"}
# [1] 0

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(lines, jsonPath)

df <- read.df(jsonPath, "json")

printSchema(df)
# root
#  |-- _corrupt_record: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

head(df)
#              _corrupt_record age                                     name
# 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
# 2                       <NA>  30                         <U+60A8><U+597D>
# 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
# 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>
```
This can be reproduced outside tests (Windows Server 2019, English locale), and causes failures, when `testthat` is updated to 2.x (https://github.com/apache/spark/pull/27359). Somehow problem is not picked-up when test is executed on `testthat` 1.0.2.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Running modified test, manual testing.

### Note

Alternative seems to be to used bytes, but it hasn't been properly tested.

```
test_that("collect() support Unicode characters", {

  lines <- markUtf8(c(
    '{"name": "안녕하세요"}',
    '{"name": "您好", "age": 30}',
    '{"name": "こんにちは", "age": 19}',
    '{"name": "Xin ch\xc3\xa0o"}'
  ))

  jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
  writeLines(lines, jsonPath, useBytes = TRUE)

  expected <- regmatches(lines, regexec('(?<="name": ").*?(?=")', lines, perl = TRUE))

  df <- read.df(jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))

  rdf$name <- markUtf8(rdf$name)
  expect_equal(rdf$name[1], expected[[1]])
  expect_equal(rdf$name[2], expected[[2]])
  expect_equal(rdf$name[3], expected[[3]])
  expect_equal(rdf$name[4], expected[[4]])

  df1 <- createDataFrame(rdf)
  expect_equal(
    collect(
      where(df1, df1$name == expected[[2]])
    )$name,
    expected[[2]]
  )
})
```

Closes #27362 from zero323/SPARK-30645.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-26 12:59:53 +09:00
Dilip Biswal d5b92b24c4 [SPARK-30579][DOC] Document ORDER BY Clause of SELECT statement in SQL Reference
### What changes were proposed in this pull request?
Document ORDER BY clause of SELECT statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="972" alt="Screen Shot 2020-01-19 at 11 50 57 PM" src="https://user-images.githubusercontent.com/14225158/72708034-ac0bdf80-3b16-11ea-81f3-48d8087e4e98.png">
<img width="972" alt="Screen Shot 2020-01-19 at 11 51 14 PM" src="https://user-images.githubusercontent.com/14225158/72708042-b0d09380-3b16-11ea-939e-905b8c031608.png">
<img width="972" alt="Screen Shot 2020-01-19 at 11 51 33 PM" src="https://user-images.githubusercontent.com/14225158/72708050-b4fcb100-3b16-11ea-95d2-e4e302cace1b.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #27288 from dilipbiswal/sql-ref-select-orderby.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-01-26 09:11:33 +09:00