2547 commits

Author SHA1 Message Date
Emil Sandstø 0bdadba5e3 [SPARK-29790][DOC] Note required port for Kube API
It adds a note about the required port of a master url in Kubernetes.

Currently, a port must be specified for the Kubernetes API, even when the API is hosted on the HTTPS port. Otherwise the driver might fail with the error described at https://medium.com/kidane.weldemariam_75349/thanks-james-on-issuing-spark-submit-i-run-into-this-error-cc507d4f8f0d

Yes, a change to the "Running on Kubernetes" guide.

None - Documentation change

Closes #26426 from Tapped/patch-1.

Authored-by: Emil Sandstø <emilalexer@hotmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-11-08 09:33:07 -08:00
HyukjinKwon 4ec04e5ef3 [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's
## What changes were proposed in this pull request?

This PR proposes to add **Single threading model design (pinned thread model)** mode which is an experimental mode to sync threads on PVM and JVM. See https://www.py4j.org/advanced_topics.html#using-single-threading-model-pinned-thread

### Multi threading model

Currently, PySpark uses this model. Threads on PVM and JVM are independent. For instance, in a different Python thread, callbacks are received and relevant Python codes are executed. JVM threads are reused when possible.

Py4J will create a new thread every time a command is received and there is no thread available. See the current model we're using - https://www.py4j.org/advanced_topics.html#the-multi-threading-model

One problem in this model is that we can't sync threads on PVM and JVM out of the box. This leads to some problems in particular at some codes related to threading in JVM side. See:
7056e004ee/core/src/main/scala/org/apache/spark/SparkContext.scala (L334)
Because JVM threads are reused, it seems that the job groups cannot be set per Python thread as described in the JIRA.

### Single threading model design (pinned thread model)

This mode pins and syncs the threads on PVM and JVM to work around the problem above. For instance, in the same Python thread, callbacks are received and relevant Python codes are executed. See https://www.py4j.org/advanced_topics.html#the-single-threading-model

Even though this mode can sync threads on PVM and JVM for other thread-related code paths, it might cause another problem: local properties do not seem to be inherited, as shown below. (Assuming multi-thread mode still creates new threads when existing threads are busy, I suspect this issue already exists when multiple jobs are submitted in multi-thread mode; however, it can always be seen in single threading mode.)

```bash
$ PYSPARK_PIN_THREAD=true ./bin/pyspark
```

```python
import threading

spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
    print(spark.sparkContext.getLocalProperty("a"))

threading.Thread(target=print_prop).start()
```

```
None
```

Unlike Scala side:

```scala
spark.sparkContext.setLocalProperty("a", "hi")
new Thread(new Runnable {
  def run() = println(spark.sparkContext.getLocalProperty("a"))
}).start()
```

```
hi
```

This behaviour could potentially cause weird issues, but this PR does not attempt to fix it for now since this mode is experimental.

### How does this PR fix?

Basically there are two types of Py4J servers: `GatewayServer` and `ClientServer`. The former is for multi threading and the latter is for single threading. This PR adds a switch to use the latter.

In Scala side:
The logic to select a server is encapsulated in `Py4JServer`, which is used by `PythonRunner` for Spark submit and by `PythonGatewayServer` for the Spark shell. Each uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise.

In Python side:
A simple if-else switches the server to talk to. It uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise.

This is disabled by default for now.
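
The Python-side switch described above can be pictured as a small if-else on the environment variable. This is a minimal sketch, not PySpark's actual launcher code: `select_py4j_server` is a hypothetical helper, it returns the Py4J class name as a string instead of constructing a server, and the case-insensitive comparison against "true" is an assumption.

```python
import os


def select_py4j_server(environ=None):
    """Return the name of the Py4J server class to use (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # Assumption: the flag is enabled only by the value "true",
    # compared case-insensitively; anything else keeps the default.
    pinned = env.get("PYSPARK_PIN_THREAD", "false").lower() == "true"
    # ClientServer = single threading (pinned thread) model,
    # GatewayServer = multi threading model (the default).
    return "ClientServer" if pinned else "GatewayServer"


print(select_py4j_server({"PYSPARK_PIN_THREAD": "true"}))
print(select_py4j_server({}))
```

Passing a plain dict instead of reading `os.environ` directly keeps the selection logic easy to check in isolation.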

## How was this patch tested?

Manually tested. This can be tested via:

```bash
PYSPARK_PIN_THREAD=true ./bin/pyspark
```

and/or

```bash
cd python
./run-tests --python-executables=python --testnames "pyspark.tests.test_pin_thread"
```

Also, ran the Jenkins tests with `PYSPARK_PIN_THREAD` enabled.

Closes #24898 from HyukjinKwon/pinned-thread.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-08 06:44:58 +09:00
Wenchen Fan 9b61f90987 [SPARK-29761][SQL] do not output leading 'interval' in CalendarInterval.toString
### What changes were proposed in this pull request?

remove the leading "interval" in `CalendarInterval.toString`.

### Why are the changes needed?

Although the "interval" prefix is allowed when casting a string to an interval, it's not recommended.

This is also consistent with pgsql:
```
cloud0fan=# select interval '1' day;
 interval
----------
 1 day
(1 row)
```

### Does this PR introduce any user-facing change?

Yes. When a dataframe with an interval type column is displayed, the result is different.

### How was this patch tested?

updated tests.

Closes #26401 from cloud-fan/interval.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-07 15:44:50 +08:00
Kent Yao 4615769736 [SPARK-29603][YARN] Support application priority for YARN priority scheduling
### What changes were proposed in this pull request?

This PR adds an application priority that YARN uses to define the ordering policy for pending applications; those with higher priority have a better chance of being activated. It applies to the YARN CapacityScheduler only.

### Why are the changes needed?

Ordering pending spark apps
### Does this PR introduce any user-facing change?

add a conf
### How was this patch tested?

add ut

Closes #26255 from yaooqinn/SPARK-29603.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-11-06 10:12:27 -08:00
Aman Omer 0dcd739534 [SPARK-29462] The data type of "array()" should be array<null>
### What changes were proposed in this pull request?
During creation of an array, if CreateArray does not get any children from which to set the array's data type, it will create an array of null type.

### Why are the changes needed?
When empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tested manually

Closes #26324 from amanomer/29462.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-06 18:39:46 +09:00
Alessandro Bellina 3cb18d90c4 [SPARK-29151][CORE] Support fractional resources for task resource scheduling
### What changes were proposed in this pull request?
This PR adds the ability for tasks to request fractional resources, in order to be able to execute more than 1 task per resource. For example, if you have 1 GPU in the executor, and the task configuration is 0.5 GPU/task, the executor can schedule two tasks to run on that 1 GPU.
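
The arithmetic in the GPU example can be sketched as follows. This is a simplified model, not Spark's scheduler code: `tasks_per_resource` is a hypothetical helper, and flooring the reciprocal of the per-task amount is an assumption about how leftover capacity is treated.

```python
import math


def tasks_per_resource(amount_per_task):
    """How many tasks can share one resource unit (simplified model)."""
    if not 0 < amount_per_task <= 1:
        raise ValueError("this sketch only covers amounts in (0, 1]")
    # As many tasks as fit into a single unit; any remainder is unused.
    return math.floor(1 / amount_per_task)


# 1 GPU in the executor and 0.5 GPU per task -> two tasks share that GPU.
print(tasks_per_resource(0.5))
```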

### Why are the changes needed?
Currently there is no good way to share a resource such that multiple tasks can run on a single unit. This allows multiple tasks to share an executor resource.

### Does this PR introduce any user-facing change?
Yes: There is a configuration change where `spark.task.resource.[resource type].amount` can now be fractional.

### How was this patch tested?
Unit tests and manually on standalone mode, and yarn.

Closes #26078 from abellina/SPARK-29151.

Authored-by: Alessandro Bellina <abellina@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2019-11-05 08:57:43 -06:00
Jungtaek Lim (HeartSaVioR) ba2bc4b0e0 [SPARK-20568][SS] Provide option to clean up completed files in streaming query
## What changes were proposed in this pull request?

This patch adds the option to clean up files which are completed in previous batch.

`cleanSource` -> "archive" / "delete" / "off"

The default value is "off", in which case Spark does nothing.

If "delete" is specified, Spark will simply delete input files. If "archive" is specified, Spark will require an additional config, `sourceArchiveDir`, which names the directory that input files are moved to. When archiving (via move), the path of each input file is retained as a sub-path under the archive directory.
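
To illustrate the sub-path rule, here is a minimal sketch of how an archive destination could be derived. It is not the code from this patch: `archived_path` is a hypothetical helper, the paths are made up, and URI schemes such as `hdfs://` are ignored.

```python
import posixpath


def archived_path(source_archive_dir, input_file_path):
    """Build the archive destination, keeping the input path as a sub-path."""
    # Strip the leading "/" so the original path nests under the archive dir.
    relative = input_file_path.lstrip("/")
    return posixpath.join(source_archive_dir, relative)


# Hypothetical paths, for illustration only.
print(archived_path("/archive", "/data/in/part-0001.json"))
```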

Note that it is only applied to "micro-batch", since for batch all input files must be kept to get same result across multiple query executions.

## How was this patch tested?

Added UT. Manual test against local disk as well as HDFS.

Closes #22952 from HeartSaVioR/SPARK-20568.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-11-04 15:16:10 -08:00
Marcelo Vanzin d51d228048 [SPARK-29397][CORE] Extend plugin interface to include the driver
Spark 2.4 added the ability for executor plugins to be loaded into
Spark (see SPARK-24918). That feature intentionally skipped the
driver to keep changes small, and also because it is possible to
load code into the Spark driver using listeners + configuration.

But that is a bit awkward, because the listener interface does not
provide hooks into a lot of Spark functionality. This change reworks
the executor plugin interface to also extend to the driver.

- there's a "SparkPlugin" main interface that provides APIs to
  load driver and executor components.
- custom metric support (added in SPARK-28091) can be used by
  plugins to register metrics both in the driver process and in
  executors.
- a communication channel now exists that allows the plugin's
  executor components to send messages to the plugin's driver
  component easily, using the existing Spark RPC system.

The latter was a feature intentionally left out of the original
plugin design (also because it didn't include a driver component).

To avoid polluting the "org.apache.spark" namespace, I added the new
interfaces to the "org.apache.spark.api" package, which seems like
a better place in any case. The actual implementation is kept in
an internal package.

The change includes unit tests for the new interface and features,
but I've also been running a custom plugin that extends the new
API in real applications.

Closes #26170 from vanzin/SPARK-29397.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-11-04 14:33:17 -08:00
shivusondur eee45f83c6 [SPARK-28809][DOC][SQL] Document SHOW TABLE in SQL Reference
### What changes were proposed in this pull request?
Added the document reference for SHOW TABLE EXTENDED sql command

### Why are the changes needed?
For User reference

### Does this PR introduce any user-facing change?
yes, it provides document reference for SHOW TABLE EXTENDED sql command

### How was this patch tested?
verified in snap
<details>
<summary> Attached the Snap</summary>

![image](https://user-images.githubusercontent.com/7912929/68142029-b4f80680-ff54-11e9-99a0-f39f2dac09e4.png)
![image](https://user-images.githubusercontent.com/7912929/64019738-95f08900-cb4d-11e9-9769-ee2be926fdc1.png)
![image](https://user-images.githubusercontent.com/7912929/64019775-ab65b300-cb4d-11e9-9e7e-140616af7790.png)
![image](https://user-images.githubusercontent.com/7912929/67963910-65000380-fc25-11e9-9cd0-8ee43bf206b1.png)
</details>

Closes #25632 from shivusondur/jiraSHOWTABLE.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-04 11:58:41 -06:00
shivusondur f29a979e42 [SPARK-28798][DOC][SQL] Document DROP TABLE/VIEW statement in SQL Reference
### What changes were proposed in this pull request?
Added doc for DROP TABLE and DROP VIEW sql command

### Why are the changes needed?
For reference DROP TABLE  or DROP VIEW in spark-sql

### Does this PR introduce any user-facing change?
It updates DROP TABLE  or DROP VIEW reference doc

### How was this patch tested?
<details>
<summary> Attached the Snap</summary>

DROP TABLE

![image](https://user-images.githubusercontent.com/7912929/67884038-2443b400-fb6b-11e9-9773-b21dae398789.png)
![image](https://user-images.githubusercontent.com/7912929/67797387-aa96c200-faa7-11e9-90d4-fa8b7c6a4ec7.png)

DROP VIEW
![image](https://user-images.githubusercontent.com/7912929/67797463-c306dc80-faa7-11e9-96ec-e2f2e89d0db8.png)
![image](https://user-images.githubusercontent.com/7912929/67797648-1ed16580-faa8-11e9-9d32-19106326e3d9.png)

</details>

Closes #25533 from shivusondur/jiraUSEDB.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-04 11:52:19 -06:00
Wenchen Fan 31ae446e9c [SPARK-29623][SQL] do not allow multiple unit TO unit statements in interval literal syntax
### What changes were proposed in this pull request?

re-arrange the parser rules to make it clear that multiple unit TO unit statement like `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH` is not allowed.

### Why are the changes needed?

It is definitely an accident that we supported such a weird syntax in the past. It's not supported by any other DBs and I can't think of any use case for it. Also, no test covers this syntax in the current codebase.

### Does this PR introduce any user-facing change?

Yes, and a migration guide item is added.

### How was this patch tested?

new tests.

Closes #26285 from cloud-fan/syntax.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-02 21:35:56 +08:00
Terry Kim 3175f4bf1b [SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala
### What changes were proposed in this pull request?

This PR changes the behavior of `Column.getItem` to call `Column.getItem` on Scala side instead of `Column.apply`.

### Why are the changes needed?

The current behavior is not consistent with that of Scala.

In PySpark:
```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```
In Scala:
```Scala
val df = spark.range(2)
val map_col = map(lit(0), lit(100), lit(1), lit(200))
// The following getItem results in the following exception, which is the right behavior:
// java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id
//  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
//  at org.apache.spark.sql.Column.getItem(Column.scala:856)
//  ... 49 elided
df.withColumn("mapped", map_col.getItem(col("id"))).show
```

### Does this PR introduce any user-facing change?

Yes. If the user wants to pass a `Column` object to `getItem`, they now need to use the indexing operator to achieve the previous behavior.

```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col[col('id')]).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```

### How was this patch tested?

Existing tests.

Closes #26351 from imback82/spark-29664.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-01 12:25:48 +09:00
Chris Martin c29494377b [SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide
This PR adds some extra documentation for the new Cogrouped map Pandas udfs.  Specifically:

- Updated the usage guide for the new `COGROUPED_MAP` Pandas udfs added in https://github.com/apache/spark/pull/24981
- Updated the docstring for pandas_udf to include the COGROUPED_MAP type as suggested by HyukjinKwon in https://github.com/apache/spark/pull/25939

Closes #26110 from d80tb7/SPARK-29126-cogroup-udf-usage-guide.

Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-31 10:41:57 +09:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the Maven release repository accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Dongjoon Hyun d417113c25 [SPARK-29668][DOCS] Deprecate Python 3 prior to version 3.6
### What changes were proposed in this pull request?

This PR aims to additionally deprecate `Python 3.4 ~ 3.5`, i.e. Python 3 versions prior to 3.6.

### Why are the changes needed?

Since `Python 3.8` is already out, we will focus on supporting Python 3.6/3.7/3.8.

### Does this PR introduce any user-facing change?

Yes. It's highly recommended to use Python 3.6/3.7. We will verify Python 3.8 before Apache Spark 3.0.0 release.

### How was this patch tested?

NA (This is a doc-only change).

Closes #26326 from dongjoon-hyun/SPARK-29668.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 12:31:23 -07:00
uncleGen 5f1ef544f3 [MINOR][DOCS] Use proper html tag in markdown
### What changes were proposed in this pull request?
This PR fixes the docs to use proper HTML tags.

### Why are the changes needed?

Fix documentation format error.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26302 from uncleGen/minor-doc.

Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-30 15:30:58 +09:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the Maven release repository accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
redsk 8bd8f492ea [SPARK-29500][SQL][SS] Support partition column when writing to Kafka
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-29500

`KafkaRowWriter` now supports setting the Kafka partition by reading a "partition" column in the input dataframe.

Code changes in commit nr. 1.
Test changes in commit nr. 2.
Doc changes in commit nr. 3.

tcondie dongjinleekr srowen

### Why are the changes needed?
While it is possible to configure a custom Kafka Partitioner with
`.option("kafka.partitioner.class", "my.custom.Partitioner")`, this is not enough for certain use cases. See the Jira issue.

### Does this PR introduce any user-facing change?
No, as this behaviour is optional.

### How was this patch tested?
Two new UT were added and one was updated.

Closes #26153 from redsk/feature/SPARK-29500.

Authored-by: redsk <nicola.bova@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-25 08:06:36 -05:00
Dongjoon Hyun 7417c3e7d5 [SPARK-29597][DOCS] Deprecate old Java 8 versions prior to 8u92
### What changes were proposed in this pull request?

This PR aims to deprecate old Java 8 versions prior to 8u92.

### Why are the changes needed?

This is a preparation to use JVM Option `ExitOnOutOfMemoryError`.
- https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html

### Does this PR introduce any user-facing change?

Yes. It's highly recommended for users to use the latest JDK versions of Java 8/11.

### How was this patch tested?

NA (This is a doc change).

Closes #26249 from dongjoon-hyun/SPARK-29597.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 20:51:31 -07:00
Pavithra Ramachandran 1ec1b2bd17 [SPARK-28791][DOC] Documentation for Alter table Command
What changes were proposed in this pull request?
Document ALTER TABLE statement in SQL Reference Guide.

Why are the changes needed?
Adding documentation for SQL reference.

Does this PR introduce any user-facing change?
yes

Before:
There was no documentation for this.

After.
![1](https://user-images.githubusercontent.com/51401130/65674372-1087c800-e06a-11e9-9155-ac70b419b069.png)
![2](https://user-images.githubusercontent.com/51401130/65674384-14b3e580-e06a-11e9-9c57-bca566dfdbc2.png)
![3](https://user-images.githubusercontent.com/51401130/65674391-18e00300-e06a-11e9-950a-6cc948dedd7d.png)
![4](https://user-images.githubusercontent.com/51401130/65674397-1bdaf380-e06a-11e9-87b0-b1523a745f83.png)
![5](https://user-images.githubusercontent.com/51401130/65674406-209fa780-e06a-11e9-8440-7e8105a77117.png)
![6](https://user-images.githubusercontent.com/51401130/65674417-23020180-e06a-11e9-8fff-30511836bb08.png)

How was this patch tested?
Used jekyll build and serve to verify.

Closes #25590 from PavithraRamachandran/alter_doc.

Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-24 08:19:03 -05:00
HyukjinKwon df00b5c17d [SPARK-29569][BUILD][DOCS] Copy and paste minified jquery instead when post-processing badges in JavaDoc
### What changes were proposed in this pull request?

This PR fixes our documentation build to copy minified jquery file instead.

The original file `jquery.js` seems to be missing as of the Scala 2.12 upgrade. Scala 2.12 seems to have started using the minified `jquery.min.js` instead.

Since we dropped Scala 2.11, we won't have to take care about legacy `jquery.js` anymore.

Note that there seem to be multiple oddities in the current ScalaDoc (e.g., some pages look wrong, the index starts from `scala.collection.*`, some pages are missing, some docs are truncated, and some badges appear to be missing). This needs a separate double check and investigation.

This PR aims to make the documentation generation pass in order to unblock the Spark 3.0 preview.

### Why are the changes needed?

To fix and make our official documentation build able to run.

### Does this PR introduce any user-facing change?

It makes it possible to build the documentation in our official way.

**Before:**

```
Making directory api/scala
cp -r ../target/scala-2.12/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Updating JavaDoc files for badge post-processing
Copying jquery.js from Scala API to Java API for page post-processing of badges
jekyll 3.8.6 | Error:  No such file or directory  rb_sysopen - ./api/scala/lib/jquery.js
```

**After:**

```
Making directory api/scala
cp -r ../target/scala-2.12/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Updating JavaDoc files for badge post-processing
Copying jquery.min.js from Scala API to Java API for page post-processing of badges
Copying api_javadocs.js to Java API for page post-processing of badges
Appending content of api-javadocs.css to JavaDoc stylesheet.css for badge styles
...
```

### How was this patch tested?

Manually tested via:

```
SKIP_PYTHONDOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```

Closes #26228 from HyukjinKwon/SPARK-29569.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-23 15:23:25 +02:00
Terry Kim c128ac564d [SPARK-29511][SQL] DataSourceV2: Support CREATE NAMESPACE
### What changes were proposed in this pull request?

This PR adds `CREATE NAMESPACE` support for V2 catalogs.

### Why are the changes needed?

Currently, you cannot explicitly create namespaces for v2 catalogs.

### Does this PR introduce any user-facing change?

The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
```
to create a namespace `ns` inside `mycatalog` V2 catalog.

### How was this patch tested?

Added unit tests.

Closes #26166 from imback82/create_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-23 12:17:20 +08:00
Dilip Biswal c1c64851ed [SPARK-28793][DOC][SQL] Document CREATE FUNCTION in SQL Reference
### What changes were proposed in this pull request?
Document CREATE FUNCTION statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="1260" alt="Screen Shot 2019-09-22 at 3 01 52 PM" src="https://user-images.githubusercontent.com/14225158/65395036-5bdc6680-dd4a-11e9-9873-0a1da88706a8.png">
<img width="1260" alt="Screen Shot 2019-09-22 at 3 02 11 PM" src="https://user-images.githubusercontent.com/14225158/65395037-5bdc6680-dd4a-11e9-964f-c02d23803b68.png">
<img width="1260" alt="Screen Shot 2019-09-22 at 3 02 39 PM" src="https://user-images.githubusercontent.com/14225158/65395038-5bdc6680-dd4a-11e9-831b-6ba1d041893d.png">
<img width="1260" alt="Screen Shot 2019-09-22 at 3 04 04 PM" src="https://user-images.githubusercontent.com/14225158/65395040-5bdc6680-dd4a-11e9-8226-250f77dfeaf3.png">

### How was this patch tested?
Tested using jekyll build --serve

Closes #25894 from dilipbiswal/sql-ref-create-function.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-22 08:56:44 -05:00
Huaxin Gao 877993847c [SPARK-28787][DOC][SQL] Document LOAD DATA statement in SQL Reference
### What changes were proposed in this pull request?
Document LOAD DATA statement in SQL Reference

### Why are the changes needed?
To complete the SQL Reference

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Tested using jekyll build --serve

Here are the screen shots:

![image](https://user-images.githubusercontent.com/13592258/64073167-e7cd0800-cc4e-11e9-9fcc-92fe4cb5a942.png)

![image](https://user-images.githubusercontent.com/13592258/64073169-ee5b7f80-cc4e-11e9-9a36-cc023bcd32b1.png)

![image](https://user-images.githubusercontent.com/13592258/64073170-f4516080-cc4e-11e9-9101-2609a01fe6fe.png)

Closes #25522 from huaxingao/spark-28787.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-22 08:55:37 -05:00
denglingang 467c3f610f [SPARK-29529][DOCS] Remove unnecessary orc version and hive version in doc
### What changes were proposed in this pull request?

This PR removes the unnecessary ORC and Hive versions from the docs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26146 from denglingang/SPARK-24576.

Lead-authored-by: denglingang <chitin1027@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 14:49:23 +09:00
Yuming Wang 9e42c52c77 [MINOR][DOCS] Fix incorrect EqualNullSafe symbol in sql-migration-guide.md
### What changes were proposed in this pull request?
This PR fixes the incorrect `EqualNullSafe` symbol in `sql-migration-guide.md`.

### Why are the changes needed?
Fix documentation error.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #26163 from wangyum/EqualNullSafe-symbol.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-18 10:58:17 -05:00
Jungtaek Lim (HeartSaVioR) 100fc58da5 [SPARK-28869][CORE] Roll over event log files
### What changes were proposed in this pull request?

This patch is a part of [SPARK-28594](https://issues.apache.org/jira/browse/SPARK-28594) and design doc for SPARK-28594 is linked here: https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit?usp=sharing

This patch proposes adding a new feature to event logging: rolling event log files based on a configured file size.

Previously, event logging was done with a single file, and the related codebase (`EventLoggingListener`/`FsHistoryProvider`) was tightly coupled with it. This patch adds a layer on both the reader (`EventLogFileReader`) and the writer (`EventLogFileWriter`) to decouple "handling events" from "how to read/write events from/to file".

This patch adds two properties, `spark.eventLog.rollLog` and `spark.eventLog.rollLog.maxFileSize`, which provide configurable rolling behavior. The feature is disabled by default, as we only expect huge event logs for huge or long-running applications. In other cases a single event log file is sufficient and still simpler.
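
The idea of rolling by file size can be sketched with a toy in-memory writer. This is not Spark's `EventLogFileWriter`: the `RollingEventLog` class, its fields, and the choice to roll over before a write that would exceed the limit are all assumptions made for illustration.

```python
class RollingEventLog:
    """Toy in-memory writer that starts a new 'file' once a size limit is hit."""

    def __init__(self, max_file_size):
        self.max_file_size = max_file_size
        self.files = [[]]        # each inner list holds the lines of one file
        self.current_size = 0    # bytes written to the current file

    def write(self, event_line):
        size = len(event_line.encode("utf-8")) + 1  # +1 for the newline
        # Roll over before writing if this event would exceed the limit,
        # unless the current file is still empty.
        if self.files[-1] and self.current_size + size > self.max_file_size:
            self.files.append([])
            self.current_size = 0
        self.files[-1].append(event_line)
        self.current_size += size


log = RollingEventLog(max_file_size=20)
for event in ["event-1", "event-2", "event-3", "event-4", "event-5"]:
    log.write(event)
print(len(log.files))  # 3
```

With rolled files, older entries in `log.files` are never written to again, which is what makes deleting them safe while the application keeps running.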

### Why are the changes needed?

This is a part of SPARK-28594, which addresses event logs growing indefinitely for long-running applications.

This patch itself also provides an option for the situation where the event log file gets huge and consumes storage. End users may give up replaying their events and want to delete the event log file, but while the application is still running and writing the file, it is not safe to delete it. With rolling enabled, end users will be able to delete some of the old files.

### Does this PR introduce any user-facing change?

No, as the new feature is turned off by default.

### How was this patch tested?

Added unit tests, as well as basic manual tests.

Basic manual tests - ran SHS, ran structured streaming query with roll event log enabled, verified split files are generated as well as SHS can load these files, with handling app status as incomplete/complete.

Closes #25670 from HeartSaVioR/SPARK-28869.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-17 11:15:25 -07:00
Jiajia Li dc0bc7a6eb [MINOR][DOCS] Fix some typos
### What changes were proposed in this pull request?

This PR fixes a few typos:
1. Sparks => Spark's
2. parallize => parallelize
3. doesnt => doesn't

Closes #26140 from plusplusjiajia/fix-typos.

Authored-by: Jiajia Li <jiajia.li@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-17 07:22:01 -07:00
Gengliang Wang 322ec0ba9b [SPARK-28885][SQL] Follow ANSI store assignment rules in table insertion by default
### What changes were proposed in this pull request?

When inserting a value into a column with a different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out of range (overflow).
2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1.
3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment, e.g., converting either `double` to `int` or `decimal` to `double` is not allowed. The rules were originally for the Dataset encoder. As far as I know, no mainstream DBMS is using this policy by default.

Currently, the V1 data source uses "Legacy" policy by default, while V2 uses "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 in Spark 3.0.
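
The difference between the Legacy and ANSI handling of an out-of-range integral value can be illustrated with a small standalone sketch. This is not Spark code: the function names `legacy_cast_to_byte` and `ansi_cast_to_byte` are hypothetical and only model the arithmetic described above for a Byte field.

```python
def legacy_cast_to_byte(value: int) -> int:
    """Model of the Legacy policy: keep only the low-order 8 bits,
    reinterpreted as a signed byte (like a Java/Scala narrowing cast)."""
    low = value & 0xFF
    return low - 256 if low >= 128 else low


def ansi_cast_to_byte(value: int) -> int:
    """Model of the ANSI policy: reject out-of-range values at runtime
    instead of silently truncating them."""
    if not -128 <= value <= 127:
        raise ArithmeticError(f"{value} is out of range for a byte")
    return value


if __name__ == "__main__":
    print(legacy_cast_to_byte(257))  # low-order bits of 257 give 1
    try:
        ansi_cast_to_byte(257)
    except ArithmeticError as err:
        print("ANSI model rejected the value:", err)
```

The exception type used here is an arbitrary choice for the sketch; the description above only says that a runtime exception is thrown.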

### Why are the changes needed?

Following the ANSI SQL standard is most reasonable among the 3 policies.

### Does this PR introduce any user-facing change?

Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.

### How was this patch tested?

Unit test

Closes #26107 from gengliangwang/ansiPolicyAsDefault.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-15 10:41:37 -07:00
shivusondur aa1acfe078 [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference
### What changes were proposed in this pull request?
Added the reference for SHOW TABLES sql command.

### Why are the changes needed?
To help users use the command.

### Does this PR introduce any user-facing change?
It updates the SQL command reference doc.

### How was this patch tested?
<details>
<summary> Attached the Snap</summary>

![image](https://user-images.githubusercontent.com/7912929/66623173-1eac1b80-ec08-11e9-8357-9f6323e5fc48.png)

![image](https://user-images.githubusercontent.com/7912929/65384657-87f3e980-dd42-11e9-90fa-6650ee68e005.png)

</details>

Closes #25561 from shivusondur/jiraSHOWTBLS.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-12 09:21:44 -05:00
Bryan Cutler 6390f02f9f [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
### What changes were proposed in this pull request?

Add documentation to SQL programming guide to use PyArrow >= 0.15.0 with current versions of Spark.

### Why are the changes needed?

Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Ran pandas_udfs tests using PyArrow 0.15.0 with environment variable set.

Closes #26045 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-11 09:19:34 +09:00
Luca Canali 2b3c3793c9 [SPARK-29032][FOLLOWUP][DOCS] Add PrometheusServlet in the monitoring documentation
This adds an entry about PrometheusServlet to the documentation, following SPARK-29032

### Why are the changes needed?

The monitoring documentation lists all the available metrics sinks, this should be added to the list for completeness.

Closes #26081 from LucaCanali/FollowupSpark29032.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-10 08:57:53 -07:00
Sean Owen 3b0bca42ac [SPARK-29401][FOLLOWUP] Additional cases where a .parallelize call with Array is ambiguous in 2.13
This is just a followup on https://github.com/apache/spark/pull/26062 -- see it for more detail.

I think we will eventually find more cases of this. It's hard to get them all at once as there are many different types of compile errors in earlier modules. I'm trying to address them in as big a chunk as possible.

Closes #26074 from srowen/SPARK-29401.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 10:27:05 -07:00
Maxim Gekk c97b3ed279 [SPARK-24640][SQL][FOLLOWUP] Update the SQL migration guide about size(NULL)
### What changes were proposed in this pull request?
The commit 4e6d31f570 changed default behavior of `size()` for the `NULL` input. In this PR, I propose to update the SQL migration guide.

### Why are the changes needed?
To inform users about new behavior of the `size()` function for the `NULL` input.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26066 from MaxGekk/size-null-migration-guide.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-09 16:37:35 +08:00
Sean Owen ee83d09b53 [SPARK-29401][CORE][ML][SQL][GRAPHX][TESTS] Replace calls to .parallelize Arrays of tuples, ambiguous in Scala 2.13, with Seqs of tuples
### What changes were proposed in this pull request?

Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like:
```
[ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives:
  (x: Unit,xs: Unit*)Array[Unit] <and>
  (x: Double,xs: Double*)Array[Double] <and>
  (x: Float,xs: Float*)Array[Float] <and>
  (x: Long,xs: Long*)Array[Long] <and>
  (x: Int,xs: Int*)Array[Int] <and>
  (x: Char,xs: Char*)Array[Char] <and>
  (x: Short,xs: Short*)Array[Short] <and>
  (x: Byte,xs: Byte*)Array[Byte] <and>
  (x: Boolean,xs: Boolean*)Array[Boolean]
 cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
```
Using a `Seq` instead appears to resolve it, and is effectively equivalent.

### Why are the changes needed?

To better cross-build for 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests.

Closes #26062 from srowen/SPARK-29401.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 20:22:02 -07:00
sandeep katta 69b0cc1962 [SPARK-28797][DOC] Document DROP FUNCTION statement in SQL Reference
### What changes were proposed in this pull request?
Add DROP FUNCTION sql description in SQL reference

### Why are the changes needed?
Currently Spark has no complete SQL guide, so it is better to document all the SQL commands. This JIRA is a sub-task of that effort.

### Does this PR introduce any user-facing change?
Yes. Before this change, users could not find any reference for the DROP FUNCTION command in the Spark docs.

After Fix:
![image](https://user-images.githubusercontent.com/35216143/66134570-240cd300-e616-11e9-9c78-259c0d355378.png)

![image](https://user-images.githubusercontent.com/35216143/65397825-d059e880-ddd0-11e9-8bd3-a65ccae56063.png)

![image](https://user-images.githubusercontent.com/35216143/66404731-9f032e80-ea06-11e9-8fef-1e266efa4c66.png)

### How was this patch tested?
tested with jekyll build

Closes #25553 from sandeep-katta/28797.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-08 19:47:39 -05:00
Xingbo Jiang 56a3bebb1b [SPARK-27492][DOC][FOLLOWUP] Update resource scheduling user docs
### What changes were proposed in this pull request?

Fix a config name typo from the resource scheduling user docs. In case users might get confused with the wrong config name, we'd better fix this typo.

### How was this patch tested?

Document change, no need to run test.

Closes #26047 from jiangxb1987/doc.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-07 16:21:39 -07:00
Huaxin Gao f0534fb9e5 [SPARK-28816][DOC][SQL] Document ADD JAR statement in SQL Reference
### What changes were proposed in this pull request?
document ADD JAR statement in SQL Reference

### Why are the changes needed?
To complete SQL reference

### Does this PR introduce any user-facing change?
yes

after change:
![image](https://user-images.githubusercontent.com/13592258/66337691-80147780-e8f4-11e9-9d7c-7c1e7ff5379a.png)

![image](https://user-images.githubusercontent.com/13592258/66337704-860a5880-e8f4-11e9-93fa-789695de29d7.png)

![image](https://user-images.githubusercontent.com/13592258/66337721-8b67a300-e8f4-11e9-9056-998187a16c7b.png)

![image](https://user-images.githubusercontent.com/13592258/66337736-928eb100-e8f4-11e9-91c5-d8935a7b93b5.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25895 from huaxingao/spark_28816.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 13:39:03 -05:00
Huaxin Gao 5a512e86e9 [SPARK-28800][DOC][SQL] Document REPAIR TABLE statement in SQL Reference
### What changes were proposed in this pull request?
Document REPAIR TABLE statement in SQL Reference.

### Why are the changes needed?
To complete SQL reference.

### Does this PR introduce any user-facing change?
Yes.

After the change, we will have the following
![image](https://user-images.githubusercontent.com/13592258/66271480-461f7480-e813-11e9-9b40-cbffec1221ae.png)

![image](https://user-images.githubusercontent.com/13592258/66261968-4fb1c980-e78c-11e9-9db0-fcd6f458fd39.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25884 from huaxingao/spark-28800.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-06 11:19:13 -05:00
Huaxin Gao 228b1ea96c [SPARK-28813][DOC][SQL] Document SHOW CREATE TABLE in SQL Reference
### What changes were proposed in this pull request?
Document SHOW CREATE TABLE statement in SQL Reference

### Why are the changes needed?
To complete the SQL reference.

### Does this PR introduce any user-facing change?
Yes.

after the change:

![image](https://user-images.githubusercontent.com/13592258/66239427-b2349800-e6ae-11e9-8f78-f9e8ed85ab3b.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25885 from huaxingao/spark-28813.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-04 16:16:00 -05:00
HyukjinKwon 0f48aafab8 [SPARK-29339][R] Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build)
### What changes were proposed in this pull request?

This PR proposes:

1. to use `is.data.frame` to check if it is a DataFrame.
2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI.

### Why are the changes needed?

1. To support SparkR with Arrow 0.14
2. To check if there's any regression and if it works correctly.

### Does this PR introduce any user-facing change?

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

**Before:**

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

**After:**

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

### How was this patch tested?

AppVeyor

Closes #25993 from HyukjinKwon/arrow-r-appveyor.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-04 08:56:45 +09:00
Terry Kim f2ead4d0b5 [SPARK-28970][SQL] Implement USE CATALOG/NAMESPACE for Data Source V2
### What changes were proposed in this pull request?
This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#)

It also exposes `currentCatalog` in `CatalogManager`.

Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog).

### Why are the changes needed?
There is currently no mechanism to change the current catalog/namespace through SQL commands.

### Does this PR introduce any user-facing change?
Yes, you can perform the following:
```scala
// Sets the current catalog to 'testcat'
spark.sql("USE CATALOG testcat")

// Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'.
spark.sql("USE ns1.ns2 IN testcat")

// Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace.
spark.sql("SHOW NAMESPACES")
```

### How was this patch tested?
Added new unit tests.

Closes #25771 from imback82/use_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-02 21:55:21 +08:00
Jungtaek Lim (HeartSaVioR) 39eb79ac4b [SPARK-28074][SS] Log warn message on possible correctness issue for multiple stateful operations in single query
## What changes were proposed in this pull request?

Please refer to [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) for the rationale of this patch.

This patch adds the functionality to detect a possible correctness issue with multiple stateful operations in a single streaming query, and logs a warning message to inform end users.

This patch also documents some notes on the caveats of using multiple stateful operations in a single query, and provides one known alternative.

## How was this patch tested?

Added new UTs in UnsupportedOperationsSuite to test various combinations of stateful operators in a streaming query.

Closes #24890 from HeartSaVioR/SPARK-28074.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-30 08:18:23 -05:00
Maxim Gekk 4bffcf5a34 [SPARK-29275][SQL][DOC] Describe special date/timestamp values in the SQL migration guide
### What changes were proposed in this pull request?

Updated the SQL migration guide regarding the recently supported special date and timestamp values, see https://github.com/apache/spark/pull/25716 and https://github.com/apache/spark/pull/25708.

Closes #25834

### Why are the changes needed?
To let users know about the new feature in Spark 3.0.

### Does this PR introduce any user-facing change?
No

Closes #25948 from MaxGekk/special-values-migration-guide.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-27 10:36:20 -07:00
Tomoko Komiyama 8beb736a00 [SPARK-29256][DOCS] Fix typo in building document
### What changes were proposed in this pull request?
 Changed 'Phive-thriftserver' to ' -Phive-thriftserver'.

### Why are the changes needed?
 Typo

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Manually tested.

Closes #25937 from TomokoKomiyama/fix-build-doc.

Authored-by: Tomoko Komiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-26 08:23:43 -05:00
WeichenXu d8b0914c2e [SPARK-28957][SQL] Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar"
### What changes were proposed in this pull request?

Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar"
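
The key mapping described above can be sketched as follows. This is a standalone illustration, not Spark's implementation: the helper name `copy_spark_hive_props` is hypothetical, and plain dictionaries stand in for the Spark properties and the Hadoop configuration.

```python
def copy_spark_hive_props(spark_props: dict, hadoop_conf: dict) -> dict:
    """Copy every 'spark.hive.*' entry into hadoop_conf under the key
    with the leading 'spark.' removed, e.g. spark.hive.foo -> hive.foo."""
    prefix = "spark."
    for key, value in spark_props.items():
        if key.startswith("spark.hive."):
            hadoop_conf[key[len(prefix):]] = value
    return hadoop_conf


if __name__ == "__main__":
    props = {"spark.hive.foo": "bar", "spark.executor.memory": "2g"}
    print(copy_spark_hive_props(props, {}))  # only the hive.* key is copied
```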

### Why are the changes needed?
To provide a Spark-side config entry for Hive configurations.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT.

Closes #25661 from WeichenXu123/add_hive_conf.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-25 15:54:44 +08:00
Jungtaek Lim (HeartSaVioR) 4513f1c0dc [SPARK-26848][SQL][SS] Introduce new option to Kafka source: offset by timestamp (starting/ending)
## What changes were proposed in this pull request?

This patch introduces new options "startingOffsetsByTimestamp" and "endingOffsetsByTimestamp" to set a specific timestamp per topic (since we're unlikely to set a different value per partition), so that the source starts reading from the offsets which have an equal or greater timestamp, and stops reading at the offsets which have an equal or greater timestamp.

The new options are optional, of course, and take precedence over the existing offset options.

## How was this patch tested?

New unit tests added. Also manually tested basic functionality with Kafka 2.0.0 server.

Running query below

```
val df = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "spark_26848_test_v1,spark_26848_test_2_v1")
  .option("startingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669142193, "spark_26848_test_2_v1": 1549669240965}""")
  .option("endingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669265676, "spark_26848_test_2_v1": 1549699265676}""")
  .load().selectExpr("CAST(value AS STRING)")

df.show()
```

with the records below (each is one string whose numeric part records the timestamp at which it was put) in

topic `spark_26848_test_v1`
```
hello1 1549669142193
world1 1549669142193
hellow1 1549669240965
world1 1549669240965
hello1 1549669265676
world1 1549669265676
```

topic `spark_26848_test_2_v1`

```
hello2 1549669142193
world2 1549669142193
hello2 1549669240965
world2 1549669240965
hello2 1549669265676
world2 1549669265676
```

the result of `df.show()` follows:
```
+--------------------+
|               value|
+--------------------+
|world1 1549669240965|
|world1 1549669142193|
|world2 1549669240965|
|hello2 1549669240965|
|hellow1 154966924...|
|hello2 1549669265676|
|hello1 1549669142193|
|world2 1549669265676|
+--------------------+
```

Note that endingOffsets (as well as endingOffsetsByTimestamp) are exclusive.
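
The selection rule can be checked against the example above with a simplified model. This is only a sketch, not the Kafka source implementation: `offset_for_timestamp` and `read_range` are hypothetical helpers, each topic is treated as a single list ordered by timestamp, and falling back to the end of the log when no record has an equal or greater timestamp is an assumption of this model.

```python
from bisect import bisect_left


def offset_for_timestamp(timestamps, ts):
    """First offset whose timestamp is >= ts (len(timestamps) if none)."""
    return bisect_left(timestamps, ts)


def read_range(records, start_ts, end_ts):
    """Records from the starting offset up to, but excluding, the ending offset."""
    timestamps = [t for _, t in records]
    start = offset_for_timestamp(timestamps, start_ts)
    end = offset_for_timestamp(timestamps, end_ts)
    return records[start:end]


topic1 = [("hello1", 1549669142193), ("world1", 1549669142193),
          ("hellow1", 1549669240965), ("world1", 1549669240965),
          ("hello1", 1549669265676), ("world1", 1549669265676)]
topic2 = [("hello2", 1549669142193), ("world2", 1549669142193),
          ("hello2", 1549669240965), ("world2", 1549669240965),
          ("hello2", 1549669265676), ("world2", 1549669265676)]

rows = (read_range(topic1, 1549669142193, 1549669265676)
        + read_range(topic2, 1549669240965, 1549699265676))

if __name__ == "__main__":
    for text, ts in rows:
        print(text, ts)
```

Under these assumptions the model selects the same eight records that `df.show()` lists: the two records of `spark_26848_test_v1` stamped 1549669265676 are excluded because the ending offset is exclusive.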

Closes #23747 from HeartSaVioR/SPARK-26848.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-23 19:25:36 -05:00
xy_xin 655356e825 [SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan
### What changes were proposed in this pull request?

This PR supports UPDATE in the parser and adds the corresponding logical plan. The SQL syntax is a standard UPDATE statement:
```
UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate?
```

### Why are the changes needed?

With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New test cases added.

Closes #25626 from xianyinxin/SPARK-28892.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-23 19:25:56 +08:00