Commit graph

651 commits

Author SHA1 Message Date
Jungtaek Lim (HeartSaVioR) 481fb63f97 [MINOR][SQL][SS] Remove TODO comments as var in case class is discouraged but worth breaking it
### What changes were proposed in this pull request?

This patch removes TODO comments which are left to address changing case classes having vars to normal classes in spark-sql-kafka module - the pattern is actually discouraged, but still worth to break it, as we already use automatic toString implementation and we may be using more.

### Why are the changes needed?

Described above.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26992 from HeartSaVioR/SPARK-30337.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-26 11:47:41 +09:00
Jungtaek Lim (HeartSaVioR) 2164243526 [SPARK-28144][SPARK-29294][SS][FOLLOWUP] Use SystemTime defined in Kafka Time interface
### What changes were proposed in this pull request?

This patch addresses review comments in #26960 (https://github.com/apache/spark/pull/26960#discussion_r360661930 / https://github.com/apache/spark/pull/26960#discussion_r360661947) which were not addressed in the patch. Addressing these review comments will let the code less dependent on actual implementation as it only relies on `Time` interface in Kafka.

### Why are the changes needed?

These were review comments in previous PR and they bring actual benefit though they're minors.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26979 from HeartSaVioR/SPARK-29294-follow-up.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-24 11:39:03 +09:00
Jungtaek Lim (HeartSaVioR) 7bff2db9ed [SPARK-21869][SS] Revise Kafka producer pool to implement 'expire' correctly
This patch revises Kafka producer pool (cache) to implement 'expire' correctly.

Current implementation of Kafka producer cache leverages Guava cache, which decides cached producer instance to be expired if the instance is not "accessed" from cache. The behavior defines expiration time as "last accessed time + timeout", which is incorrect because some task may use the instance longer than timeout. There's no concept of "returning" in Guava cache as well, so it cannot be fixed with Guava cache.

This patch introduces a new pool implementation which tracks "reference count" of cached instance, and defines expiration time for the instance as "last returned time + timeout" if the reference count goes 0, otherwise Long.MaxValue (effectively no expire). Expiring instances will be done with evict thread explicitly instead of evicting in part of handling acquire. (It might bring more overhead, but it ensures clearing expired instances even the pool is idle.)

This patch also creates a new package `producer` under `kafka010`, to hide the details from `kafka010` package. In point of `kafka010` package's view, only acquire()/release()/reset() are available in pool, and even for CachedKafkaProducer the package cannot close the producer directly.

Explained above.

Yes, but only for the way of expiring cached instances. (The difference is described above.) Each executor leveraging spark-sql-kafka would have one eviction thread.

New and existing UTs.

Closes #26845 from HeartSaVioR/SPARK-21869-revised.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-12-23 14:19:33 -08:00
Kazuaki Ishizaki f31d9a629b [MINOR][DOC][SQL][CORE] Fix typo in document and comments
### What changes were proposed in this pull request?

Fixed typo in `docs` directory and in other directories

1. Find typo in `docs` and apply fixes to files in all directories
2. Fix `the the` -> `the`

### Why are the changes needed?

Better readability of documents

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

No test needed

Closes #26976 from kiszk/typo_20191221.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-21 14:08:58 -08:00
Jungtaek Lim (HeartSaVioR) 8384ff4c9d [SPARK-28144][SPARK-29294][SS] Upgrade Kafka to 2.4.0
### What changes were proposed in this pull request?

This patch upgrades the version of Kafka to 2.4, which supports Scala 2.13.

There're some incompatible changes in Kafka 2.4 which the patch addresses as well:

* `ZkUtils` is removed -> Replaced with `KafkaZkClient`
* Majority of methods are removed in `AdminUtils` -> Replaced with `AdminZkClient`
* Method signature of `Scheduler.schedule` is changed (return type) -> leverage `DeterministicScheduler` to avoid implementing `ScheduledFuture`

### Why are the changes needed?

* Kafka 2.4 supports Scala 2.13

### Does this PR introduce any user-facing change?

No, as Kafka API is known to be compatible across versions.

### How was this patch tested?

Existing UTs

Closes #26960 from HeartSaVioR/SPARK-29294.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-21 14:01:25 -08:00
Sean Owen 7dff3b125d [SPARK-30272][SQL][CORE] Remove usage of Guava that breaks in 27; replace with workalikes
### What changes were proposed in this pull request?

Remove usages of Guava that no longer work in Guava 27, and replace with workalikes. I'll comment on key types of changes below.

### Why are the changes needed?

Hadoop 3.2.1 uses Guava 27, so this helps us avoid problems running on Hadoop 3.2.1+ and generally lowers our exposure to Guava.

### Does this PR introduce any user-facing change?

Should not be, but see notes below on hash codes and toString.

### How was this patch tested?

Existing tests will verify whether these changes break anything for Guava 14.
I manually built with an updated version and it compiles with Guava 27; tests running manually locally now.

Closes #26911 from srowen/SPARK-30272.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-20 08:55:04 -06:00
Yuming Wang 696288f623 [INFRA] Reverts commit 56dcd79 and c216ef1
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79

2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1

### Why are the changes needed?
Shouldn't change master.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master

Closes #26915 from wangyum/revert-master.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-16 19:57:44 -07:00
Yuming Wang 56dcd79992 Preparing development version 3.0.1-SNAPSHOT 2019-12-17 01:57:27 +00:00
Yuming Wang c216ef1d03 Preparing Spark release v3.0.0-preview2-rc2 2019-12-17 01:57:21 +00:00
Maxim Gekk b03ce63c05 [SPARK-30258][TESTS] Eliminate warnings of deprecated Spark APIs in tests
### What changes were proposed in this pull request?
In the PR, I propose to move all tests that use deprecated Spark APIs to separate test classes, and add the annotation:
```scala
deprecated("This test suite will be removed.", "3.0.0")
```
The annotation suppress warnings from already deprecated methods and classes.

### Why are the changes needed?
The warnings about deprecated Spark APIs in tests does not indicate any issues because the tests use such APIs intentionally. Eliminating the warnings allows to highlight other warnings that could show real problems.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites and by
- DeprecatedAvroFunctionsSuite
- DeprecatedDateFunctionsSuite
- DeprecatedDatasetAggregatorSuite
- DeprecatedStreamingAggregationSuite
- DeprecatedWholeStageCodegenSuite

Closes #26885 from MaxGekk/eliminate-deprecate-warnings.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-16 18:24:32 -06:00
Jungtaek Lim (HeartSaVioR) 94eb66593a [SPARK-30227][SQL] Add close() on DataWriter interface
### What changes were proposed in this pull request?

This patch adds close() method to the DataWriter interface, which will become the place to cleanup the resource.

### Why are the changes needed?

The lifecycle of DataWriter instance ends at either commit() or abort(). That makes datasource implementors to feel they can place resource cleanup in both sides, but abort() can be called when commit() fails; so they have to ensure they don't do double-cleanup if cleanup is not idempotent.

### Does this PR introduce any user-facing change?

Depends on the definition of user; if they're developers of custom DSv2 source, they have to add close() in their DataWriter implementations. It's OK to just add close() with empty content as they should have already dealt with resource cleanup in commit/abort, but they would love to migrate the resource cleanup logic to close() as it avoids double cleanup. If they're just end users using the provided DSv2 source (regardless of built-in/3rd party), no change.

### How was this patch tested?

Existing tests.

Closes #26855 from HeartSaVioR/SPARK-30227.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-13 16:12:41 +08:00
Fokko Driesprong 99ea324b6f [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas
Follow up of https://github.com/apache/spark/pull/24405

### What changes were proposed in this pull request?
The current implementation of _from_avro_ and _AvroDataToCatalyst_ doesn't allow doing schema evolution since it requires the deserialization of an Avro record with the exact same schema with which it was serialized.

The proposed change is to add a new option `actualSchema` to allow passing the schema used to serialize the records. This allows using a different compatible schema for reading by passing both schemas to _GenericDatumReader_. If no writer's schema is provided, nothing changes from before.

### Why are the changes needed?
Consider the following example.

```
// schema ID: 1
val schema1 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"}
     ]
}
"""

// schema ID: 2
val schema2 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"},
        {"name": "col3", "type": "string", "default": ""}
     ]
}
"""
```

The two schemas are compatible - i.e. you can use `schema2` to deserialize events serialized with `schema1`, in which case there will be the field `col3` with the default value.

Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. **We want to combine them into one dataframe** for storing or further processing.

With the current `from_avro` function we can only decode each of them with the corresponding schema:

```
scalaval df1 = ... // Avro events created with schema1
df1: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scalaval decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>]

scalaval df2= ... // Avro events created with schema2
df2: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scalaval decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded")
decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```

but then `decodedDf1` and `decodedDf2` have different Spark schemas and we can't union them. Instead, with the proposed change we can decode `df1` in the following way:

```
scalaimport scala.collection.JavaConverters._
scalaval decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("actualSchema" -> schema1).asJava) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```

so that both dataframes have the same schemas and can be merged.

### Does this PR introduce any user-facing change?
This PR allows users to pass a new configuration but it doesn't affect current code.

### How was this patch tested?
A new unit test was added.

Closes #26780 from Fokko/SPARK-27506.

Lead-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Gianluca Amori <gianluca.amori@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-12-11 01:26:29 -08:00
Shixiong Zhu cfd7ca9a06
Revert "[SPARK-21869][SS] Apply Apache Commons Pool to Kafka producer"
This reverts commit 3641c3dd69.
2019-12-10 13:38:38 -08:00
Jungtaek Lim (HeartSaVioR) 538b8d101c [SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary imports
### What changes were proposed in this pull request?

This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (Github Action caught it and I can reproduce it locally). Looks like Jenkins build may have different policy on checking Java style check or less accurate.

### Why are the changes needed?

Java linter starts complaining.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

lint-java passed locally

This closes #26819

Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-09 08:57:20 -08:00
Gengliang Wang a717d219a6 [SPARK-30159][SQL][TESTS] Fix the method calls of QueryTest.checkAnswer
### What changes were proposed in this pull request?

Before this PR, the method `checkAnswer` in Object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen.
The actual exceptions are thrown in the trait `QueryTest`.

However, there are some test suites(`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers.

After this PR:
1. the method `checkAnswer` in Object `QueryTest` will fail tests on errors or unexpected results.
2. add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. There are some test suites use this one to customize the test failure message.
3. for the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from Object `QueryTest`.

### Why are the changes needed?

We should fix these method calls to perform actual validations in test suites.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #26788 from gengliangwang/fixCheckAnswer.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-09 22:19:08 +09:00
angerszhu da27f91560 [SPARK-29957][TEST] Reset MiniKDC's default enctypes to fit jdk8/jdk11
### What changes were proposed in this pull request?

Hadoop jira: https://issues.apache.org/jira/browse/HADOOP-12911
In this jira, the author said to replace origin Apache Directory project which is not maintained (but not said it won't work well in jdk11) to Apache Kerby which is java binding(fit java version).

And in Flink: https://github.com/apache/flink/pull/9622
Author show the reason why hadoop-2.7.2's  `MminiKdc` failed with jdk11.
Because new encryption types of `es128-cts-hmac-sha256-128` and `aes256-cts-hmac-sha384-192` (for Kerberos 5) enabled by default were added in Java 11.
Spark with `hadoop-2.7's MiniKdc`does not support these encryption types and does not work well when these encryption types are enabled, which results in the authentication failure.

And when I test hadoop-2.7.2's minikdc in local, the kerberos 's debug error message is  read message stream failed, message can't match.

### Why are the changes needed?
Support jdk11 with hadoop-2.7

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
Existed UT

Closes #26594 from AngersZhuuuu/minikdc-3.2.0.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 23:12:45 -08:00
Ximo Guanter 54c5087a3a [SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory
### What changes were proposed in this pull request?
When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions.

However, when someone is implementing WriteBuilder we only pass them the schema, but not the number of partitions. This is an asymetrical developer experience.

This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory that adds the number of partitions of the data that is going to be written.

### Why are the changes needed?
Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example:

it could be used to provision a Kafka topic with a specific number of partitions
it could be used to scale a microservice prior to sending the data to it
it could be used to create a DsV2 that sends the data to another spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions)
### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tests passed

Closes #26591 from edrevo/temp.

Authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-22 00:19:25 +08:00
Sean Owen 1febd373ea [MINOR][TESTS] Replace JVM assert with JUnit Assert in tests
### What changes were proposed in this pull request?

Use JUnit assertions in tests uniformly, not JVM assert() statements.

### Why are the changes needed?

assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.

### Does this PR introduce any user-facing change?

No. The assertion logic should be identical.

### How was this patch tested?

Existing tests.

Closes #26581 from srowen/assertToJUnit.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 14:04:15 -06:00
shivsood a834dba120 Revert "[SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils
This reverts commit f7e53865 i.e PR #26301 from master

Closes #26583 from shivsood/revert_29644_master.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-18 18:44:16 -08:00
shivsood 32d44b1d0e [SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils
### What changes were proposed in this pull request?
Corrected ShortType and ByteType mapping to SmallInt and TinyInt, corrected setter methods to set ShortType and ByteType  as setShort() and setByte(). Changes in JDBCUtils.scala
Fixed Unit test cases to where applicable and added new E2E test cases in to test table read/write using ShortType and ByteType.

#### Problems

- In master in JDBCUtils.scala line number 547 and 551 have a problem where ShortType and ByteType are set as Integers rather than set as Short and Byte respectively.
```
case ShortType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getShort(pos))
The issue was pointed out by maropu

case ByteType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
 stmt.setInt(pos + 1, row.getByte(pos))
```

- Also at line JDBCUtils.scala 247 TinyInt is interpreted wrongly as IntergetType in getCatalystType()

``` case java.sql.Types.TINYINT       => IntegerType ```

- At line 172 ShortType was wrongly interpreted as IntegerType
``` case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT)) ```

- All thru out tests, ShortType and ByteType were being interpreted as IntegerTypes.

### Why are the changes needed?
A given type should be set using the right type.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Corrected Unit test cases where applicable. Validated in CI/CD
Added a test case in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala , MySQLIntegrationSuite.scala to write/read tables from dataframe with cols as shorttype and bytetype. Validated by manual as follows.
```
./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12
```

Closes #26301 from shivsood/shorttype_fix_maropu.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-13 17:56:13 -08:00
Gabor Somogyi 3641c3dd69 [SPARK-21869][SS] Apply Apache Commons Pool to Kafka producer
### What changes were proposed in this pull request?

Kafka producers are now closed when `spark.kafka.producer.cache.timeout` reached which could be significant problem when processing big SQL queries. The workaround was to increase `spark.kafka.producer.cache.timeout` to a number where the biggest SQL query can be finished.

In this PR I've adapted similar solution which already exists on the consumer side, namely applies Apache Commons Pool on the producer side as well. Main advantages choosing this solution:
* Producers are not closed until they're in use
* No manual reference counting needed (which may be error prone)
* Thread-safe by design
* Provides jmx connection to the pool where metrics can be fetched

What this PR contains:
* Introduced producer side parameters to configure pool
* Renamed `InternalKafkaConsumerPool` to `InternalKafkaConnectorPool` and made it abstract
* Created 2 implementations from it: `InternalKafkaConsumerPool` and `InternalKafkaProducerPool`
* Adapted `CachedKafkaProducer` to use `InternalKafkaProducerPool`
* Changed `KafkaDataWriter` and `KafkaDataWriteTask` to release producer even in failure scenario
* Added several new tests
* Extended `KafkaTest` to clear not only producers but consumers as well
* Renamed `InternalKafkaConsumerPoolSuite` to `InternalKafkaConnectorPoolSuite` where only consumer tests are checking the behavior (please see comment for reasoning)

What this PR not yet contains(but intended when the main concept is stable):
* User facing documentation

### Why are the changes needed?
Kafka producer closed after 10 minutes (with default settings).

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.
Cluster tests being started.

Closes #25853 from gaborgsomogyi/SPARK-21869.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-11-07 17:06:32 -08:00
Kent Yao 9562b26914 [SPARK-29757][SQL] Move calendar interval constants together
### What changes were proposed in this pull request?
```java
  public static final int YEARS_PER_DECADE = 10;
  public static final int YEARS_PER_CENTURY = 100;
  public static final int YEARS_PER_MILLENNIUM = 1000;

  public static final byte MONTHS_PER_QUARTER = 3;
  public static final int MONTHS_PER_YEAR = 12;

  public static final byte DAYS_PER_WEEK = 7;
  public static final long DAYS_PER_MONTH = 30L;

  public static final long HOURS_PER_DAY = 24L;

  public static final long MINUTES_PER_HOUR = 60L;

  public static final long SECONDS_PER_MINUTE = 60L;
  public static final long SECONDS_PER_HOUR = MINUTES_PER_HOUR * SECONDS_PER_MINUTE;
  public static final long SECONDS_PER_DAY = HOURS_PER_DAY * SECONDS_PER_HOUR;

  public static final long MILLIS_PER_SECOND = 1000L;
  public static final long MILLIS_PER_MINUTE = SECONDS_PER_MINUTE * MILLIS_PER_SECOND;
  public static final long MILLIS_PER_HOUR = MINUTES_PER_HOUR * MILLIS_PER_MINUTE;
  public static final long MILLIS_PER_DAY = HOURS_PER_DAY * MILLIS_PER_HOUR;

  public static final long MICROS_PER_MILLIS = 1000L;
  public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLIS;
  public static final long MICROS_PER_MINUTE = SECONDS_PER_MINUTE * MICROS_PER_SECOND;
  public static final long MICROS_PER_HOUR = MINUTES_PER_HOUR * MICROS_PER_MINUTE;
  public static final long MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR;
  public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY;
  /* 365.25 days per year assumes leap year every four years */
  public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100;

  public static final long NANOS_PER_MICROS = 1000L;
  public static final long NANOS_PER_MILLIS = MICROS_PER_MILLIS * NANOS_PER_MICROS;
  public static final long NANOS_PER_SECOND = MILLIS_PER_SECOND * NANOS_PER_MILLIS;
```
The above parameters are defined in IntervalUtils, DateTimeUtils, and CalendarInterval, some of them are redundant, some of them are cross-referenced.

### Why are the changes needed?
To simplify code, enhance consistency and reduce risks

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

modified uts

Closes #26399 from yaooqinn/SPARK-29757.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-07 19:48:19 +08:00
Jungtaek Lim (HeartSaVioR) 252ecd333f [SPARK-29635][SS] Extract base test suites between Kafka micro-batch sink and Kafka continuous sink
### What changes were proposed in this pull request?

This patch leverages V2 continuous memory stream to extract tests from Kafka micro-batch sink suite and continuous sink suite and deduplicate them. These tests are basically doing the same, except how to run and verify the result.

### Why are the changes needed?

We no longer have same tests spotted on two places - brings 300 lines deletion.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26292 from HeartSaVioR/SPARK-29635.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-11-06 17:08:42 -08:00
Sean Owen 19b8c71436 [SPARK-29674][CORE] Update dropwizard metrics to 4.1.x for JDK 9+
### What changes were proposed in this pull request?

Update the version of dropwizard metrics that Spark uses for metrics to 4.1.x, from 3.2.x.

### Why are the changes needed?

This helps JDK 9+ support, per for example https://github.com/dropwizard/metrics/pull/1236

### Does this PR introduce any user-facing change?

No, although downstream users with custom metrics may be affected.

### How was this patch tested?

Existing tests.

Closes #26332 from srowen/SPARK-29674.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 15:13:06 -08:00
dengziming 28ccd31aee [SPARK-29611][WEBUI] Sort Kafka metadata by the number of messages
### What changes were proposed in this pull request?

Sort metadata by the number of messages in each Kafka partition

### Why are the changes needed?

help to find the data skewness problem.

### Does this PR introduce any user-facing change?

Yes, add a column count to the metadata and sort by count
![image](https://user-images.githubusercontent.com/26023240/67617886-63e06800-f81a-11e9-8718-be3a0100952e.png)

If you set `minPartitions` configurations with structure structured-streaming which doesn't have the Streaming page, my code changes in `DirectKafkaInputDStream` won't affect the WEB UI page just as it shows in the follow image

![image](https://user-images.githubusercontent.com/26023240/68020762-79520800-fcda-11e9-96cd-f0c64a36f505.png)

### How was this patch tested?

Manual test

Closes #26266 from dengziming/feature_ui_optimize.

Lead-authored-by: dengziming <dengziming@growingio.com>
Co-authored-by: dengziming <swzmdeng@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-01 22:46:34 -07:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
Jungtaek Lim (HeartSaVioR) 762db39c15 [SPARK-29509][SQL][SS] Deduplicate codes from Kafka data source
### What changes were proposed in this pull request?

This patch deduplicates code blocks in Kafka data source which are being repeated multiple times in a method.

### Why are the changes needed?

This change would simplify the code and open possibility to simplify future code whenever fields are added to Kafka writer schema.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26158 from HeartSaVioR/MINOR-deduplicate-kafka-source.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-28 11:14:18 -07:00
Gabor Somogyi 25493919f8 [SPARK-29580][TESTS] Add kerberos debug messages for Kafka secure tests
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` failed lately. After had a look at the logs it just shows the following fact without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and not able to reproduce it we should add more debug information and wait for reproduction with the extended logs.

### Why are the changes needed?
Failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I've started the test manually and checked that such additional debug messages show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhostEXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhostEXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhostEXAMPLE.COM to client/localhostEXAMPLE.COM|kafka/localhostEXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhostEXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhostEXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhostEXAMPLE.COM
```

Closes #26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 14:11:35 -07:00
redsk 8bd8f492ea [SPARK-29500][SQL][SS] Support partition column when writing to Kafka
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-29500

`KafkaRowWriter` now supports setting the Kafka partition by reading a "partition" column in the input dataframe.

Code changes in commit nr. 1.
Test changes in commit nr. 2.
Doc changes in commit nr. 3.

tcondie dongjinleekr srowen

### Why are the changes needed?
While it is possible to configure a custom Kafka Partitioner with
`.option("kafka.partitioner.class", "my.custom.Partitioner")`, this is not enough for certain use cases. See the Jira issue.

### Does this PR introduce any user-facing change?
No, as this behaviour is optional.

### How was this patch tested?
Two new UT were added and one was updated.

Closes #26153 from redsk/feature/SPARK-29500.

Authored-by: redsk <nicola.bova@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-25 08:06:36 -05:00
Dongjoon Hyun 7ecf968527 [SPARK-29567][TESTS] Update JDBC Integration Test Docker Images
### What changes were proposed in this pull request?

This PR updates JDBC Integration Test DBMS Docker Images.

| DBMS | Docker Image Tag | Release |
| ------ | ------------------ | ------ |
| MySQL | mysql:5.7.28 | Oct 13, 2019 |
| PostgreSQL | postgres:12.0-alpine | Oct 3, 2019 |

* For `MySQL`, `SET GLOBAL sql_mode = ''` is added to disable all strict modes because `test("Basic write test")` creates a table like the following. The latest MySQL rejects `0000-00-00 00:00:00` as TIMESTAMP and causes the test case failure.
```
mysql> desc datescopy;
+-------+-----------+------+-----+---------------------+-----------------------------+
| Field | Type      | Null | Key | Default             | Extra                       |
+-------+-----------+------+-----+---------------------+-----------------------------+
| d     | date      | YES  |     | NULL                |                             |
| t     | timestamp | NO   |     | CURRENT_TIMESTAMP   | on update CURRENT_TIMESTAMP |
| dt    | timestamp | NO   |     | 0000-00-00 00:00:00 |                             |
| ts    | timestamp | NO   |     | 0000-00-00 00:00:00 |                             |
| yr    | date      | YES  |     | NULL                |                             |
+-------+-----------+------+-----+---------------------+-----------------------------+
```
* For `PostgreSQL`, I chose the smallest image in `12` releases. It reduces the image size a lot, `312MB` -> `72.8MB`. This is good for CI/CI testing environment.
```
$ docker images | grep postgres
postgres      12.0-alpine          5b681acb1cfc        2 days ago          72.8MB
postgres      11.4                 53912975086f        3 months ago        312MB
```

Note that
- For `MsSqlServer`, we are using `2017-GA-ubuntu` and the next version `2019-CTP3.2-ubuntu` is still `Community Technology Preview` status.
- For `DB2` and `Oracle`, the official images are not available.

### Why are the changes needed?

This is to make it sure we are testing with the latest DBMS images during preparing `3.0.0`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Since this is the integration test, we need to run this manually.
```
build/mvn install -DskipTests
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

Closes #26224 from dongjoon-hyun/SPARK-29567.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-23 16:44:10 -07:00
Dongjoon Hyun 8c3469009c [SPARK-29546][TESTS] Recover jersey-guava test dependency in docker-integration-tests
### What changes were proposed in this pull request?

Currently, `docker-integration-tests` is broken in both JDK8/11.
This PR aims to recover JDBC integration test for JDK8/11.

### Why are the changes needed?

While SPARK-28737 upgraded `Jersey` to 2.29 for JDK11, `docker-integration-tests` is broken because `com.spotify.docker-client` still depends on `jersey-guava`. The latest `com.spotify.docker-client` also has this problem.

- https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2
  -> https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-client/2.19
    -> https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-common/2.19
      -> https://mvnrepository.com/artifact/org.glassfish.jersey.bundles.repackaged/jersey-guava/2.19

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual because this is an integration test suite.

```
$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)

$ build/mvn install -DskipTests

$ build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

```
$ java -version
openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)

$ build/mvn install -DskipTests

$ build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

**BEFORE**
```
*** RUN ABORTED ***
  com.spotify.docker.client.exceptions.DockerException: java.util.concurrent.ExecutionException: javax.ws.rs.ProcessingException: java.lang.NoClassDefFoundError: jersey/repackaged/com/google/common/util/concurrent/MoreExecutors
  at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1607)
  at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1538)
  at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:387)
  at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:81)
```

**AFTER**
```
Run completed in 47 seconds, 999 milliseconds.
Total number of tests run: 30
Suites: completed 6, aborted 0
Tests: succeeded 30, failed 0, canceled 0, ignored 6, pending 0
All tests passed.
```

Closes #26203 from dongjoon-hyun/SPARK-29546.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-23 00:14:43 -07:00
Sean Owen 2d871ad0e7 [SPARK-29392][CORE][SQL][STREAMING] Remove symbol literal syntax 'foo, deprecated in Scala 2.13, in favor of Symbol("foo")
### What changes were proposed in this pull request?

Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")`

### Why are the changes needed?

Avoids ~50 deprecation warnings when attempting to build with 2.13.

### Does this PR introduce any user-facing change?

None, should be no functional change at all.

### How was this patch tested?

Existing tests.

Closes #26061 from srowen/SPARK-29392.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 20:15:37 -07:00
Gabor Somogyi 6b5e0e2469 [SPARK-29054][SS] Invalidate Kafka consumer when new delegation token available
### What changes were proposed in this pull request?
Kafka consumers are cached. If delegation token is used and the token is expired, then exception is thrown. Such case new consumer is created in a Task retry with the latest delegation token. This can be enhanced by detecting the existence of a new delegation token. In this PR I'm detecting whether the token in the consumer is the same as the latest stored in the `UGI` (`targetServersRegex` must match not to create a consumer with another cluster's token).

### Why are the changes needed?
It would be good to avoid Task retry to pick up the latest delegation token.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing + new unit tests.
Additionally executed the following code snippet to measure `ensureConsumerHasLatestToken` time consumption:
```
    val startTimeNs = System.nanoTime()
    for (i <- 0 until 10000) {
      consumer.ensureConsumerHasLatestToken()
    }
    logInfo(s"It took ${TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTimeNs)} ms" +
      " to call ensureConsumerHasLatestToken 10000 times")
```

And here are the results:
```
19/09/11 14:58:22 INFO KafkaDataConsumerSuite: It took 1058 ms to call ensureConsumerHasLatestToken 10000 times
...
19/09/11 14:58:23 INFO KafkaDataConsumerSuite: It took 780 ms to call ensureConsumerHasLatestToken 10000 times
...
19/09/11 15:12:11 INFO KafkaDataConsumerSuite: It took 1032 ms to call ensureConsumerHasLatestToken 10000 times
...
19/09/11 15:12:11 INFO KafkaDataConsumerSuite: It took 679 ms to call ensureConsumerHasLatestToken 10000 times
```

Closes #25760 from gaborgsomogyi/SPARK-29054.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-03 09:34:31 -07:00
Dongjoon Hyun bd031c2173 [SPARK-29307][BUILD][TESTS] Remove scalatest deprecation warnings
### What changes were proposed in this pull request?

This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`

### Why are the changes needed?

According to the Jenkins logs, there are 118 warnings about this.
```
 grep "is deprecated" ~/consoleText | grep scalatest | wc -l
     118
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

After Jenkins passes, we need to check the Jenkins log.

Closes #25982 from dongjoon-hyun/SPARK-29307.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 21:00:11 -07:00
Dongjoon Hyun a0b3d7a323 [SPARK-29300][TESTS] Compare catalyst and avro module benchmark in JDK8/11
### What changes were proposed in this pull request?

This PR regenerate the benchmark results in `catalyst` and `avro` module in order to compare JDK8/JDK11 result.

### Why are the changes needed?

This PR aims to verify that there is no regression on JDK11.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a test-only update. We need to run the benchmark manually.

Closes #25972 from dongjoon-hyun/SPARK-29300.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 17:59:43 -07:00
Sean Owen e1ea806b30 [SPARK-29291][CORE][SQL][STREAMING][MLLIB] Change procedure-like declaration to function + Unit for 2.13
### What changes were proposed in this pull request?

Scala 2.13 emits a deprecation warning for procedure-like declarations:

```
def foo() {
 ...
```

This is equivalent to the following, so should be changed to avoid a warning:

```
def foo(): Unit = {
  ...
```

### Why are the changes needed?

It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files.

Unfortunately, that makes this quite a big change.

### Does this PR introduce any user-facing change?

No behavior change at all.

### How was this patch tested?

Existing tests.

Closes #25968 from srowen/SPARK-29291.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 10:03:23 -07:00
Jungtaek Lim (HeartSaVioR) d3679a9782 [SPARK-27748][SS][FOLLOWUP] Correct the order of logging token as debug log
### What changes were proposed in this pull request?

This patch fixes the order of elements while logging token. Header columns are printed as

```
"TOKENID", "HMAC", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```

whereas the code prints out actual information as

```
"HMAC"(redacted), "TOKENID", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```

This patch fixes this.

### Why are the changes needed?

Not critical but it doesn't line up with header columns.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A, as it's only logged as debug and it's obvious what/where is the problem and how it can be fixed.

Closes #25935 from HeartSaVioR/SPARK-27748-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-26 08:32:03 -07:00
Burak Yavuz c8159c7941 [SPARK-29197][SQL] Remove saveModeForDSV2 from DataFrameWriter
### What changes were proposed in this pull request?

It is very confusing that the default save mode is different between the internal implementation of a Data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table, and therefore create one. That will come in a future PR.

### Why are the changes needed?

Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.

### Does this PR introduce any user-facing change?

It changes the default save mode for V2 Tables in the DataFrameWriter APIs

### How was this patch tested?

Existing tests

Closes #25876 from brkyvz/removeSM.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-26 15:20:04 +08:00
Jungtaek Lim (HeartSaVioR) 4513f1c0dc [SPARK-26848][SQL][SS] Introduce new option to Kafka source: offset by timestamp (starting/ending)
## What changes were proposed in this pull request?

This patch introduces new options "startingOffsetsByTimestamp" and "endingOffsetsByTimestamp" to set specific timestamp per topic (since we're unlikely to set the different value per partition) to let source starts reading from offsets which have equal of greater timestamp, and ends reading until offsets which have equal of greater timestamp.

The new option would be optional of course, and take preference over existing offset options.

## How was this patch tested?

New unit tests added. Also manually tested basic functionality with Kafka 2.0.0 server.

Running query below

```
val df = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "spark_26848_test_v1,spark_26848_test_2_v1")
  .option("startingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669142193, "spark_26848_test_2_v1": 1549669240965}""")
  .option("endingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669265676, "spark_26848_test_2_v1": 1549699265676}""")
  .load().selectExpr("CAST(value AS STRING)")

df.show()
```

with below records (one string which number part remarks when they're put after such timestamp) in

topic `spark_26848_test_v1`
```
hello1 1549669142193
world1 1549669142193
hellow1 1549669240965
world1 1549669240965
hello1 1549669265676
world1 1549669265676
```

topic `spark_26848_test_2_v1`

```
hello2 1549669142193
world2 1549669142193
hello2 1549669240965
world2 1549669240965
hello2 1549669265676
world2 1549669265676
```

the result of `df.show()` follows:
```
+--------------------+
|               value|
+--------------------+
|world1 1549669240965|
|world1 1549669142193|
|world2 1549669240965|
|hello2 1549669240965|
|hellow1 154966924...|
|hello2 1549669265676|
|hello1 1549669142193|
|world2 1549669265676|
+--------------------+
```

Note that endingOffsets (as well as endingOffsetsByTimestamp) are exclusive.

Closes #23747 from HeartSaVioR/SPARK-26848.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-23 19:25:36 -05:00
Maxim Gekk a6a663c437 [SPARK-29141][SQL][TEST] Use SqlBasedBenchmark in SQL benchmarks
### What changes were proposed in this pull request?

Refactored SQL-related benchmark and made them depend on `SqlBasedBenchmark`. In particular, creation of Spark session are moved into `override def getSparkSession: SparkSession`.

### Why are the changes needed?

This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor & extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniformly.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

By running the modified benchmarks.

Closes #25828 from MaxGekk/sql-benchmarks-refactoring.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-18 17:52:23 -07:00
Yuming Wang 8c3f27ceb4 [SPARK-28683][BUILD] Upgrade Scala to 2.12.10
## What changes were proposed in this pull request?

This PR upgrade Scala to **2.12.10**.

Release notes:
- Fix regression in large string interpolations with non-String typed splices
- Revert "Generate shallower ASTs in pattern translation"
- Fix regression in classpath when JARs have 'a.b' entries beside 'a/b'

- Faster compiler: 5–10% faster since 2.12.8
- Improved compatibility with JDK 11, 12, and 13
- Experimental support for build pipelining and outline type checking

More details:
https://github.com/scala/scala/releases/tag/v2.12.10
https://github.com/scala/scala/releases/tag/v2.12.9

## How was this patch tested?

Existing tests

Closes #25404 from wangyum/SPARK-28683.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-18 13:30:36 -07:00
Gabor Somogyi 71e7516132 [SPARK-29027][TESTS] KafkaDelegationTokenSuite fix when loopback canonical host name differs from localhost
### What changes were proposed in this pull request?
`KafkaDelegationTokenSuite` fails on different platforms with the following problem:
```
19/09/11 11:07:42.690 pool-1-thread-1-SendThread(localhost:44965) DEBUG ZooKeeperSaslClient: creating sasl client: Client=zkclient/localhostEXAMPLE.COM;service=zookeeper;serviceHostname=localhost.localdomain
...
NIOServerCxn.Factory:localhost/127.0.0.1:0: Zookeeper Server failed to create a SaslServer to interact with a client during session initiation:
javax.security.sasl.SaslException: Failure to initialize security context [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails)]
	at com.sun.security.sasl.gsskerb.GssKrb5Server.<init>(GssKrb5Server.java:125)
	at com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
	at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
	at org.apache.zookeeper.util.SecurityUtils$2.run(SecurityUtils.java:233)
	at org.apache.zookeeper.util.SecurityUtils$2.run(SecurityUtils.java:229)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.zookeeper.util.SecurityUtils.createSaslServer(SecurityUtils.java:228)
	at org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:44)
	at org.apache.zookeeper.server.ZooKeeperSaslServer.<init>(ZooKeeperSaslServer.java:38)
	at org.apache.zookeeper.server.NIOServerCnxn.<init>(NIOServerCnxn.java:100)
	at org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:186)
	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:227)
	at java.lang.Thread.run(Thread.java:748)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentails)
	at sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
	at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
	at sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
	at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
	at sun.security.jgss.GSSCredentialImpl.<init>(GSSCredentialImpl.java:62)
	at sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
	at com.sun.security.sasl.gsskerb.GssKrb5Server.<init>(GssKrb5Server.java:108)
	... 13 more
NIOServerCxn.Factory:localhost/127.0.0.1:0: Client attempting to establish new session at /127.0.0.1:33742
SyncThread:0: Creating new log file: log.1
SyncThread:0: Established session 0x100003736ae0000 with negotiated timeout 10000 for client /127.0.0.1:33742
pool-1-thread-1-SendThread(localhost:35625): Session establishment complete on server localhost/127.0.0.1:35625, sessionid = 0x100003736ae0000, negotiated timeout = 10000
pool-1-thread-1-SendThread(localhost:35625): ClientCnxn:sendSaslPacket:length=0
pool-1-thread-1-SendThread(localhost:35625): saslClient.evaluateChallenge(len=0)
pool-1-thread-1-EventThread: zookeeper state changed (SyncConnected)
NioProcessor-1: No server entry found for kerberos principal name zookeeper/localhost.localdomainEXAMPLE.COM
NioProcessor-1: No server entry found for kerberos principal name zookeeper/localhost.localdomainEXAMPLE.COM
NioProcessor-1: Server not found in Kerberos database (7)
NioProcessor-1: Server not found in Kerberos database (7)
```

The problem reproducible if the `localhost` and `localhost.localdomain` order exhanged:
```
[systestgsomogyi-build spark]$ cat /etc/hosts
127.0.0.1   localhost.localdomain localhost localhost4 localhost4.localdomain4
::1         localhost.localdomain localhost localhost6 localhost6.localdomain6
```

The main problem is that `ZkClient` connects to the canonical loopback address (which is not necessarily `localhost`).

### Why are the changes needed?
`KafkaDelegationTokenSuite` failed in some environments.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing unit tests on different platforms.

Closes #25803 from gaborgsomogyi/SPARK-29027.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-17 15:30:18 -07:00
Jungtaek Lim (HeartSaVioR) 88c8d5eed2 [SPARK-23539][SS][FOLLOWUP][TESTS] Add UT to ensure existing query doesn't break with default conf of includeHeaders
### What changes were proposed in this pull request?

This patch adds new UT to ensure existing query (before Spark 3.0.0) with checkpoint doesn't break with default configuration of "includeHeaders" being introduced via SPARK-23539.

This patch also modifies existing test which checks type of columns to also check headers column as well.

### Why are the changes needed?

The patch adds missing tests which guarantees backward compatibility of the change of SPARK-23539.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

UT passed.

Closes #25792 from HeartSaVioR/SPARK-23539-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-16 15:22:04 -05:00
Lee Dongjin 1675d5114e [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
## What changes were proposed in this pull request?

This update adds support for Kafka Headers functionality in Structured Streaming.

## How was this patch tested?

With following unit tests:

- KafkaRelationSuite: "default starting and ending offsets with headers" (new)
- KafkaSinkSuite: "batch - write to kafka" (updated)

Closes #22282 from dongjinleekr/feature/SPARK-23539.

Lead-authored-by: Lee Dongjin <dongjin@apache.org>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-13 12:31:28 -05:00
Wenchen Fan 053dd858d3 [SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes
### What changes were proposed in this pull request?

reorganize the packages of DS v2 interfaces/classes:
1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes
2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes
3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes
4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes

### Why are the changes needed?

Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #25700 from cloud-fan/package.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-12 19:59:34 +08:00
dengziming 8f632d7045 [MINOR][DOCS] Fix few typos in the java docs
JIRA :https://issues.apache.org/jira/browse/SPARK-29050
'a hdfs' change into  'an hdfs'
'an unique' change into 'a unique'
'an url' change into 'a url'
'a error' change into 'an error'

Closes #25756 from dengziming/feature_fix_typos.

Authored-by: dengziming <dengziming@growingio.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 09:30:03 +09:00
Jungtaek Lim (HeartSaVioR) b62ef8f793 [SPARK-29007][STREAMING][MLLIB][TESTS] Enforce not leaking SparkContext in tests which creates new StreamingContext with new SparkContext
### What changes were proposed in this pull request?

This patch enforces tests to prevent leaking newly created SparkContext while is created via initializing StreamingContext. Leaking SparkContext in test would make most of following tests being failed as well, so this patch applies defensive programming, trying its best to ensure SparkContext is cleaned up.

### Why are the changes needed?

We got some case in CI build where SparkContext is being leaked and other tests are affected by leaked SparkContext. Ideally we should isolate the environment among tests if possible.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Modified UTs.

Closes #25709 from HeartSaVioR/SPARK-29007.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-11 10:29:13 -07:00