### What changes were proposed in this pull request?
The Kafka offset reader which uses `AdminClient` still uses `UninterruptibleThread` to call it. Since there is no evidence that `AdminClient` suffers from similar issues like [KAFKA-1894](https://issues.apache.org/jira/browse/KAFKA-1894) I'm removing `UninterruptibleThread` usage. In order to put the `AdminClient` under stress and make sure it works I've created the following standalone application: https://github.com/gaborgsomogyi/kafka-admin-interruption
What this PR contains:
* Removed `UninterruptibleThread` from `KafkaOffsetReaderAdmin`
* Removed/modified comments which are not true
* Adapted `KafkaRelationSuite`
* Renamed `partitionsAssignedToConsumer` to `partitionsAssignedToAdmin`
### Why are the changes needed?
`KafkaOffsetReaderAdmin` doesn't need `UninterruptibleThread` usage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests + manually with simple Kafka to Kafka query.
Closes#30668 from gaborgsomogyi/SPARK-32910.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR fixes a build error of `OracleIntegrationSuite` with Scala 2.13.
### Why are the changes needed?
Build should pass with Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that the build pass with the following command.
```
$ build/sbt -Pdocker-integration-tests -Pscala-2.13 "docker-integration-tests/test:compile"
```
Closes#30660 from sarutak/fix-docker-integration-tests-for-scala-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`.
### Why are the changes needed?
To fix a string interpolation bug.
### Does this PR introduce _any_ user-facing change?
Yes, the string will be correctly constructed.
### How was this patch tested?
Existing tests since they were used in exception/log messages.
Closes#30609 from imback82/fix_cache_str_interporlation.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.
### Why are the changes needed?
Start to prepare Apache Spark 3.2.0.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Pass the CIs.
Closes#30606 from dongjoon-hyun/SPARK-3.2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR extends the connection timeout to the DB server for DB2IntegrationSuite and its variants.
The container image ibmcom/db2 creates a database when it starts up.
The database creation can take over 2 minutes.
DB2IntegrationSuite and its variants use the container image but the connection timeout is set to 2 minutes so these suites almost always fail.
### Why are the changes needed?
To pass those suites.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed the suites pass with the following commands.
```
$ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.DB2IntegrationSuite"
$ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.v2.DB2IntegrationSuite"
$ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite"
Closes#30583 from sarutak/extend-timeout-for-db2.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR add an option to keep container after DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, PostgresIntegrationSuite) finish.
By setting a system property `spark.test.docker.keepContainer` to `true`, we can use this option.
### Why are the changes needed?
If some error occur during the tests, it would be useful to keep the container for debug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that the container is kept after the test by the following commands.
```
# With sbt
$ build/sbt -Dspark.test.docker.keepContainer=true -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
# With Maven
$ build/mvn -Dspark.test.docker.keepContainer=true -Pdocker-integration-tests -Phive -Phive-thriftserver -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite test
$ docker container ls
```
I also confirmed that there are no regression for all the subclasses of `DockerJDBCIntegrationSuite` with sbt/Maven.
* MariaDBKrbIntegrationSuite
* DB2KrbIntegrationSuite
* PostgresKrbIntegrationSuite
* MySQLIntegrationSuite
* PostgresIntegrationSuite
* DB2IntegrationSuite
* MsSqlServerintegrationsuite
* OracleIntegrationSuite
* v2.MySQLIntegrationSuite
* v2.PostgresIntegrationSuite
* v2.DB2IntegrationSuite
* v2.MsSqlServerIntegrationSuite
* v2.OracleIntegrationSuite
NOTE: `DB2IntegrationSuite`, `v2.DB2IntegrationSuite` and `DB2KrbIntegrationSuite` can fail due to the too much short connection timeout. It's a separate issue and I'll fix it in #30583Closes#30601 from sarutak/keepContainer.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add namespaces support in JDBC v2 Table Catalog by making ```JDBCTableCatalog``` extends```SupportsNamespaces```
### Why are the changes needed?
make v2 JDBC implementation complete
### Does this PR introduce _any_ user-facing change?
Yes. Add the following to ```JDBCTableCatalog```
- listNamespaces
- listNamespaces(String[] namespace)
- namespaceExists(String[] namespace)
- loadNamespaceMetadata(String[] namespace)
- createNamespace
- alterNamespace
- dropNamespace
### How was this patch tested?
Add new docker tests
Closes#30473 from huaxingao/name_space.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
remove USING _ in CREATE TABLE in JDBCTableCatalog docker tests
### Why are the changes needed?
Previously CREATE TABLE syntax forces users to specify a provider so we have to add a USING _ . Now the problem was fix and we need to remove it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#30599 from huaxingao/remove_USING.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This reverts commit SPARK-33212 (cb3fa6c936) mostly with three exceptions:
1. `SparkSubmitUtils` was updated recently by SPARK-33580
2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency.
3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471.
### Why are the changes needed?
According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following.
**1. Spark distribution with `-Phadoop-cloud`**
```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT
/_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]|
| Ben| red| []|
+------+--------------+----------------+
scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
```
**2. Spark distribution without `-Phadoop-cloud`**
```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI.
Closes#30508 from dongjoon-hyun/SPARK-33212-REVERT.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Deprecated `KafkaConsumer.poll(long)` API calls may cause infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to reach the newly added functionality. The Structured Streaming migration guide contains more information what migration consideration must be done. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.
The PR contains the following changes:
* Added `AdminClient` based offset fetching
* GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* GroupId override feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* Additional unit tests
* Code comment changes
* Minor bugfixes here and there
* Removed Kafka auto topic creation feature but only in `AdminClient` based approach (please see doc for rationale). In short, it's super hidden, not sure anybody ever used in production + error prone.
* Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`
### Why are the changes needed?
Driver may hang forever.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + additional unit tests.
Cluster test with simple Kafka topic to another topic query.
Documentation:
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.
Closes#29729 from gaborgsomogyi/SPARK-32032.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR changes mariadb_docker_entrypoint.sh to set the proper version automatically for mariadb-plugin-gssapi-server.
The proper version is based on the one of mariadb-server.
Also, this PR enables to use arbitrary docker image by setting the environment variable `MARIADB_CONTAINER_IMAGE_NAME`.
### Why are the changes needed?
For `MariaDBKrbIntegrationSuite`, the version of `mariadb-plugin-gssapi-server` is currently set to `10.5.5` in `mariadb_docker_entrypoint.sh` but it's no longer available in the official apt repository and `MariaDBKrbIntegrationSuite` doesn't pass for now.
It seems that only the most recent three versions are available for each major version and they are `10.5.6`, `10.5.7` and `10.5.8` for now.
Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months) so I don't think it's a good idea to set to an specific version for `mariadb-plugin-gssapi-server`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Confirmed that `MariaDBKrbIntegrationSuite` passes with the following commands.
```
$ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
```
In this case, we can see what version of `mariadb-plugin-gssapi-server` is going to be installed in the following container log message.
```
Installing mariadb-plugin-gssapi-server=1:10.5.8+maria~focal
```
Or, we can set MARIADB_CONTAINER_IMAGE_NAME for a specific version of MariaDB.
```
$ MARIADB_DOCKER_IMAGE_NAME=mariadb:10.5.6 build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
```
```
Installing mariadb-plugin-gssapi-server=1:10.5.6+maria~focal
```
Closes#30515 from sarutak/fix-MariaDBKrbIntegrationSuite.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
change
```
./build/sbt -Pdocker-integration-tests "testOnly *xxxIntegrationSuite"
```
to
```
./build/sbt -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.xxxIntegrationSuite"
```
### Why are the changes needed?
We only want to start v1 ```xxxIntegrationSuite```, not the newly added```v2.xxxIntegrationSuite```.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually checked
Closes#30448 from huaxingao/dockertest.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pr add a new Scala compile arg to `pom.xml` to defense against new unused imports:
- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13
The other fIles change are remove all unused imports in Spark code
### Why are the changes needed?
Cleanup code and add guarantee to defense against new unused imports
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30351 from LuciferYang/remove-imports-core-module.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In [SPARK-33139] we defined `setActionSession` and `clearActiveSession` as deprecated API, it turns out it is widely used, and after discussion, even if without this PR, it should work with unify view feature, it might only be a risk if user really abuse using these two API. So revert the PR is needed.
[SPARK-33139] has two commit, include a follow up. Revert them both.
### Why are the changes needed?
Revert.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#30367 from leanken/leanken-revert-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules: graphx, external, and examples.
Split per holdenk https://github.com/apache/spark/pull/30323#issuecomment-725159710
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No testing was performed
Closes#30326 from jsoref/spelling-graphx.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Currently in JDBCTableCatalog, we ignore the table options when creating table.
```
// TODO (SPARK-32405): Apply table options while creating tables in JDBC Table Catalog
if (!properties.isEmpty) {
logWarning("Cannot create JDBC table with properties, these properties will be " +
"ignored: " + properties.asScala.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]"))
}
```
### Why are the changes needed?
need to apply the table options when we create table
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
add new test
Closes#30154 from huaxingao/table_options.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```
this pr is the first part to resolve SPARK-33352:
- For constructors method definition add `=` to convert to function syntax
- For without `return type` methods definition add `: Unit =` to convert to function syntax
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13 and this change should be compatible with Scala 2.12
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Override the default SQL strings for:
ALTER TABLE RENAME COLUMN
ALTER TABLE UPDATE COLUMN NULLABILITY
in the following MsSQLServer JDBC dialect according to official documentation.
Write MsSqlServer integration tests for JDBC.
### Why are the changes needed?
To add the support for alter table when interacting with MSSql Server.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
added tests
Closes#30038 from ScrapCodes/mssql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This change is to support user provided nullable Avro schema for data with non-nullable catalyst schema in Avro writing.
Without this change, when users try to use a nullable Avro schema to write data with a non-nullable catalyst schema, it will throw an `IncompatibleSchemaException` with a message like `Cannot convert Catalyst type StringType to Avro type ["null","string"]`. With this change it will assume that the data is non-nullable, log a warning message for the nullability difference and serialize the data to Avro format with the nullable Avro schema provided.
### Why are the changes needed?
This change is needed because sometimes our users do not have full control over the nullability of the Avro schemas they use, and this change provides them with the flexibility.
### Does this PR introduce _any_ user-facing change?
Yes. Users are allowed to use nullable Avro schemas for data with non-nullable catalyst schemas in Avro writing after the change.
### How was this patch tested?
Added unit tests.
Closes#30224 from bozhang2820/avro-nullable.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR adds a check to RowReader#hasNextRow such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow will avoid consuming more than 1 record.
This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row).
### Why are the changes needed?
SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row.
For example (which assumes V1 Avro reader):
```scala
val df = spark.range(0, 25).toDF("index")
df.write.mode("overwrite").format("avro").save("index_avro")
val loaded = spark.read.format("avro").load("index_avro")
// The following will give the expected size
loaded.collect.size
// The following will give the wrong size
loaded.orderBy("index").collect.size
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests, which fail without the fix.
Closes#30221 from bersprockets/avro_iterator_play.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR renames some part of `Seq` in `PostgresIntegrationSuite` to `scala.collection.Seq`.
When I run `docker-integration-test`, I noticed that `PostgresIntegrationSuite` failed due to `ClassCastException`.
The reason is the same as what is resolved in SPARK-29292.
### Why are the changes needed?
To pass `docker-integration-test` for Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran `PostgresIntegrationSuite` fixed and confirmed it successfully finished.
Closes#30166 from sarutak/fix-toseq-postgresql.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support rename column for mysql dialect.
### Why are the changes needed?
At the moment, it does not work for mysql version 5.x. So, we should throw proper exception for that case.
### Does this PR introduce _any_ user-facing change?
Yes, `column rename` with mysql dialect should work correctly.
### How was this patch tested?
Added tests for rename column.
Ran the tests to pass with both versions of mysql.
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:5.7.31`
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:8.0`
Closes#30142 from ScrapCodes/mysql-dialect-rename.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Override the default SQL strings in Postgres Dialect for:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
Add new docker integration test suite `jdbc/v2/PostgreSQLIntegrationSuite.scala`
### Why are the changes needed?
supports Postgres specific ALTER TABLE syntax.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test `PostgreSQLIntegrationSuite`
Closes#30089 from huaxingao/postgres_docker.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Override the default SQL strings for:
ALTER TABLE UPDATE COLUMN TYPE
ALTER TABLE UPDATE COLUMN NULLABILITY
in the following MySQL JDBC dialect according to official documentation.
Write MySQL integration tests for JDBC.
### Why are the changes needed?
Improved code coverage and support mysql dialect for jdbc.
### Does this PR introduce _any_ user-facing change?
Yes, Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
### How was this patch tested?
Added tests.
Closes#30025 from ScrapCodes/mysql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.
In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:
```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```
which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side affect from this is we'll import the same dependency multiple times. For this I have to disable Maven enforcer `banDuplicatePomDependencyVersions`.
Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).
### Why are the changes needed?
This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. Latest Hadoop versions have upgraded to use Guava 27+ and in order to adopt the latest Hadoop versions in Spark, we'll need to resolve the Guava conflicts. This takes the approach by switching to shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.
### Does this PR introduce _any_ user-facing change?
When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.
### How was this patch tested?
Relying on existing tests.
Closes#29843 from sunchao/SPARK-29250.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs.
3. Mix `CommonFileDataSourceSuite` to `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, CSVSuite` and to `ParquetFileFormatSuite`.
4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`.
### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suites.
Closes#30067 from MaxGekk/ds-options-common-test.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure user can't pollute the SQLConf and SparkSession Context via calling setActiveSession and clearActiveSession.
Change of the PR:
* add legacy config spark.sql.legacy.allowModifyActiveSession to fallback to old behavior if user do need to call these two API.
* by default, if user call these two API, it will throw exception
* add extra two internal and private API setActiveSessionInternal and clearActiveSessionInternal for current internal usage
* change all internal reference to new internal API except for SQLContext.setActive and SQLContext.clearActive
### Why are the changes needed?
Make SQLConf.get reliable and stable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test
Closes#30042 from leanken/leanken-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs:
- spark.sql.legacy.avro.datetimeRebaseModeInWrite
- spark.sql.legacy.parquet.datetimeRebaseModeInWrite
This is a follow up https://github.com/apache/spark/pull/28137.
### Why are the changes needed?
1. To improve test coverage
2. To make sure that the metadata key is actually saved to Avro/Parquet files
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the added tests:
```
$ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite"
```
Closes#30061 from MaxGekk/parquet-test-metakey.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
SPARK-32926 added a build test to GitHub Action for Scala 2.13 but it's only with Maven.
As SPARK-32873 reported, some compilation error happens only with SBT so I think we need to add another build test to GitHub Action for SBT.
Unfortunately, we don't have abundant resources for GitHub Actions so instead of just adding the new SBT job, let's replace the existing Maven job with the new SBT job for Scala 2.13.
### Why are the changes needed?
To ensure build test passes even with SBT for Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GitHub Actions' job.
Closes#29958 from sarutak/add-sbt-job-for-scala-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
test-only - > testOnly in docs across the project.
### Why are the changes needed?
Since the sbt version is updated, the older way or running i.e. `test-only` is no longer valid.
### Does this PR introduce _any_ user-facing change?
docs update.
### How was this patch tested?
Manually.
Closes#30028 from ScrapCodes/fix-build/sbt-sample.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
- Override the default SQL strings in the DB2 Dialect for:
* ALTER TABLE UPDATE COLUMN TYPE
* ALTER TABLE UPDATE COLUMN NULLABILITY
- Add new docker integration test suite jdbc/v2/DB2IntegrationSuite.scala
### Why are the changes needed?
In SPARK-24907, we implemented JDBC v2 Table Catalog but it doesn't support some ALTER TABLE at the moment. This PR supports DB2 specific ALTER TABLE.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By running new integration test suite:
$ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"
Closes#29972 from huaxingao/db2_docker.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In `AvroUtils`'s `inferSchema()`, propagate Hadoop config from DS options to underlying HDFS file system.
### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("avro").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
unit test added
Closes#29971 from yuningzh-db/avro_options.
Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Override the default SQL strings in the Oracle Dialect for:
- ALTER TABLE ADD COLUMN
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
2. Add new docker integration test suite `jdbc/v2/OracleIntegrationSuite.scala`
### Why are the changes needed?
In SPARK-24907, we implemented JDBC v2 Table Catalog but it doesn't support some `ALTER TABLE` at the moment. This PR supports Oracle specific `ALTER TABLE`.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By running new integration test suite:
```
$ ./build/sbt -Pdocker-integration-tests "test-only *.OracleIntegrationSuite"
```
Closes#29912 from MaxGekk/jdbcv2-oracle-alter-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to update database versions below for integration tests;
- ibmcom/db2:11.5.0.0a => ibmcom/db2:11.5.4.0 in `DB2[Krb]IntegrationSuite`
- mysql:5.7.28 => mysql:5.7.31 in `MySQLIntegrationSuite`
- postgres:12.0 => postgres:13.0 in `Postgres[Krb]IntegrationSuite`
- mariadb:10.4 => mariadb:10.5 in `MariaDBKrbIntegrationSuite`
Also, this added environmental variables so that we can test with any database version and all the variables are as follows (see documents in the code for how to use all the variables);
- DB2_DOCKER_IMAGE_NAME
- MSSQLSERVER_DOCKER_IMAGE_NAME
- MYSQL_DOCKER_IMAGE_NAME
- POSTGRES_DOCKER_IMAGE_NAME
### Why are the changes needed?
To improve tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked.
Closes#29932 from maropu/UpdateIntegrationTests.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
At the moment only the baked in JDBC connection providers can be used but there is a need to support additional databases and use-cases. In this PR I'm proposing a new developer API name `JdbcConnectionProvider`. To show how an external JDBC connection provider can be implemented I've created an example [here](https://github.com/gaborgsomogyi/spark-jdbc-connection-provider).
The PR contains the following changes:
* Added connection provider developer API
* Made JDBC connection providers constructor to noarg => needed to load them w/ service loader
* Connection providers are now loaded w/ service loader
* Added tests to load providers independently
* Moved `SecurityConfigurationLock` into a central place because other areas will change global JVM security config
### Why are the changes needed?
No custom authentication possibility.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
* Existing + additional unit tests
* Docker integration tests
* Tested manually the newly created external JDBC connection provider
Closes#29024 from gaborgsomogyi/SPARK-32001.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr fix all 14 failed cases in `external/avro` module in Scala 2.13, the main change of this pr as follow:
- Manual call `toSeq` in `AvroDeserializer#newWriter` and `SchemaConverters#toSqlTypeHelper` method because the object type for case match is `ArrayBuffer` not `Seq` in Scala 2.13
- Specified `Seq` to `s.c.Seq` when we call `Row.get(i).asInstanceOf[Seq]` because the data maybe `mutable.ArraySeq` but `Seq` is `immutable.Seq` in Scala 2.13
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass 2.13 Build GitHub Action and do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl external/avro -Pscala-2.13 -am
mvn clean test -pl external/avro -Pscala-2.13
```
**Before**
```
Tests: succeeded 197, failed 14, canceled 0, ignored 2, pending 0
*** 14 TESTS FAILED ***
```
**After**
```
Tests: succeeded 211, failed 0, canceled 0, ignored 2, pending 0
All tests passed.
```
Closes#29801 from LuciferYang/fix-external-avro-213.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fix code which causes error when build with sbt and Scala 2.13 like as follows.
```
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:251: method with a single empty parameter list overrides method without any parameter list
[error] [warn] override def hasNext(): Boolean = requestOffset < part.untilOffset
[error] [warn]
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:294: method with a single empty parameter list overrides method without any parameter list
[error] [warn] override def hasNext(): Boolean = okNext
```
More specifically, what this PR fixes are
* Methods which has an empty parameter list and overrides an method which has no parameter list.
```
override def hasNext(): Boolean = okNext
```
* Methods which has no parameter list and overrides an method which has an empty parameter list.
```
override def next: (Int, Double) = {
```
* Infix operator expression that the operator wraps.
```
3L * math.min(k, numFeatures) * math.min(k, numFeatures)
3L * math.min(k, numFeatures) * math.min(k, numFeatures) +
+ math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures)
math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures) *
* math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
```
### Why are the changes needed?
For building Spark with sbt and Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
After this change and #29742 applied, compile passed with the following command.
```
build/sbt -Pscala-2.13 -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile test:compile
```
Closes#29745 from sarutak/fix-code-for-sbt-and-spark-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR renames `SupportsStreamingUpdate` to `SupportsStreamingUpdateAsAppend` as the new interface name represents the actual behavior clearer. This PR also removes the `update()` method (so the interface is more likely a marker), as the implementations of `SupportsStreamingUpdateAsAppend` should support append mode by default, hence no need to trigger some flag on it.
### Why are the changes needed?
SupportsStreamingUpdate was intended to revive the functionality of Streaming update output mode for internal data sources, but despite the name, that interface isn't really used to do actual update on sink; all sinks are implementing this interface to do append, so strictly saying, it's just to support update as append. Renaming the interface would make it clear.
### Does this PR introduce _any_ user-facing change?
No, as the class is only for internal data sources.
### How was this patch tested?
Jenkins test will follow.
Closes#29693 from HeartSaVioR/SPARK-32831.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
When CSV/JSON datasources infer schema (e.g, `def inferSchema(files: Seq[FileStatus])`, they use the `files` along with the original options. `files` in `inferSchema` could have been deduced from the "path" option if the option was present, so this can cause issues (e.g., reading more data, listing the path again) since the "path" option is **added** to the `files`.
### Why are the changes needed?
The current behavior can cause the following issue:
```scala
class TestFileFilter extends PathFilter {
override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}
val path = "/tmp"
val df = spark.range(2)
df.write.json(path + "/p=1")
df.write.json(path + "/p=2")
val extraOptions = Map(
"mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
"mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
)
// This works fine.
assert(spark.read.options(extraOptions).json(path).count == 2)
// The following with "path" option fails with the following:
// assertion failed: Conflicting directory structures detected. Suspicious paths
// file:/tmp
// file:/tmp/p=1
assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2)
```
### Does this PR introduce _any_ user-facing change?
Yes, the above failure doesn't happen and you get the consistent experience when you use `spark.read.csv(path)` or `spark.read.format("csv").option("path", path).load`.
### How was this patch tested?
Updated existing tests.
Closes#29437 from imback82/path_bug.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up PR of #29192 that adds integration tests for character arrays in `PostgresIntegrationSuite`.
### Why are the changes needed?
For better test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add tests.
Closes#29397 from maropu/SPARK-32576-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR intends to add tests to check if all the character types in PostgreSQL supported.
The document for character types in PostgreSQL: https://www.postgresql.org/docs/current/datatype-character.htmlCloses#29192.
### Why are the changes needed?
For better test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add tests.
Closes#29394 from maropu/pr29192.
Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: kujon <jakub.korzeniowski@vortexa.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to update Kafka client library to 2.6.0 for Apache Spark 3.1.0.
### Why are the changes needed?
This will bring client-side bug fixes like KAFKA-10134 and KAFKA-10223.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the existing tests.
Closes#29386 from dongjoon-hyun/SPARK-32568.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch is a follow-up to fill the gap in #29272 which missed to also provide `default.api.timeout.ms` as well. #29272 unintentionally changed the behavior on Kafka side timeout which is incompatible with the test timeout. (`default.api.timeout.ms` gets default value which is 60 seconds, longer than test timeout.)
### Why are the changes needed?
We realized the PR for SPARK-32468 (#29272) doesn't work as we expect. See https://github.com/apache/spark/pull/29272#issuecomment-668333483 for more details.
### Does this PR introduce _any_ user-facing change?
No, as it only touches the tests.
### How was this patch tested?
Will trigger builds from Jenkins or Github Action multiple time and confirm.
Closes#29343 from HeartSaVioR/SPARK-32468-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
While I'm implementing SPARK-32032 I've found a bug in Kafka: https://issues.apache.org/jira/browse/KAFKA-10318. This will cause issues only later when it's fixed but it would be good to fix it now because SPARK-32032 would like to bring in `AdminClient` where the code blows up with the mentioned `ConfigException`. This would reduce the code changes in the mentioned jira. In this PR I've changed `default.api.timeout.ms` to `request.timeout.ms` which fulfils this condition.
### Why are the changes needed?
Solve later problems and reduce SPARK-32032 PR size.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#29272 from gaborgsomogyi/SPARK-32468.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Structured Streaming Kafka connector tests are now using a deprecated `poll(long)` API which could cause infinite wait. In this PR I've eliminated these calls and replaced them with `AdminClient`.
### Why are the changes needed?
Deprecated `poll(long)` API calls.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#29289 from gaborgsomogyi/SPARK-32482.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
When `spark.sql.caseSensitive` is `false` (by default), check that there are not duplicate column names on the same level (top level or nested levels) in reading from in-built datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:
```
### Why are the changes needed?
To make handling of duplicate nested columns is similar to handling of duplicate top-level columns i. e. output the same error when `spark.sql.caseSensitive` is `false`:
```Scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`
```
Checking of top-level duplicates was introduced by https://github.com/apache/spark/pull/17758.
### Does this PR introduce _any_ user-facing change?
Yes. For the example from SPARK-32431:
ORC:
```scala
java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3
at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
```
JSON:
```scala
+------------+
|StructColumn|
+------------+
| [,,]|
+------------+
```
Parquet:
```scala
+------------+
|StructColumn|
+------------+
| [0,, 1]|
+------------+
```
Avro:
```scala
+------------+
|StructColumn|
+------------+
| [,,]|
+------------+
```
After the changes, Parquet, ORC, JSON and Avro output the same error:
```scala
Found duplicate column(s) in the data schema: `camelcase`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`;
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112)
at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51)
at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67)
```
### How was this patch tested?
Run modified test suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*"
```
and added new UT to `SchemaUtilsSuite`.
Closes#29234 from MaxGekk/nested-case-insensitive-column.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to support pushed down filters in Avro datasource V1 and V2.
1. Added new SQL config `spark.sql.avro.filterPushdown.enabled` to control filters pushdown to Avro datasource. It is on by default.
2. Renamed `CSVFilters` to `OrderedFilters`.
3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2)
4. Modified `AvroDeserializer` to return None from the `deserialize` method when pushdown filters return `false`.
### Why are the changes needed?
The changes improve performance on synthetic benchmarks up to **2** times on JDK 11:
```
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters 9614 9669 54 0.1 9614.1 1.0X
pushdown disabled 10077 10141 66 0.1 10077.2 1.0X
w/ filters 4681 4713 29 0.2 4681.5 2.1X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- Added UT to `AvroCatalystDataConversionSuite` and `AvroSuite`
- Re-running `AvroReadBenchmark` using Amazon EC2:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|
and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
print("Run benchmark: %s" % b[1])
run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes#29145 from MaxGekk/avro-filters-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
`UninterruptibleThread` running functionality is baked into `KafkaOffsetReader` which can be extracted into a class. The main intention is to simplify `KafkaOffsetReader` in order to make easier to solve SPARK-32032. In this PR I've made this extraction without functionality change.
### Why are the changes needed?
`UninterruptibleThread` running functionality is baked into `KafkaOffsetReader`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + additional unit tests.
Closes#29187 from gaborgsomogyi/SPARK-32387.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>