ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dongjoon Hyun	de9818f043	[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT ### What changes were proposed in this pull request? This PR aims to update `master` branch version to 3.2.0-SNAPSHOT. ### Why are the changes needed? Start to prepare Apache Spark 3.2.0. ### Does this PR introduce _any_ user-facing change? N/A. ### How was this patch tested? Pass the CIs. Closes #30606 from dongjoon-hyun/SPARK-3.2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 14:10:42 -08:00
yangjie01	e3058ba17c	[SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports ### What changes were proposed in this pull request? This PR adds a new Scala compiler argument to `pom.xml` to guard against new unused imports: - `-Ywarn-unused-import` for Scala 2.12 - `-Wconf:cat=unused-imports:e` for Scala 2.13 The other file changes remove all unused imports from the Spark code. ### Why are the changes needed? To clean up the code and guard against new unused imports. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30351 from LuciferYang/remove-imports-core-module. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-19 14:20:39 +09:00
yangjie01	02fd52cfbc	[SPARK-33352][CORE][SQL][SS][MLLIB][AVRO][K8S] Fix procedure-like declaration compilation warnings in Scala 2.13 ### What changes were proposed in this pull request? There are two similar compilation warnings about procedure-like declaration in Scala 2.13: ``` [WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition ``` and ``` [WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type ``` This PR is the first part of resolving SPARK-33352: - For constructor definitions, add `=` to convert to function syntax - For method definitions without a return type, add `: Unit =` to convert to function syntax ### Why are the changes needed? To eliminate compilation warnings in Scala 2.13; this change should be compatible with Scala 2.12. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-11-08 12:51:48 -06:00
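As an illustration of the two conversions described in SPARK-33352 above, here is a minimal sketch. The `Counter` class is hypothetical and is not taken from the Spark code base; the deprecated forms are shown only in comments so that the snippet compiles without warnings.

```scala
// Hypothetical class, for illustration only (not Spark code).
class Counter(val start: Int) {
  private var count: Int = start

  // Deprecated procedure-like forms that trigger the Scala 2.13 warnings:
  //   def this() { this(0) }
  //   def increment() { count += 1 }

  // Constructor: add `=` to convert to function syntax.
  def this() = { this(0) }

  // Method without a declared return type: add `: Unit =`.
  def increment(): Unit = { count += 1 }

  def current: Int = count
}
```

Both forms should also compile under Scala 2.12, which matches the compatibility claim in the commit message.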
Bo Zhang	551b504cfe	[SPARK-33316][SQL] Support user provided nullable Avro schema for non-nullable catalyst schema in Avro writing ### What changes were proposed in this pull request? This change is to support user provided nullable Avro schema for data with non-nullable catalyst schema in Avro writing. Without this change, when users try to use a nullable Avro schema to write data with a non-nullable catalyst schema, it will throw an `IncompatibleSchemaException` with a message like `Cannot convert Catalyst type StringType to Avro type ["null","string"]`. With this change it will assume that the data is non-nullable, log a warning message for the nullability difference and serialize the data to Avro format with the nullable Avro schema provided. ### Why are the changes needed? This change is needed because sometimes our users do not have full control over the nullability of the Avro schemas they use, and this change provides them with the flexibility. ### Does this PR introduce _any_ user-facing change? Yes. Users are allowed to use nullable Avro schemas for data with non-nullable catalyst schemas in Avro writing after the change. ### How was this patch tested? Added unit tests. Closes #30224 from bozhang2820/avro-nullable. Authored-by: Bo Zhang <bo.zhang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-11-05 12:27:20 +08:00
Bruce Robbins	7e8eb0447b	[SPARK-33314][SQL] Avoid dropping rows in Avro reader ### What changes were proposed in this pull request? This PR adds a check to RowReader#hasNextRow such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow will avoid consuming more than 1 record. This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row). ### Why are the changes needed? SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row. For example (which assumes V1 Avro reader): ```scala val df = spark.range(0, 25).toDF("index") df.write.mode("overwrite").format("avro").save("index_avro") val loaded = spark.read.format("avro").load("index_avro") // The following will give the expected size loaded.collect.size // The following will give the wrong size loaded.orderBy("index").collect.size ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added tests, which fail without the fix. Closes #30221 from bersprockets/avro_iterator_play. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-05 11:50:11 +09:00
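The contract that SPARK-33314 above describes (repeated `hasNextRow` calls consume at most one record, and consecutive `nextRow` calls return new rows) can be sketched as follows. `BufferedRowReader` is a hypothetical stand-in written for this illustration; it is not the actual `AvroUtils#RowReader` trait.

```scala
// Hypothetical stand-in for an iterator-like row reader (not Spark's RowReader).
final class BufferedRowReader[T](source: Iterator[T]) {
  // A row that has been read from `source` but not yet handed to the caller.
  private var buffered: Option[T] = None

  def hasNextRow: Boolean = {
    // Read from the source only when nothing is buffered, so calling
    // hasNextRow several times in a row never drops a record.
    if (buffered.isEmpty && source.hasNext) {
      buffered = Some(source.next())
    }
    buffered.isDefined
  }

  def nextRow(): T = {
    if (!hasNextRow) throw new NoSuchElementException("no more rows")
    val row = buffered.get
    buffered = None // clear the buffer so the next call returns a new row
    row
  }
}
```

With this shape, `hasNextRow; hasNextRow; nextRow()` yields the first row, which is the call pattern that lost a row before the fix.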
Max Gekk	26b13c70c3	[SPARK-33169][SQL][TESTS] Check propagation of datasource options to underlying file system for built-in file-based datasources ### What changes were proposed in this pull request? 1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources. 2. Add a test to `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs. 3. Mix `CommonFileDataSourceSuite` into `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, `CSVSuite` and `ParquetFileFormatSuite`. 4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`. ### Why are the changes needed? To improve test coverage and test all built-in file-based datasources. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suites. Closes #30067 from MaxGekk/ds-options-common-test. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-19 17:47:49 +09:00
Max Gekk	38c05af1d5	[SPARK-33163][SQL][TESTS] Check the metadata key 'org.apache.spark.legacyDateTime' in Avro/Parquet files ### What changes were proposed in this pull request? Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs: - spark.sql.legacy.avro.datetimeRebaseModeInWrite - spark.sql.legacy.parquet.datetimeRebaseModeInWrite This is a follow up https://github.com/apache/spark/pull/28137. ### Why are the changes needed? 1. To improve test coverage 2. To make sure that the metadata key is actually saved to Avro/Parquet files ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the added tests: ``` $ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite" ``` Closes #30061 from MaxGekk/parquet-test-metakey. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-16 10:28:15 +09:00
Yuning Zhang	bbc887bf73	[SPARK-33089][SQL] make avro format propagate Hadoop config from DS options to underlying HDFS file system ### What changes were proposed in this pull request? In `AvroUtils`'s `inferSchema()`, propagate Hadoop config from DS options to underlying HDFS file system. ### Why are the changes needed? There is a bug that when running: ```scala spark.read.format("avro").options(conf).load(path) ``` The underlying file system will not receive the `conf` options. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? unit test added Closes #29971 from yuningzh-db/avro_options. Authored-by: Yuning Zhang <yuning.zhang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 12:18:06 +09:00
yangjie01	664a1719de	[SPARK-32936][SQL] Pass all `external/avro` module UTs in Scala 2.13 ### What changes were proposed in this pull request? This PR fixes all 14 failed cases in the `external/avro` module under Scala 2.13. The main changes are as follows: - Manually call `toSeq` in the `AvroDeserializer#newWriter` and `SchemaConverters#toSqlTypeHelper` methods because the object type in the case match is `ArrayBuffer`, not `Seq`, in Scala 2.13 - Specify `Seq` as `s.c.Seq` when calling `Row.get(i).asInstanceOf[Seq]` because the data may be a `mutable.ArraySeq` while `Seq` is `immutable.Seq` in Scala 2.13 ### Why are the changes needed? We need to support a Scala 2.13 build. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: Pass 2.13 Build GitHub Action and do the following: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl external/avro -Pscala-2.13 -am mvn clean test -pl external/avro -Pscala-2.13 ``` Before ``` Tests: succeeded 197, failed 14, canceled 0, ignored 2, pending 0 * 14 TESTS FAILED * ``` After ``` Tests: succeeded 211, failed 0, canceled 0, ignored 2, pending 0 All tests passed. ``` Closes #29801 from LuciferYang/fix-external-avro-213. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-18 22:24:33 +09:00
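The two kinds of change in SPARK-32936 above come from `Seq` aliasing `scala.collection.immutable.Seq` in Scala 2.13, where it aliased `scala.collection.Seq` in 2.12. A minimal sketch of both patterns follows; `SeqCompat` and its methods are hypothetical helpers and are not part of the Avro module.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helpers, for illustration only.
object SeqCompat {
  // Matching on scala.collection.Seq accepts mutable and immutable sequences
  // alike, on both Scala 2.12 and 2.13.
  def lengthOf(value: Any): Int = value match {
    case s: scala.collection.Seq[_] => s.length
    case _ => -1
  }

  // An ArrayBuffer is not an immutable.Seq, so where the default `Seq`
  // is expected under 2.13 an explicit toSeq is needed.
  def asDefaultSeq(buf: ArrayBuffer[Int]): Seq[Int] = buf.toSeq
}
```

Under 2.12 the `toSeq` call is effectively a no-op, so the same source should build on both versions.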
Gengliang Wang	de141a3271	[SPARK-32660][SQL][DOC] Show Avro related API in documentation ### What changes were proposed in this pull request? Currently, the Avro related APIs are missing in the documentation https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . This PR is to: 1. Mark internal Avro related classes as private 2. Show Avro related API in Spark official API documentation ### Why are the changes needed? Better documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build doc and preview: ![image](https://user-images.githubusercontent.com/1097932/90623042-d156ee00-e1ca-11ea-9edd-2c45b3001fd8.png) ![image](https://user-images.githubusercontent.com/1097932/90623047-d451de80-e1ca-11ea-94ba-02921b64d6f1.png) ![image](https://user-images.githubusercontent.com/1097932/90623058-d6b43880-e1ca-11ea-849a-b9ea9efe6527.png) Closes #29476 from gengliangwang/avroAPIDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-08-21 13:12:43 +08:00
Terry Kim	3d1dce75d9	[SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources ### What changes were proposed in this pull request? When CSV/JSON datasources infer schema (e.g, `def inferSchema(files: Seq[FileStatus])`, they use the `files` along with the original options. `files` in `inferSchema` could have been deduced from the "path" option if the option was present, so this can cause issues (e.g., reading more data, listing the path again) since the "path" option is added to the `files`. ### Why are the changes needed? The current behavior can cause the following issue: ```scala class TestFileFilter extends PathFilter { override def accept(path: Path): Boolean = path.getParent.getName != "p=2" } val path = "/tmp" val df = spark.range(2) df.write.json(path + "/p=1") df.write.json(path + "/p=2") val extraOptions = Map( "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName, "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName ) // This works fine. assert(spark.read.options(extraOptions).json(path).count == 2) // The following with "path" option fails with the following: // assertion failed: Conflicting directory structures detected. Suspicious paths // file:/tmp // file:/tmp/p=1 assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2) ``` ### Does this PR introduce _any_ user-facing change? Yes, the above failure doesn't happen and you get the consistent experience when you use `spark.read.csv(path)` or `spark.read.format("csv").option("path", path).load`. ### How was this patch tested? Updated existing tests. Closes #29437 from imback82/path_bug. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-19 16:23:22 +00:00
Max Gekk	99a855575c	[SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources ### What changes were proposed in this pull request? When `spark.sql.caseSensitive` is `false` (by default), check that there are not duplicate column names on the same level (top level or nested levels) in reading from in-built datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: ``` ### Why are the changes needed? To make handling of duplicate nested columns is similar to handling of duplicate top-level columns i. e. output the same error when `spark.sql.caseSensitive` is `false`: ```Scala org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase` ``` Checking of top-level duplicates was introduced by https://github.com/apache/spark/pull/17758. ### Does this PR introduce _any_ user-facing change? Yes. For the example from SPARK-32431: ORC: ```scala java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) ... 
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3 at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) ``` JSON: ```scala +------------+ \|StructColumn\| +------------+ \| [,,]\| +------------+ ``` Parquet: ```scala +------------+ \|StructColumn\| +------------+ \| [0,, 1]\| +------------+ ``` Avro: ```scala +------------+ \|StructColumn\| +------------+ \| [,,]\| +------------+ ``` After the changes, Parquet, ORC, JSON and Avro output the same error: ```scala Found duplicate column(s) in the data schema: `camelcase`; org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67) ``` ### How was this patch tested? Run modified test suites: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*" ``` and added new UT to `SchemaUtilsSuite`. Closes #29234 from MaxGekk/nested-case-insensitive-column. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-30 06:05:55 +00:00
Max Gekk	d897825d2d	[SPARK-32346][SQL] Support filters pushdown in Avro datasource ### What changes were proposed in this pull request? In the PR, I propose to support pushed down filters in Avro datasource V1 and V2. 1. Added new SQL config `spark.sql.avro.filterPushdown.enabled` to control filters pushdown to Avro datasource. It is on by default. 2. Renamed `CSVFilters` to `OrderedFilters`. 3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2) 4. Modified `AvroDeserializer` to return None from the `deserialize` method when pushdown filters return `false`. ### Why are the changes needed? The changes improve performance on synthetic benchmarks up to 2 times on JDK 11: ``` OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ w/o filters 9614 9669 54 0.1 9614.1 1.0X pushdown disabled 10077 10141 66 0.1 10077.2 1.0X w/ filters 4681 4713 29 0.2 4681.5 2.1X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? 
- Added UT to `AvroCatalystDataConversionSuite` and `AvroSuite` - Re-running `AvroReadBenchmark` using Amazon EC2: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`\| and `./dev/run-benchmarks`: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #29145 from MaxGekk/avro-filters-pushdown. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-07-30 01:37:42 +08:00
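Item 4 of SPARK-32346 above says the deserializer returns `None` when the pushed-down filters evaluate to `false`. The sketch below shows that idea in isolation; `PushdownSketch`, the `Row` alias and the plain function predicates are assumptions made for this illustration and do not reflect the real `AvroDeserializer` or `OrderedFilters` signatures.

```scala
// Hypothetical model of "deserialize returns None when a pushed filter rejects the row".
object PushdownSketch {
  type Row = Map[String, Any]

  // A row is produced only if every pushed-down predicate accepts it.
  def deserialize(record: Row, filters: Seq[Row => Boolean]): Option[Row] =
    if (filters.forall(f => f(record))) Some(record) else None

  // Callers can drop rejected records with flatMap instead of a later filter step.
  def readAll(records: Seq[Row], filters: Seq[Row => Boolean]): Seq[Row] =
    records.flatMap(r => deserialize(r, filters))
}
```

Rejecting rows at this stage is what lets the scan skip later per-row work, which is consistent with the benchmark speed-up quoted in the commit message.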
Sean Owen	c28a6fa511	[SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation ### What changes were proposed in this pull request? Same as https://github.com/apache/spark/pull/29078 and https://github.com/apache/spark/pull/28971 . This makes the rest of the default modules (i.e. those you get without specifying `-Pyarn` etc) compile under Scala 2.13. It does not close the JIRA, as a result. this also of course does not demonstrate that tests pass yet in 2.13. Note, this does not fix the `repl` module; that's separate. ### Why are the changes needed? Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12) Closes #29111 from srowen/SPARK-29292.3. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-15 13:26:28 -07:00
Dilip Biswal	e4499932da	[SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node ### What changes were proposed in this pull request? Improve the EXPLAIN FORMATTED output of DSV2 Scan nodes (file based ones). Before ``` == Physical Plan == * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- BatchScan (1) (1) BatchScan Output [2]: [value#7, id#8] Arguments: [value#7, id#8], ParquetScan(org.apache.spark.sql.test.TestSparkSession17477bbb,Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml,org.apache.spark.sql.execution.datasources.InMemoryFileIndexa6c363ce,StructType(StructField(value,IntegerType,true)),StructType(StructField(value,IntegerType,true)),StructType(StructField(id,IntegerType,true)),[Lorg.apache.spark.sql.sources.Filter;40fee459,org.apache.spark.sql.util.CaseInsensitiveStringMapfeca1ec6,Vector(isnotnull(id#8), (id#8 > 1)),List(isnotnull(value#7), (value#7 > 2))) (2) ... (3) ... (4) ... ``` After ``` == Physical Plan == * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- BatchScan (1) (1) BatchScan Output [2]: [value#7, id#8] DataFilters: [isnotnull(value#7), (value#7 > 2)] Format: parquet Location: InMemoryFileIndex[....] PartitionFilters: [isnotnull(id#8), (id#8 > 1)] PushedFilers: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)] ReadSchema: struct<value:int> (2) ... (3) ... (4) ... ``` ### Why are the changes needed? The old format is not very readable. This improves the readability of the plan. ### Does this PR introduce any user-facing change? Yes. the explain output will be different. ### How was this patch tested? Added a test case in ExplainSuite. Closes #28425 from dilipbiswal/dkb_dsv2_explain. Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com> Co-authored-by: Dilip Biswal <dkbiswal@apache.org> Signed-off-by: Dilip Biswal <dkbiswal@apache.org>	2020-07-15 01:28:39 -07:00
Wenchen Fan	6edb20df83	[SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/28760 to fix the remaining issues: 1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand` 2. should consider data source options when inferring schema for file source 3. should consider data source options when getting the qualified path in file source v2. ### Why are the changes needed? We didn't catch these issues in https://github.com/apache/spark/pull/28760 because the test case only checks for an error when initializing the file system. If the file system is initialized multiple times during a simple read/write action, the test case actually only tests the first initialization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Rewrote the test to make sure the entire data source read/write action can succeed. Closes #28948 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-07-02 06:09:54 +08:00
Max Gekk	37a1fb8d08	[SPARK-31855][SQL][TESTS] Check reading date/timestamp from Avro files w/ and w/o Spark version ### What changes were proposed in this pull request? 1. Add the following Avro files to the resource folder `external/avro/src/test/resources`: - Files saved by Spark 2.4.5 (`cee4ecbb16`) without meta info `org.apache.spark.version` - `before_1582_date_v2_4_5.avro` with a date column: `avro.schema {"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"int","logicalType":"date"},"null"]}]}` - `before_1582_timestamp_millis_v2_4_5.avro` with a timestamp column: `avro.schema {"type":"record","name":"test","namespace":"logical","fields":[{"name":"dt","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null}]}` - `before_1582_timestamp_micros_v2_4_5.avro` with a timestamp column: `avro.schema {"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]}]}` - Files saved by Spark 2.4.6-rc3 (`570848da7c`) with the meta info `org.apache.spark.version 2.4.6`: - `before_1582_date_v2_4_6.avro` is similar to `before_1582_date_v2_4_5.avro` except for the Spark version in the Avro meta info. - `before_1582_timestamp_micros_v2_4_6.avro` is similar to `before_1582_timestamp_micros_v2_4_5.avro` except for the meta info. - `before_1582_timestamp_millis_v2_4_6.avro` is similar to `before_1582_timestamp_millis_v2_4_5.avro` except for the meta info. 2. Removed a few Avro files because they are replaced by the Avro files generated by Spark 2.4.5 above. 3. Add new test "generate test files for checking compatibility with Spark 2.4" to `AvroSuite` (marked as ignored). The Avro files above were generated by this test. 4. Modified "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in `AvroSuite` to use the new Avro files. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
By `AvroV1Suite` and `AvroV2Suite`. Closes #28664 from MaxGekk/avro-update-resource-files. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-29 05:18:37 +00:00
Wenchen Fan	fd2d55c991	[SPARK-31405][SQL] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files ### What changes were proposed in this pull request? When reading/writing datetime values that are before the rebase switch day from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly choose whether to rebase. ### Why are the changes needed? Rebasing and not rebasing produce different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception as it only affects ancient datetime values. ### Does this PR introduce _any_ user-facing change? Yes, users will now see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message asking them to set a config. ### How was this patch tested? Updated tests. Closes #28477 from cloud-fan/rebase. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-14 12:32:40 +09:00
Wenchen Fan	a5ebbacf53	[SPARK-31361][SQL] Rebase datetime in parquet/avro according to file metadata ### What changes were proposed in this pull request? This PR adds a new parquet/avro file metadata key: `org.apache.spark.legacyDatetime`. It indicates that the file was written with the "rebaseInWrite" config enabled, and that Spark needs to rebase when reading it. This lets Spark decide more smartly whether to rebase: 1. If we don't know which Spark version wrote the file, rebase if the "rebaseInRead" config is true. 2. If the file was written by Spark 2.4 or earlier, rebase. 3. If the file was written by Spark 3.0 or later, rebase if `org.apache.spark.legacyDatetime` exists in the file metadata. ### Why are the changes needed? It's very easy to end up with mixed-calendar parquet/avro files: e.g. a user upgrades to Spark 3.0 and writes some parquet files to an existing directory, then realizes that the directory contains legacy datetime values before 1582. By then it is too late, and the user has to find all the legacy files manually and read them separately. To support mixed-calendar parquet/avro files, we need to decide whether to rebase based on the file metadata. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated test Closes #28137 from cloud-fan/datetime. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-22 00:26:23 +09:00
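The three rules listed in SPARK-31361 above can be written as one small decision function. This is only a sketch: `RebaseDecision`, `shouldRebase` and its parameters are hypothetical names, the writer version is assumed to be a string such as "2.4.5" taken from the file metadata, and an unparsable version is treated as unknown.

```scala
// Hypothetical decision helper mirroring the three rules in the commit message.
object RebaseDecision {
  // Extracts the leading major version number, e.g. "2.4.5" -> Some(2).
  private def majorVersion(version: String): Option[Int] = {
    val digits = version.takeWhile(_.isDigit)
    if (digits.isEmpty) None else Some(digits.toInt)
  }

  def shouldRebase(
      writerVersion: Option[String], // Spark version recorded in the file, if any
      hasLegacyKey: Boolean,         // whether the legacy-datetime metadata key is present
      rebaseInRead: Boolean          // value of the "rebaseInRead" config
  ): Boolean =
    writerVersion.flatMap(majorVersion) match {
      case None                     => rebaseInRead // rule 1: writer unknown
      case Some(major) if major < 3 => true         // rule 2: Spark 2.4 and earlier
      case Some(_)                  => hasLegacyKey // rule 3: Spark 3.0 and later
    }
}
```

Deciding per file in this way is what allows a directory containing both legacy and new files to be read in a single pass.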
Max Gekk	e2d9399602	[SPARK-31359][SQL] Speed up timestamps rebasing ### What changes were proposed in this pull request? In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01: \| i \| local timestamp \| Proleptic Greg. seconds \| Hybrid (Julian+Greg) seconds \| difference in minutes\| \| -- \| ------- \|----\|----\| ---- \| \|0\|0001-01-01 00:00\|-62135568422\|-62135740800\|-2872\| \|1\|0100-03-01 00:00\|-59006333222\|-59006419200\|-1432\| \|...\|...\|...\|...\|...\| \|13\|1582-10-15 00:00\|-12219264422\|-12219264000\|7\| \|14\|1883-11-18 12:00\|-2717640000\|-2717640000\|0\| The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps: 1. Look at the table, and find the time interval where the micros falls to 2. Take the difference between 2 calendars for this time interval 3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation. Here are details of the implementation: - Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. 
The diffs and switch time points are stored as seconds, for example: ```json [ { "tz" : "America/Los_Angeles", "switches" : [ -62135740800, -59006419200, ... , -2717640000 ], "diffs" : [ 172378, 85978, ..., 0 ] } ] ``` The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also, the sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test). - The _Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_ time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385) - The rebase info from the JSON files is placed into hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. I also tried `java.util.HashMap`, but it has worse lookup time than `AnyRefMap` in our case. The hash maps store the switch time points and diffs in microsecond precision to avoid conversions from microseconds to seconds at runtime. - I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` to avoid polluting `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason. - I placed rebasing via local timestamp into separate methods that require the zone id as the first parameter, assuming that the caller already has the zone id. This avoids unnecessarily retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as a reference implementation. 
- Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first. If the hash maps don't contain rebasing info for the given time zone id, the methods fall back to the implementation via local timestamps. This makes it possible to support time zones specified as zone offsets like '-08:00'. ### Why are the changes needed? To make timestamps rebasing faster: - Saving timestamps to parquet files is ~x3.8 faster. - Loading timestamps from parquet files is ~x2.8 faster. - Loading timestamps with the vectorized reader is ~x4.6 faster. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks for each time zone's records that - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`. - switch points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function. - Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 in steps of 1 ± 0.5 months, and checks that the optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as the non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones. - Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. 
- The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`. - Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127): \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| Closes #28119 from MaxGekk/optimize-rebase-micros. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-09 05:23:52 +00:00
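The table lookup described in the commit above (closed as #28119) can be sketched as follows. This is a minimal illustration in Python rather than the Scala code in `RebaseDateTime`: the names `rebase_micros` and `rebase_with_fallback`, the `fallback` callable, and the tiny table values are assumptions made for the example, not data from the JSON files.

```python
def rebase_micros(switches, diffs, micros):
    """Shift `micros` by the diff of the latest switch point not after it.

    Assumes `switches` is sorted from old to recent and has the same
    length as `diffs` (the two conditions the validation test checks).
    """
    i = len(switches) - 1
    # Linear search from the most recent switch point backwards.
    while i > 0 and micros < switches[i]:
        i -= 1
    return micros + diffs[i]


def rebase_with_fallback(tables, tz, micros, fallback):
    """Use the per-zone table when present, otherwise call `fallback`."""
    record = tables.get(tz)
    if record is None:
        # For example, zone offsets such as '-08:00' have no table entry.
        return fallback(tz, micros)
    return rebase_micros(record["switches"], record["diffs"], micros)


# Hypothetical table; real tables hold microsecond values per time zone.
TABLES = {"Example/Zone": {"switches": [-100, -50, 0], "diffs": [30, 10, 0]}}
```

In this sketch a value older than the first switch point reuses the first diff, and a value at or after the last switch point receives the last diff, which is 0 in the example table.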
Wenchen Fan	6b1ca886c0	[SPARK-31327][SQL] Write Spark version into Avro file metadata ### What changes were proposed in this pull request? Write Spark version into Avro file metadata ### Why are the changes needed? The version info is very useful for backward compatibility. This is also done in parquet/orc. ### Does this PR introduce any user-facing change? no ### How was this patch tested? new test Closes #28102 from cloud-fan/avro. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-03 12:43:33 +00:00
Maxim Gekk	c5323d2e8d	[SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write ### What changes were proposed in this pull request? In the PR, I propose to replace the following SQL configs: 1. `spark.sql.legacy.parquet.rebaseDateTime.enabled` by - `spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled` (`false` by default). The config enables rebasing dates/timestamps while saving to Parquet files. If it is set to `true`, dates/timestamps are converted to local date-time in Proleptic Gregorian calendar, date-time fields are extracted, and used in building new local date-time in the hybrid calendar (Julian + Gregorian). The resulted local date-time is converted to days or microseconds since the epoch. - `spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled` (`false` by default). The config enables rebasing of dates/timestamps in reading from Parquet files. 2. `spark.sql.legacy.avro.rebaseDateTime.enabled` by - `spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled` (`false` by default). It enables dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar via local date/timestamps. - `spark.sql.legacy.avro.rebaseDateTimeInRead.enabled` (`false` by default). It enables rebasing dates/timestamps from the hybrid calendar to Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z. ### Why are the changes needed? This allows to load dates/timestamps saved by Spark 2.4, and save to Parquet/Avro files without rebasing. And the reverse use case - load data saved by Spark 3.0, and save it in the form which is compatible with Spark 2.4. ### Does this PR introduce any user-facing change? Yes, users have to use new SQL configs. Old SQL configs are removed by the PR. 
### How was this patch tested? By existing test suites `AvroV1Suite`, `AvroV2Suite` and `ParquetIOSuite`. Closes #28082 from MaxGekk/split-rebase-configs. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-01 04:56:05 +00:00
Maxim Gekk	9f0c010a5c	[SPARK-31277][SQL][TESTS] Migrate `DateTimeTestUtils` from `TimeZone` to `ZoneId` ### What changes were proposed in this pull request? In the PR, I propose to change types of `DateTimeTestUtils` values and functions by replacing `java.util.TimeZone` to `java.time.ZoneId`. In particular: 1. Type of `ALL_TIMEZONES` is changed to `Seq[ZoneId]`. 2. Remove `val outstandingTimezones: Seq[TimeZone]`. 3. Change the type of the time zone parameter in `withDefaultTimeZone` to `ZoneId`. 4. Modify affected test suites. ### Why are the changes needed? Currently, Spark SQL's date-time expressions and functions have been already ported on Java 8 time API but tests still use old time APIs. In particular, `DateTimeTestUtils` exposes functions that accept only TimeZone instances. This is inconvenient, and CPU consuming because need to convert TimeZone instances to ZoneId instances via strings (zone ids). ### Does this PR introduce any user-facing change? No ### How was this patch tested? By affected test suites executed by jenkins builds. Closes #28033 from MaxGekk/with-default-time-zone. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-27 21:14:25 +08:00
Maxim Gekk	b402bc900a	[SPARK-31183][SQL][FOLLOWUP] Move rebase tests to `AvroSuite` and check the rebase flag out of function bodies ### What changes were proposed in this pull request? 1. The tests added by #27953 are moved from `AvroLogicalTypeSuite` to `AvroSuite`. 2. Checking of the `rebaseDateTime` flag is moved out from functions bodies. ### Why are the changes needed? 1. The tests are moved because they are not directly related to logical types. 2. Checking the flag out of functions bodies should improve performance. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running Avro tests via the command `build/sbt avro/test` Closes #27964 from MaxGekk/rebase-avro-datetime-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-20 19:02:54 +09:00
Maxim Gekk	4766a36647	[SPARK-31183][SQL] Rebase date/timestamp from/to Julian calendar in Avro ### What changes were proposed in this pull request? The PR addresses the issue of compatibility with Spark 2.4 and earlier versions in reading/writing dates and timestamps via the Avro datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendars is converted to a different number of days since the epoch 1970-01-01. For example, the 0001-01-01 date is converted to: - -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into Avro files. - -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value. The PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config: ``` spark.sql.legacy.avro.rebaseDateTime.enabled ``` which is set to `false` by default, which means the rebasing is not performed by default. The details of the implementation: 1. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing microseconds. 2. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing days. 3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to Avro files if the SQL config is on. 4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from Avro files if the SQL config is on. 5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis`, `timestamp-micros` logical types. ### Why are the changes needed? 
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions. ### Does this PR introduce any user-facing change? Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) +----------+ \|date \| +----------+ \|1001-01-07\| +----------+ ``` After the changes: ```scala scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true) scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) +----------+ \|date \| +----------+ \|1001-01-01\| +----------+ ``` ### How was this patch tested? 1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. 
The test reads back avro files saved by Spark 2.4.5 via: ```shell $ export TZ="America/Los_Angeles" ``` ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro") scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts")) df2: org.apache.spark.sql.DataFrame = [ts: timestamp] scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") scala> :paste // Entering paste mode (ctrl-D to finish) val timestampSchema = s""" \| { \| "namespace": "logical", \| "type": "record", \| "name": "test", \| "fields": [ \| {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null} \| ] \| } \|""".stripMargin // Exiting paste mode, now interpreting. scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro") ``` 2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased a date/timestamps and read them back w/ enabled/disabled rebasing, and compare results. : - `rebasing microseconds timestamps in write` - `rebasing milliseconds timestamps in write` - `rebasing dates in write` Closes #27953 from MaxGekk/rebase-avro-datetime. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-20 13:57:49 +08:00
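The day-count gap between the two calendars in the commit above can be reproduced with Julian Day Number arithmetic. The sketch below is a standalone Python illustration, not the `DateTimeUtils` code; the helper names `_jdn` and `epoch_days` are assumptions made for the example, and the formulas are the standard integer JDN conversions.

```python
def _jdn(year, month, day, gregorian):
    """Julian Day Number of a local date (valid for years after -4800)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    base = day + (153 * m + 2) // 5 + 365 * y + y // 4
    if gregorian:
        # Proleptic Gregorian calendar: apply the century leap-year rules.
        return base - y // 100 + y // 400 - 32045
    # Julian calendar: every fourth year is a leap year.
    return base - 32083


# The epoch 1970-01-01 is always taken in the Gregorian calendar.
EPOCH_JDN = _jdn(1970, 1, 1, gregorian=True)


def epoch_days(year, month, day, gregorian):
    """Days since 1970-01-01 for a local date in the chosen calendar."""
    return _jdn(year, month, day, gregorian) - EPOCH_JDN
```

Under these formulas the local date 0001-01-01 maps to -719164 in the Julian calendar and -719162 in the Proleptic Gregorian calendar. For 1001-01-01 the two counts differ by 6 days, which is why a Julian 1001-01-01 read without rebasing shows up as 1001-01-07.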
gatorsmile	28b8713036	[SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT ### What changes were proposed in this pull request? This patch is to bump the master branch version to 3.1.0-SNAPSHOT. ### Why are the changes needed? N/A ### Does this PR introduce any user-facing change? N/A ### How was this patch tested? N/A Closes #27698 from gatorsmile/updateVersion. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-25 19:44:31 -08:00
Guy Khazma	2d59ca464e	[SPARK-30475][SQL] File source V2: Push data filters for file listing ### What changes were proposed in this pull request? Follow up on [SPARK-30428](https://github.com/apache/spark/pull/27112) which added support for partition pruning in File source V2. This PR implements the necessary changes in order to pass the `dataFilters` to the `listFiles`. This enables having `FileIndex` implementations which use the `dataFilters` for further pruning the file listing (see the discussion [here](https://github.com/apache/spark/pull/27112#discussion_r364757217)). ### Why are the changes needed? Datasources such as `csv` and `json` do not implement the `SupportsPushDownFilters` trait. In order to support data skipping uniformly for all file based data sources, one can override the `listFiles` method in a `FileIndex` implementation, which consults external metadata and prunes the list of files. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modifying the unit tests for v2 file sources to verify the `dataFilters` are passed Closes #27157 from guykhazma/PushdataFiltersInFileListing. Authored-by: Guy Khazma <guykhag@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-01-20 20:20:37 -08:00
Maxim Gekk	fd69533593	[SPARK-30482][CORE][SQL][TESTS][FOLLOW-UP] Output caller info in log appenders while reaching the limit ### What changes were proposed in this pull request? In the PR, I propose to output an additional message from the tests where a log appender is added. The message is printed as a part of `IllegalStateException` in the case of reaching the limit of maximum number of logged events. ### Why are the changes needed? If a log appender is not removed from the log4j appenders list, the caller message could help to investigate the problem and find the test which doesn't remove the log appender. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the modified test suites `AvroSuite`, `CSVSuite`, `ResolveHintsSuite`, etc. Closes #27296 from MaxGekk/assign-name-to-log-appender. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-21 10:19:07 +09:00
Maxim Gekk	ab048990e0	[SPARK-30558][SQL] Avoid rebuilding `AvroOptions` per each partition ### What changes were proposed in this pull request? In the PR, I propose to move the creation of `AvroOptions` out of `AvroPartitionReaderFactory.buildReader`, and to create it earlier in `AvroScan.createReaderFactory`. ### Why are the changes needed? - To avoid building `AvroOptions` from a map of Avro options and Hadoop conf per each partition. - If an instance of `AvroOptions` is built only once at the driver side, we could output warnings while parsing Avro options without worrying about the noisiness of the warnings. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By `AvroSuite` Closes #27272 from MaxGekk/avro-options-once-for-read. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-20 15:22:23 +09:00
Maxim Gekk	0f3d744c3f	[MINOR][TESTS] Remove unsupported `header` option in AvroSuite ### What changes were proposed in this pull request? In the PR, I propose to remove the `header` option in the `Avro source v2: support partition pruning` test. ### Why are the changes needed? The option is not supported by Avro, and may mislead readers. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By `AvroSuite`. Closes #27203 from MaxGekk/avro-suite-remove-header-option. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-14 11:49:50 -08:00
Maxim Gekk	db7262a00f	[SPARK-30509][SQL] Fix deprecation log warning in Avro schema inferring ### What changes were proposed in this pull request? In the PR, I propose to check the `ignoreExtensionKey` option in the case insensitive map of `AvroOptions`. ### Why are the changes needed? The map `options` passed to `AvroUtils.inferSchema` in fact contains all keys in lower case, because the map is converted from a `CaseInsensitiveStringMap`. Consequently, the check `3663dbe541/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala (L45)` always returns `false`, and the deprecation log warning is never printed. ### Does this PR introduce any user-facing change? Yes, after the changes the log warning is printed once. ### How was this patch tested? Added new test to `AvroSuite` which checks existence of log warning. Closes #27200 from MaxGekk/avro-fix-ignoreExtension-contains. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-14 11:48:07 -08:00
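The failure mode in the commit above (a mixed-case key looked up in a map whose keys were already lower-cased) can be shown with a small standalone Python sketch. The class `CaseInsensitiveOptions` and its methods are hypothetical stand-ins for illustration, not Spark's `CaseInsensitiveStringMap` or `AvroOptions`.

```python
class CaseInsensitiveOptions:
    """Hypothetical option map that lower-cases keys on the way in."""

    def __init__(self, options):
        self._options = {key.lower(): value for key, value in options.items()}

    def as_plain_dict(self):
        # Converting to a plain dict keeps the lower-cased keys but loses
        # the case-insensitive lookup behaviour.
        return dict(self._options)

    def contains(self, key):
        # Case-insensitive lookup: normalise the probe key as well.
        return key.lower() in self._options


options = CaseInsensitiveOptions({"ignoreExtension": "true"})
plain = options.as_plain_dict()

# Exact-match lookup on the converted map never finds the mixed-case key.
found_in_plain = "ignoreExtension" in plain
# Lookup through the case-insensitive wrapper does find it.
found_in_wrapper = options.contains("ignoreExtension")
```

Here `found_in_plain` is `False` while `found_in_wrapper` is `True`, mirroring why the deprecation warning was never printed before the fix.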
Maxim Gekk	f5118f81e3	[SPARK-30409][SPARK-29173][SQL][TESTS] Use `NoOp` datasource in SQL benchmarks ### What changes were proposed in this pull request? In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`. ### Why are the changes needed? To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Re-run all modified benchmarks using Amazon EC2. \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/10 \| - Run `TPCDSQueryBenchmark` using instructions from the PR #26049 ``` # `spark-tpcds-datagen` needs this. (JDK8) $ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4 $ export SPARK_HOME=$PWD $ ./build/mvn clean package -DskipTests # Generate data. 
(JDK8) $ git clone gitgithub.com:maropu/spark-tpcds-datagen.git $ cd spark-tpcds-datagen/ $ build/mvn clean package $ mkdir -p /data/tpcds $ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4` ``` - Other benchmarks ran by the script: ``` #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'], ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'], ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'], ['hive/test', 
'org.apache.spark.sql.hive.orc.OrcReadBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #27078 from MaxGekk/noop-in-benchmarks. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-12 13:18:19 -08:00
Gengliang Wang	94fc0e3235	[SPARK-30428][SQL] File source V2: support partition pruning ### What changes were proposed in this pull request? File source V2: support partition pruning. Note: subquery predicates are not pushed down for partition pruning even after this PR, due to the limitation for the current data source V2 API and framework. The rule `PlanSubqueries` requires the subquery expression to be in the children or class parameters in `SparkPlan`, while the condition is not satisfied for `BatchScanExec`. ### Why are the changes needed? It's important for reading performance. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New unit tests for all the V2 file sources Closes #27112 from gengliangwang/PartitionPruningInFileScan. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-09 21:53:37 +08:00
Gengliang Wang	0d589f410b	[SPARK-30267][SQL][FOLLOWUP] Use while loop in Avro Array Deserializer ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/26907 It changes the for loop `for (element <- array.asScala)` to while loop ### Why are the changes needed? As per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, we should use while loop for the performance-sensitive code. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #27127 from gengliangwang/SPARK-30267-FollowUp. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-01-07 22:39:25 -08:00
Ximo Guanter	604d6799df	[SPARK-30226][SQL] Remove withXXX functions in WriteBuilder ### What changes were proposed in this pull request? Adding a `LogicalWriteInfo` interface as suggested by cloud-fan in https://github.com/apache/spark/pull/25990#issuecomment-555132991 ### Why are the changes needed? It provides compile-time guarantees where we previously had none, which will make it harder to introduce bugs in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Compiles and passes tests Closes #26678 from edrevo/add-logical-write-info. Lead-authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com> Co-authored-by: Ximo Guanter Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-06 23:53:45 +08:00
Steven Aerts	e64512558f	[SPARK-30267][SQL] Avro arrays can be of any List The Deserializer assumed that avro arrays are always of type `GenericData$Array` which is not the case. Assuming they are of type `java.util.List` is safer and fixes a ClassCastException in some avro code. ### What changes were proposed in this pull request? `java.util.List` has all the necessary methods and is a supertype of `GenericData$Array`. ### Why are the changes needed? To prevent the following exception in more complex avro objects: ``` java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.avro.generic.GenericData$Array at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170) at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169) at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314) at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310) at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332) at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329) at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56) at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? The current tests already test this behavior. In essence this patch just changes a type cast to a more basic type. So I expect no functional impact. Closes #26907 from steven-aerts/spark-30267. Authored-by: Steven Aerts <steven.aerts@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-01-03 12:44:49 -08:00
Gengliang Wang	07593d362f	[SPARK-27506][SQL][FOLLOWUP] Use option `avroSchema` to specify an evolved schema in `from_avro` ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/26780 In https://github.com/apache/spark/pull/26780, a new Avro data source option `actualSchema` is introduced for setting the original Avro schema in function `from_avro`, while the expected schema is supposed to be set in the parameter `jsonFormatSchema` of `from_avro`. However, there is another Avro data source option `avroSchema`. It is used for setting the expected schema in reading and writing. This PR is to use the option `avroSchema` for reading Avro data with an evolved schema and remove the new one `actualSchema` ### Why are the changes needed? Unify and simplify the Avro data source options. ### Does this PR introduce any user-facing change? Yes. To deserialize Avro data with an evolved schema, before changes: ``` from_avro('col, expectedSchema, ("actualSchema" -> actualSchema)) ``` After changes: ``` from_avro('col, actualSchema, ("avroSchema" -> expectedSchema)) ``` The second parameter is always the actual Avro schema after changes. ### How was this patch tested? Update the existing tests in https://github.com/apache/spark/pull/26780 Closes #27045 from gengliangwang/renameAvroOption. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-30 18:14:21 +09:00
Yuming Wang	696288f623	[INFRA] Reverts commit `56dcd79` and `c216ef1` ### What changes were proposed in this pull request? 1. Revert "Preparing development version 3.0.1-SNAPSHOT": `56dcd79` 2. Revert "Preparing Spark release v3.0.0-preview2-rc2": `c216ef1` ### Why are the changes needed? Shouldn't change master. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? manual test: https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master Closes #26915 from wangyum/revert-master. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-12-16 19:57:44 -07:00
Yuming Wang	56dcd79992	Preparing development version 3.0.1-SNAPSHOT	2019-12-17 01:57:27 +00:00
Yuming Wang	c216ef1d03	Preparing Spark release v3.0.0-preview2-rc2	2019-12-17 01:57:21 +00:00
Maxim Gekk	b03ce63c05	[SPARK-30258][TESTS] Eliminate warnings of deprecated Spark APIs in tests ### What changes were proposed in this pull request? In the PR, I propose to move all tests that use deprecated Spark APIs to separate test classes, and add the annotation: ```scala @deprecated("This test suite will be removed.", "3.0.0") ``` The annotation suppresses warnings from already deprecated methods and classes. ### Why are the changes needed? The warnings about deprecated Spark APIs in tests do not indicate any issues because the tests use such APIs intentionally. Eliminating the warnings makes it easier to spot other warnings that could show real problems. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites and by - DeprecatedAvroFunctionsSuite - DeprecatedDateFunctionsSuite - DeprecatedDatasetAggregatorSuite - DeprecatedStreamingAggregationSuite - DeprecatedWholeStageCodegenSuite Closes #26885 from MaxGekk/eliminate-deprecate-warnings. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-16 18:24:32 -06:00
Fokko Driesprong	99ea324b6f	[SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas Follow up of https://github.com/apache/spark/pull/24405 ### What changes were proposed in this pull request? The current implementation of _from_avro_ and _AvroDataToCatalyst_ doesn't allow doing schema evolution since it requires the deserialization of an Avro record with the exact same schema with which it was serialized. The proposed change is to add a new option `actualSchema` to allow passing the schema used to serialize the records. This allows using a different compatible schema for reading by passing both schemas to _GenericDatumReader_. If no writer's schema is provided, nothing changes from before. ### Why are the changes needed? Consider the following example. ``` // schema ID: 1 val schema1 = """ { "type": "record", "name": "MySchema", "fields": [ {"name": "col1", "type": "int"}, {"name": "col2", "type": "string"} ] } """ // schema ID: 2 val schema2 = """ { "type": "record", "name": "MySchema", "fields": [ {"name": "col1", "type": "int"}, {"name": "col2", "type": "string"}, {"name": "col3", "type": "string", "default": ""} ] } """ ``` The two schemas are compatible - i.e. you can use `schema2` to deserialize events serialized with `schema1`, in which case there will be the field `col3` with the default value. Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. We want to combine them into one dataframe for storing or further processing. With the current `from_avro` function we can only decode each of them with the corresponding schema: ``` scala> val df1 = ... // Avro events created with schema1 df1: org.apache.spark.sql.DataFrame = [eventBytes: binary] scala> val decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded") decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>] scala> val df2= ... 
// Avro events created with schema2 df2: org.apache.spark.sql.DataFrame = [eventBytes: binary] scala> val decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded") decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>] ``` but then `decodedDf1` and `decodedDf2` have different Spark schemas and we can't union them. Instead, with the proposed change we can decode `df1` in the following way: ``` scala> import scala.collection.JavaConverters._ scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("actualSchema" -> schema1).asJava) as "decoded") decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>] ``` so that both dataframes have the same schemas and can be merged. ### Does this PR introduce any user-facing change? This PR allows users to pass a new configuration but it doesn't affect current code. ### How was this patch tested? A new unit test was added. Closes #26780 from Fokko/SPARK-27506. Lead-authored-by: Fokko Driesprong <fokko@apache.org> Co-authored-by: Gianluca Amori <gianluca.amori@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-12-11 01:26:29 -08:00
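The schema-evolution idea in the commit above (decode data written with `schema1` while reading it as `schema2`, filling `col3` from its default) can be illustrated without Avro or Spark. The Python sketch below is a deliberately simplified model of reader/writer schema resolution: the function `resolve_record` and the list-of-dicts field layout are assumptions for the example, and real Avro resolution handles many more cases (type promotion, aliases, unions).

```python
# Field lists mirroring schema1 and schema2 from the commit message.
SCHEMA1_FIELDS = [
    {"name": "col1", "type": "int"},
    {"name": "col2", "type": "string"},
]
SCHEMA2_FIELDS = SCHEMA1_FIELDS + [
    {"name": "col3", "type": "string", "default": ""},
]


def resolve_record(record, reader_fields):
    """Project a decoded record onto the reader's fields.

    Fields missing from the written record take the reader's default;
    a missing field without a default makes the schemas incompatible.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# A record written with schema1, read with schema2 as the reader schema.
written_with_schema1 = {"col1": 1, "col2": "a"}
read_as_schema2 = resolve_record(written_with_schema1, SCHEMA2_FIELDS)
```

After resolution both kinds of records share the `col1`, `col2`, `col3` layout, which is the property that lets the two dataframes in the commit message be unioned.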
Jungtaek Lim (HeartSaVioR)	538b8d101c	[SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary imports ### What changes were proposed in this pull request? This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (Github Action caught it and I can reproduce it locally). It looks like the Jenkins build may apply a different Java style check policy, or a less accurate one. ### Why are the changes needed? The Java linter starts complaining. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? lint-java passed locally This closes #26819 Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-09 08:57:20 -08:00
Gengliang Wang	a717d219a6	[SPARK-30159][SQL][TESTS] Fix the method calls of `QueryTest.checkAnswer` ### What changes were proposed in this pull request? Before this PR, the method `checkAnswer` in Object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen. The actual exceptions are thrown in the trait `QueryTest`. However, there are some test suites (`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers. After this PR: 1. the method `checkAnswer` in Object `QueryTest` will fail tests on errors or unexpected results. 2. add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. Some test suites use this one to customize the test failure message. 3. for the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from Object `QueryTest`. ### Why are the changes needed? We should fix these method calls to perform actual validations in test suites. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #26788 from gengliangwang/fixCheckAnswer. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-09 22:19:08 +09:00
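The pitfall fixed by the commit above is a checker that reports a mismatch through its return value, so a caller that ignores the result silently passes. The Python sketch below illustrates the before/after split; the snake_case names echo `getErrorMessageInCheckAnswer` and `checkAnswer` but are hypothetical, and comparing sorted lists is an assumption made to keep the example small.

```python
def get_error_message_in_check_answer(actual, expected):
    """Return None when the rows match, otherwise a description."""
    if sorted(actual) == sorted(expected):
        return None
    return f"Results do not match: expected {sorted(expected)}, got {sorted(actual)}"


def check_answer(actual, expected):
    """Fail loudly instead of returning a message the caller may drop."""
    message = get_error_message_in_check_answer(actual, expected)
    if message is not None:
        raise AssertionError(message)


# Ignoring the returned message is a silent no-op, even on a mismatch.
get_error_message_in_check_answer([1], [2])

# The raising variant cannot be ignored by accident.
try:
    check_answer([1], [2])
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

Keeping the message-returning variant alongside the raising one matches the commit's note that some suites use it to customize the failure message.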
Kent Yao	9562b26914	[SPARK-29757][SQL] Move calendar interval constants together ### What changes were proposed in this pull request? ```java public static final int YEARS_PER_DECADE = 10; public static final int YEARS_PER_CENTURY = 100; public static final int YEARS_PER_MILLENNIUM = 1000; public static final byte MONTHS_PER_QUARTER = 3; public static final int MONTHS_PER_YEAR = 12; public static final byte DAYS_PER_WEEK = 7; public static final long DAYS_PER_MONTH = 30L; public static final long HOURS_PER_DAY = 24L; public static final long MINUTES_PER_HOUR = 60L; public static final long SECONDS_PER_MINUTE = 60L; public static final long SECONDS_PER_HOUR = MINUTES_PER_HOUR * SECONDS_PER_MINUTE; public static final long SECONDS_PER_DAY = HOURS_PER_DAY * SECONDS_PER_HOUR; public static final long MILLIS_PER_SECOND = 1000L; public static final long MILLIS_PER_MINUTE = SECONDS_PER_MINUTE * MILLIS_PER_SECOND; public static final long MILLIS_PER_HOUR = MINUTES_PER_HOUR * MILLIS_PER_MINUTE; public static final long MILLIS_PER_DAY = HOURS_PER_DAY * MILLIS_PER_HOUR; public static final long MICROS_PER_MILLIS = 1000L; public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLIS; public static final long MICROS_PER_MINUTE = SECONDS_PER_MINUTE * MICROS_PER_SECOND; public static final long MICROS_PER_HOUR = MINUTES_PER_HOUR * MICROS_PER_MINUTE; public static final long MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR; public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY; /* 365.25 days per year assumes leap year every four years */ public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100; public static final long NANOS_PER_MICROS = 1000L; public static final long NANOS_PER_MILLIS = MICROS_PER_MILLIS * NANOS_PER_MICROS; public static final long NANOS_PER_SECOND = MILLIS_PER_SECOND * NANOS_PER_MILLIS; ``` The above parameters are defined in IntervalUtils, DateTimeUtils, and CalendarInterval, some of them are redundant, some of them are cross-referenced. ### Why are the changes needed? To simplify code, enhance consistency and reduce risks ### Does this PR introduce any user-facing change? no ### How was this patch tested? modified uts Closes #26399 from yaooqinn/SPARK-29757. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-07 19:48:19 +08:00
Xingbo Jiang	8207c835b4	Revert "Prepare Spark release v3.0.0-preview-rc2" This reverts commit `007c873ae3`.	2019-10-30 17:45:44 -07:00
Xingbo Jiang	007c873ae3	Prepare Spark release v3.0.0-preview-rc2 ### What changes were proposed in this pull request? To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name. Made the following changes in this PR: * Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview` * Update the sparkR version number check logic to allow jvm version like `3.0.0-preview` Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too. We shall revert the changes after 3.0.0-preview release passed. ### Why are the changes needed? To make the maven release repository to accept the built jars. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A	2019-10-30 17:42:59 -07:00
Xingbo Jiang	b33a58c0c6	Revert "Prepare Spark release v3.0.0-preview-rc1" This reverts commit `5eddbb5f1d`.	2019-10-28 22:32:34 -07:00
Xingbo Jiang	5eddbb5f1d	Prepare Spark release v3.0.0-preview-rc1 ### What changes were proposed in this pull request? To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name. Made the following changes in this PR: * Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview` * Update the PySpark version from `3.0.0.dev0` to `3.0.0` Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too. We shall revert the changes after 3.0.0-preview release passed. ### Why are the changes needed? To make the maven release repository to accept the built jars. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #26243 from jiangxb1987/3.0.0-preview-prepare. Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2019-10-28 22:31:29 -07:00
Dongjoon Hyun	a0b3d7a323	[SPARK-29300][TESTS] Compare `catalyst` and `avro` module benchmark in JDK8/11 ### What changes were proposed in this pull request? This PR regenerate the benchmark results in `catalyst` and `avro` module in order to compare JDK8/JDK11 result. ### Why are the changes needed? This PR aims to verify that there is no regression on JDK11. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only update. We need to run the benchmark manually. Closes #25972 from dongjoon-hyun/SPARK-29300. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 17:59:43 -07:00

120 commits