Commit graph

29373 commits

yi.wu 546d2eb5d4 [SPARK-34384][CORE] Add missing docs for ResourceProfile APIs
### What changes were proposed in this pull request?

This PR adds missing docs for ResourceProfile related APIs. Besides, it includes a few minor changes on API:

* `ResourceProfileBuilder.build` -> `ResourceProfileBuilder.build()` (see the usage sketch below)
* Provides the Java-specific API `allSupportedExecutorResourcesJList`
* Makes `ResourceAllocator` private, since it was mistakenly exposed previously
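
For context, a hedged sketch of the builder API whose docs are being added (assumes a live `SparkContext` named `sc`; the GPU amounts are illustrative):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Request 4 cores and 1 GPU per executor, and 1 CPU and 1 GPU per task.
val execReqs = new ExecutorResourceRequests().cores(4).resource("gpu", 1)
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)

val profile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build()  // a method call after the signature change above

// Attach the profile to an RDD for stage-level scheduling.
sc.parallelize(1 to 100).withResources(profile)
```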

### Why are the changes needed?

Add missing API docs

### Does this PR introduce _any_ user-facing change?

No, as Apache Spark 3.1 hasn't been officially released yet.

### How was this patch tested?

Updated unit tests due to the signature change of `build()`.

Closes #31496 from Ngone51/resource-profile-api-cleanup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 18:29:44 +09:00
Max Gekk 04c3125dcf [SPARK-34360][SQL] Support truncation of v2 tables
### What changes were proposed in this pull request?
1. Add a new interface `TruncatableTable` which represents tables that allow atomic truncation.
2. Implement the new method in `InMemoryTable` and in `InMemoryPartitionTable` (a sketch follows below).
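
A hedged sketch of a connector-side implementation (the class and its backing store are hypothetical; the `truncateTable()` contract is the one this PR adds):

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TruncatableTable}
import org.apache.spark.sql.types.StructType

// A toy table that supports atomic truncation (the backing store is hypothetical).
class InMemoryDemoTable extends Table with TruncatableTable {
  private val rows = new util.concurrent.ConcurrentLinkedQueue[AnyRef]()

  override def name(): String = "demo"
  override def schema(): StructType = new StructType().add("id", "int")
  override def capabilities(): util.Set[TableCapability] = util.Collections.emptySet()

  // Remove all rows atomically; return true iff truncation succeeded.
  override def truncateTable(): Boolean = { rows.clear(); true }
}
```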

### Why are the changes needed?
To support `TRUNCATE TABLE` for v2 tables.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
Added new tests to `TableCatalogSuite` that check truncation of non-partitioned and partitioned tables:
```
$ build/sbt "test:testOnly *TableCatalogSuite"
```

Closes #31475 from MaxGekk/dsv2-truncate-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 17:50:38 +09:00
Kent Yao 1fac706db5 [SPARK-34373][SQL] HiveThriftServer2 startWithContext may hang with a race issue
### What changes were proposed in this pull request?

Fix a race issue by interrupting the thread.

### Why are the changes needed?

```
21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: No underlying server socket.
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126)
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
at org.apache.thrift.transport.TServerTransport.acceException in thread "Thread-15" java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238)
at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246)
at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227)
at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221)
```
When the `TServer` tries to `serve` after `stop`, it hangs forever with the log above.

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

Passing CI.

Closes #31479 from yaooqinn/SPARK-34373.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 17:37:12 +09:00
Gera Shegalov fadd0f5d9b [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator
This PR is a fix for the JLS 17.5.3 violation identified in
zsxwing's [19/Feb/19 11:47 comment](https://issues.apache.org/jira/browse/SPARK-20977?focusedCommentId=16772277&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772277) on the JIRA.

### What changes were proposed in this pull request?
- Use a var field to hold the state of the collection accumulator

### Why are the changes needed?
`AccumulatorV2` auto-registration of accumulators during `readObject` doesn't work with final fields that are post-processed outside `readObject`. As it stands, incompletely initialized objects are published to the heartbeat thread. This leads to sporadic exceptions knocking out executors, which increases the cost of jobs. We observe such failures on a regular basis: https://github.com/NVIDIA/spark-rapids/issues/1522.
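
To make the pattern concrete, a hedged sketch of the safe-publication idea (not the exact Spark code):

```scala
import java.util.{ArrayList, Collections, List => JList}

class SafeCollectionState[T] extends Serializable {
  // JLS 17.5.3: writes to final fields performed reflectively during
  // deserialization are not guaranteed visible to other threads, so avoid `val`.
  @transient private var _list: JList[T] = _

  // Lazily initialize under the object's lock so a reader that got the object
  // early (e.g. the heartbeat thread) still observes a consistent state.
  private def getOrCreate: JList[T] = synchronized {
    if (_list == null) {
      _list = Collections.synchronizedList(new ArrayList[T]())
    }
    _list
  }

  def add(v: T): Unit = getOrCreate.add(v)

  def snapshot: JList[T] = synchronized { new ArrayList[T](getOrCreate) }
}
```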

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
- This is a concurrency bug that is almost impossible to reproduce as a quick unit test.
- By trial and error I crafted a command https://github.com/NVIDIA/spark-rapids/pull/1688 that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these exceptions have not shown up after running overnight for 10+ hours.
- Existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite`.

Closes #31540 from gerashegalov/SPARK-20977.

Authored-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-20 20:57:14 -06:00
Yuchen Huo 7de49a8fc0 [SPARK-34481][SQL] Refactor dataframe reader/writer optionsWithPath logic
### What changes were proposed in this pull request?

Extract the `optionsWithPath` logic into its own function.

### Why are the changes needed?

Reduce the code duplication and improve modularity.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Just some refactoring. Existing tests.

Closes #31599 from yuchenhuo/SPARK-34481.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-20 17:57:43 -08:00
Kousuke Saruta 82b33a3041 [SPARK-34379][SQL] Map JDBC RowID to StringType rather than LongType
### What changes were proposed in this pull request?

This PR fixes an issue where `java.sql.RowId` is mapped to `LongType`; `StringType` should be preferred.

In the current implementation, the JDBC RowID type is mapped to `LongType` except in `OracleDialect`, but there is no guarantee that a RowID can be converted to a long.
`java.sql.RowId` declares `toString` and the specification of `java.sql.RowId` says

> _all methods on the RowId interface must be fully implemented if the JDBC driver supports the data type_
(https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)

So, we should prefer StringType to LongType.
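
For illustration, a hedged sketch of this kind of mapping in a custom `JdbcDialect` (the dialect itself is hypothetical; `getCatalystType` is the standard extension point):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

object ExampleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:example")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      // Map ROWID to StringType: RowId.toString is always available per the JDBC spec.
      case Types.ROWID => Some(StringType)
      case _ => None
    }
}
```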

### Why are the changes needed?

This seems to be a potential bug.

### Does this PR introduce _any_ user-facing change?

Yes. RowID is mapped to StringType rather than LongType.

### How was this patch tested?

A new test. The existing test case `SPARK-32992: map Oracle's ROWID type to StringType` in `OracleIntegrationSuite` also passes.

Closes #31491 from sarutak/rowid-type.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-20 23:45:56 +09:00
Sean Owen f78466dca6 [SPARK-7768][CORE][SQL] Open UserDefinedType as a Developer API
### What changes were proposed in this pull request?

UserDefinedType and UDTRegistration become public Developer APIs, not package-private to Spark.
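
For illustration, a hedged sketch of what a third-party UDT can now look like (the `Point` class is hypothetical; the `UserDefinedType` method signatures are the existing ones being opened up):

```scala
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types._

case class Point(x: Double, y: Double)

// A UDT storing Point as a two-element double array.
class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: Point): Any = ArrayData.toArrayData(Array(p.x, p.y))

  override def deserialize(datum: Any): Point = datum match {
    case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}

// Map the user class to its UDT so encoders can find it.
UDTRegistration.register(classOf[Point].getName, classOf[PointUDT].getName)
```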

### Why are the changes needed?

This proposes to simply open up the UserDefinedType class as a developer API. It was public in 1.x, but closed in 2.x for some possible redesign that does not seem to have happened.

Other libraries have managed to define UDTs anyway by inserting shims into the Spark namespace, and this evidently has worked OK. But package isolation in Java 9+ breaks this.

The logic here is mostly: this is de facto a stable API, so it can at least be opened to developers with the usual caveats about developer APIs.

Open questions:

- Is there in fact some important redesign that's needed before opening it? The comment to this effect is from 2016
- Is this all that needs to be opened up? Like PythonUserDefinedType?
- Should any of this be kept package-private?

This was first proposed in https://github.com/apache/spark/pull/16478 though it was a larger change, but, the other API issues it was fixing seem to have been addressed already (e.g. no need to return internal Spark types). It was never really reviewed.

My hunch is that there isn't much downside, and some upside, to just opening this as-is now.

### Does this PR introduce _any_ user-facing change?

UserDefinedType becomes visible to developers to subclass.

### How was this patch tested?

Existing tests; there is no change to the existing logic.

Closes #31461 from srowen/SPARK-7768.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-20 07:32:06 -06:00
Bo Zhang 489d32aa9b [SPARK-34471][SS][DOCS] Document Streaming Table APIs in Structured Streaming Programming Guide
### What changes were proposed in this pull request?

This change is to document the newly added streaming table APIs in Structured Streaming Programming Guide.
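
For reference, a hedged sketch of the table APIs being documented, `DataStreamReader.table` and `DataStreamWriter.toTable` (table names and the checkpoint path are illustrative):

```scala
// Read a table as a streaming DataFrame, transform it, and write to another table.
val stream = spark.readStream.table("input_table")

val query = stream
  .selectExpr("value AS transformed")  // assumes input_table has a column named value
  .writeStream
  .option("checkpointLocation", "/tmp/checkpoint")  // illustrative path
  .toTable("output_table")
```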

### Why are the changes needed?

This will help our users when they try to use the new APIs.

### Does this PR introduce _any_ user-facing change?
Yes. Users will see the changes in the programming guide.

### How was this patch tested?
Built the HTML page and verified.

Attached is a screenshot of the section added:
![Table APIs Section - Scala](https://user-images.githubusercontent.com/44179472/108581923-1ff86700-736b-11eb-8fcd-efa04ac936de.png)

Closes #31590 from bozhang2820/table-api-doc.

Lead-authored-by: Bo Zhang <bo.zhang@databricks.com>
Co-authored-by: Bo Zhang <bozhang2820@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-02-20 15:54:43 +09:00
yi.wu 4dc16f2d59 [SPARK-24818][CORE] Support delay scheduling for barrier execution
### What changes were proposed in this pull request?

This PR tries to support the (non-legacy) delay scheduling for the barrier execution.

The idea is to add a pending launch task list (`barrierPendingLaunchTasks`) to the barrier `TaskSetManager`. We don't actually add those pending tasks to the running list, post the task start event to the listeners, and so on until all tasks in the barrier `TaskSetManager` have been added to `barrierPendingLaunchTasks` after a single `resourceOffers()` round. If only some of the tasks can launch after a single `resourceOffers()` round, we revert all the resources assigned to the tasks in `barrierPendingLaunchTasks`, clear `barrierPendingLaunchTasks`, and wait for the next `resourceOffers()` round. The barrier `TaskSetManager` should eventually be launched, since we've ensured enough slots before scheduling.
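
For reference, a hedged sketch of the user-facing barrier API whose scheduling this PR improves (assumes a live `SparkContext` named `sc`; the dataset and partition count are illustrative):

```scala
import org.apache.spark.BarrierTaskContext

// Barrier execution: either all tasks of the stage launch together, or none do.
val result = sc
  .parallelize(1 to 100, numSlices = 4)
  .barrier()
  .mapPartitions { iter =>
    val ctx = BarrierTaskContext.get()
    ctx.barrier()  // global synchronization point across all tasks in the stage
    iter
  }
  .collect()
```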

### Why are the changes needed?

Currently, with delay scheduling enabled for barrier execution, the application can abort immediately when only some of the tasks can be launched. This is really bad, especially when the application has already completed many stages before the barrier stage. For example, the application may do some ETL jobs before the barrier job (for ML).

After this PR, this scenario should no longer happen.

### Does this PR introduce _any_ user-facing change?

Yes, users will no longer face the `Fail resource offers for barrier stage...` error.

### How was this patch tested?

Added/updated unit tests.

Closes #30650 from Ngone51/barrier-delay-scheduling.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-02-19 16:04:44 -06:00
Dongjoon Hyun 484a83e73e [SPARK-34469][K8S] Ignore RegisterExecutor when SparkContext is stopped
### What changes were proposed in this pull request?

This PR aims to make `KubernetesClusterSchedulerBackend` ignore `RegisterExecutor` message when `SparkContext` is stopped already.

### Why are the changes needed?

If the Spark driver is terminated, the executors will be removed by K8s automatically.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test case.

Closes #31587 from dongjoon-hyun/SPARK-34469.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-19 09:36:07 -08:00
Zhichao Zhang 96bcb4bbe4 [SPARK-34283][SQL] Combines all adjacent 'Union' operators into a single 'Union' when using 'Dataset.union.distinct.union.distinct'
### What changes were proposed in this pull request?

Handle the `Deduplicate(Keys, Union)` operation in the `CombineUnions` rule so that adjacent `Union` operators are combined into a single `Union` when using `Dataset.union.distinct.union.distinct`.
Currently this only handles the distinct-like `Deduplicate`, where the keys == output. For example:
```
val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((6, 2, 5)).toDF("a", "b", "c")
val df3 = Seq((2, 4, 3)).toDF("c", "a", "b")
val df4 = Seq((1, 4, 5)).toDF("b", "a", "c")
val unionDF1 = df1.unionByName(df2).dropDuplicates(Seq("b", "a", "c"))
      .unionByName(df3).dropDuplicates().unionByName(df4)
      .dropDuplicates("a")
```
In this case, **all Union operators will be combined**.
But in the following case:
```
val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((6, 2, 5)).toDF("a", "b", "c")
val df3 = Seq((2, 4, 3)).toDF("c", "a", "b")
val df4 = Seq((1, 4, 5)).toDF("b", "a", "c")
val unionDF = df1.unionByName(df2).dropDuplicates(Seq("a"))
      .unionByName(df3).dropDuplicates("c").unionByName(df4)
      .dropDuplicates("b")
```
In this case, **none of the Unions will be combined, because `Deduplicate.keys` does not equal `Union.output`**.

### Why are the changes needed?

When using `Dataset.union.distinct.union.distinct`, the operator is `Deduplicate(Keys, Union)`, whereas `AstBuilder` transforms a SQL-style `UNION` into the `Distinct(Union)` operator. The `CombineUnions` rule in the Optimizer only handles the `Distinct(Union)` operator, not `Deduplicate(Keys, Union)`.
Please see the detailed  description in [SPARK-34283](https://issues.apache.org/jira/browse/SPARK-34283).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #31404 from zzcclp/SPARK-34283.

Authored-by: Zhichao Zhang <441586683@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 15:19:13 +00:00
gengjiaan 06df1210d4 [SPARK-28123][SQL] String Functions: support btrim
### What changes were proposed in this pull request?
Spark supports `trim`/`ltrim`/`rtrim` now. The function `btrim` is an alternate form of `TRIM(BOTH <chars> FROM <expr>)`.
`btrim` removes the longest string consisting only of specified characters from the start and end of a string.

Mainstream databases that support this feature are listed below, followed by a usage sketch:

**Postgresql**
https://www.postgresql.org/docs/11/functions-binarystring.html

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/BTRIM.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_____5

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_BTRIM.html

**Druid**
https://druid.apache.org/docs/latest/querying/sql.html#string-functions

**Greenplum**
http://docs.greenplum.org/6-8/ref_guide/function-summary.html
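
For illustration, a hedged usage sketch (expected results follow the btrim semantics described above):

```scala
// Trim the characters 'S' and 'L' from both ends.
spark.sql("SELECT btrim('SSparkSQLS', 'SL')").show()  // expected: parkSQ

// With no trim set, whitespace is removed from both ends.
spark.sql("SELECT btrim('    SparkSQL   ')").show()   // expected: SparkSQL
```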

### Why are the changes needed?
btrim is very useful.

### Does this PR introduce _any_ user-facing change?
Yes. `btrim` is a new function.

### How was this patch tested?
Jenkins test.

Closes #31390 from beliefer/SPARK-28123-support-btrim.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 13:28:49 +00:00
Peter Toth 27abb6ab56 [SPARK-34421][SQL] Resolve temporary functions and views in views with CTEs
### What changes were proposed in this pull request?
This PR:
- Fixes a bug that prevents analysis of:
  ```
  CREATE TEMPORARY VIEW temp_view AS WITH cte AS (SELECT temp_func(0)) SELECT * FROM cte;
  SELECT * FROM temp_view
  ```
  by throwing:
  ```
  Undefined function: 'temp_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
  ```
- Fixes a bug where no analysis error is reported when it should be for:
  ```
  CREATE TEMPORARY VIEW temp_view AS SELECT 0;
  CREATE VIEW view_on_temp_view AS WITH cte AS (SELECT * FROM temp_view) SELECT * FROM cte
  ```
  by properly collecting temporary objects from VIEW definitions with CTEs.

- Minor refactor to make the affected code more readable.

### Why are the changes needed?
To fix a bug introduced with https://github.com/apache/spark/pull/30567

### Does this PR introduce _any_ user-facing change?
Yes, the query works again.

### How was this patch tested?
Added new UT + existing ones.

Closes #31550 from peter-toth/SPARK-34421-temp-functions-in-views-with-cte.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 18:14:49 +08:00
Max Gekk b26e7b510b [SPARK-34314][SQL] Fix partitions schema inference
### What changes were proposed in this pull request?
Infer the partition schema by:
1. inferring the common type over all partition part values, and
2. casting those values to the common type

Before the changes:
1. Spark creates a literal with the most appropriate type for each concrete partition value, i.e. `part0=-0` -> `Literal(0, IntegerType)`, `part0=abc` -> `Literal(UTF8String.fromString("abc"), StringType)`.
2. Finds the common type for all literals of a partition column. For the example above, it is `StringType`.
3. Casts those literals to the desired type:
  - `Cast(Literal(0, IntegerType), StringType)` -> `UTF8String.fromString("0")`
  - `Cast(Literal(UTF8String.fromString("abc"), StringType), StringType)` -> `UTF8String.fromString("abc")`

In the example, we get the partition part value "0", which is different from the original one, "-0". Spark shouldn't modify partition part values of the string type because it can influence query results.

Closes #31423

### Why are the changes needed?
The changes fix the bug demonstrated by the example:
1. There are partitioned parquet files (file format doesn't matter):
```
/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
├── _SUCCESS
├── part=-0
│   └── part-00001-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
└── part=AA
    └── part-00000-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
```
placed into two partitions "AA" and **"-0"**.

2. When reading them w/o specified schema:
```
val df = spark.read.parquet(path)
df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- part: string (nullable = true)
```
the inferred type of the partition column `part` is the **string** type.
3. The expected values in the column `part` are "AA" and "-0" but we get:
```
df.show(false)
+---+----+
|id |part|
+---+----+
|0  |AA  |
|1  |0   |
+---+----+
```
So, Spark returns **"0"** instead of **"-0"**.

### Does this PR introduce _any_ user-facing change?
This PR can change query results.

### How was this patch tested?
By running new test and existing test suites:
```
$ build/sbt "test:testOnly *FileIndexSuite"
$ build/sbt "test:testOnly *ParquetV1PartitionDiscoverySuite"
$ build/sbt "test:testOnly *ParquetV2PartitionDiscoverySuite"
```

Closes #31549 from MaxGekk/fix-partition-file-index-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 08:36:13 +00:00
Max Gekk 4a9a1d42e7 [SPARK-34466][SQL][DOCS] Improve docs for ALTER TABLE .. RENAME TO
### What changes were proposed in this pull request?
Explicitly highlight that the table rename command cannot move a table between databases.

### Why are the changes needed?
To inform users about actual behavior of the table rename command.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```sql
spark-sql> CREATE DATABASE db1;
spark-sql> CREATE DATABASE db2;
spark-sql> CREATE TABLE db1.tbl1 (c0 INT);
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db2.tbl1;
Error in query: RENAME TABLE source and destination databases do not match: 'db1' != 'db2';
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db1.tbl2;
spark-sql> SHOW TABLES IN db1 LIKE '*';
db1	tbl2	false
```

Closes #31586 from MaxGekk/doc-rename-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 04:48:16 +00:00
yzjg 26548edfa2 [MINOR][SQL][DOCS] Fix the comments in the example at window function
### What changes were proposed in this pull request?

The `window` function example in `functions.scala` has a comment error in the field name. The column should be `time`, per `timestamp:TimestampType`.
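
For context, a hedged reconstruction of the corrected example (per the description above, the time column is named `time` and is of `TimestampType`; the rest of the schema, and the DataFrame `df`, are illustrative):

```scala
import org.apache.spark.sql.functions.{col, mean, window}

// df schema => time: TimestampType, stockId: StringType, price: DoubleType
df.groupBy(window(col("time"), "1 minute"), col("stockId"))
  .agg(mean("price"))
```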

### Why are the changes needed?

To deliver the correct documentation and examples.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing docs.

### How was this patch tested?

CI builds in this PR should test the documentation build.

Closes #31582 from yzjg/yzjg-patch-1.

Authored-by: yzjg <785246661@qq.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-19 10:45:21 +09:00
Max Gekk cad469d47a [SPARK-34465][SQL] Rename v2 alter table exec nodes
### What changes were proposed in this pull request?
Rename the following v2 exec nodes:
- AlterTableAddPartitionExec -> AddPartitionExec
- AlterTableRenamePartitionExec -> RenamePartitionExec
- AlterTableDropPartitionExec -> DropPartitionExec

### Why are the changes needed?
- To be consistent with the v2 exec node added before: `ALTER TABLE .. RENAME TO` -> `RenameTableExec`.
- For simplicity and readability of the execution plans.

### Does this PR introduce _any_ user-facing change?
Should not, since this is an internal API.

### How was this patch tested?
By running the existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31584 from MaxGekk/rename-alter-table-exec-nodes.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 14:33:26 -08:00
Dongjoon Hyun 331c6fd4ef [SPARK-34467][BUILD] Upgrade Zstd-jni to 1.4.8-4
### What changes were proposed in this pull request?

This PR aims to upgrade the Zstd-JNI library to 1.4.8-4 to bring JNI-side optimizations.
`ZStandardBenchmark` shows that there is no performance regression, and shows some improvements.

### Why are the changes needed?

https://github.com/luben/zstd-jni/commits/v1.4.8-4
- be9be47fae
- be51ebade1
- 44ff8b6f95

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31585 from dongjoon-hyun/SPARK-ZSTD-1.4.8-4.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 13:35:49 -08:00
Max Gekk 8f7ec4b28e [SPARK-34454][SQL] Mark legacy SQL configs as internal
### What changes were proposed in this pull request?
1. Mark the following SQL configs as internal:
    - spark.sql.legacy.allowHashOnMapType
    - spark.sql.legacy.sessionInitWithConfigDefaults
2. Add a test to check that all SQL configs from the `legacy` namespace are marked as internal configs.

### Why are the changes needed?
Legacy SQL configs shouldn't need to be set by users in common cases; their purpose is to allow switching to old behavior in corner cases. So, the configs should be marked as internal.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running new test:
```
$ build/sbt "test:testOnly *SQLConfSuite"
```

Closes #31577 from MaxGekk/mark-legacy-configs-as-internal.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 10:39:51 -08:00
Chao Sun 27873280ff [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
### What changes were proposed in this pull request?

Currently `SpecificParquetRecordReaderBase` uses deprecated Parquet APIs in a few places, such as `readFooter`, `ParquetInputSplit`, `new ParquetFileReader`, `filterRowGroups`, etc. This replaces them with the newer APIs (see the sketch after this list). Specifically, this:
- Replaces `ParquetInputSplit` with `FileSplit`. We never use `ParquetInputSplit`-specific things such as `rowGroupOffsets`, so the swap is pretty simple.
- Removes `readFooter` calls by using `ParquetFileReader.open`.
- Replaces the deprecated `ParquetFileReader` ctor with the newer API which takes `ParquetReadOptions`.
- Removes the unnecessary handling of the case when `rowGroupOffsets` is not null. It seems this never happens.
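
For reference, a hedged sketch of the newer Parquet entry points involved (the path and read range are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val file = HadoopInputFile.fromPath(new Path("/data/part-00000.parquet"), conf)

// ParquetReadOptions replaces the deprecated readFooter/ParquetFileReader ctor;
// withRange restricts reading to the row groups overlapping the split.
val options = HadoopReadOptions.builder(conf)
  .withRange(0L, Long.MaxValue)
  .build()

val reader = ParquetFileReader.open(file, options)
val footer = reader.getFooter  // previously obtained via the deprecated readFooter
```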

### Why are the changes needed?

The aforementioned APIs were deprecated and are going to be removed at some point in the future. This is to ensure better supportability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a cleanup and relies on existing tests on the relevant code paths.

Closes #29542 from sunchao/SPARK-32703.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-18 10:18:14 -06:00
Steve Loughran ff5115c3ac [SPARK-33739][SQL] Jobs committed through the S3A Magic committer don't track bytes
Change `BasicWriteStatsTracker` to probe for a custom XAttr if the size of the generated file is 0 bytes; if found and parseable, use that as the declared length of the output.

The matching Hadoop patch in HADOOP-17414:

* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of
  the data in the marker file.

As a result, Spark job tracking will correctly report the amount of data uploaded and yet to materialize.

### Why are the changes needed?

Now that S3 is consistent, it's a lot easier to use the S3A "magic" committer
which redirects a file written to `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro`
to its final destination `dest/year=2020/output.avro` , adding a zero byte marker file at
the end and a json file `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro.pending`
containing all the information for the job committer to complete the upload.

But: the write tracker statistics don't show progress, as they measure the length of the created file, find the marker file, and report 0 bytes.
By probing for a specific HTTP header in the marker file, and parsing it if retrieved, the real progress can be reported.

There's a matching change in Hadoop [https://github.com/apache/hadoop/pull/2530](https://github.com/apache/hadoop/pull/2530)
which adds getXAttr API support to the S3A connector and returns the headers; the magic
committer adds the relevant attributes.

If the FS being probed doesn't support the XAttr API, the header is missing, or the value is not a positive long, then a size of 0 is returned.
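
A hedged sketch of such a probe (the attribute name `header.x-hadoop-s3a-magic-data-length` comes from the Hadoop change above; the surrounding parsing logic is illustrative):

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Returns Some(length) when the zero-byte marker file carries the real data length.
def probeMagicLength(fs: FileSystem, path: Path): Option[Long] =
  try {
    Option(fs.getXAttr(path, "header.x-hadoop-s3a-magic-data-length"))
      .map(bytes => new String(bytes, StandardCharsets.UTF_8).trim.toLong)
      .filter(_ > 0)
  } catch {
    // No XAttr support, a missing attribute, or an unparseable value => no length.
    case _: UnsupportedOperationException | _: IOException | _: NumberFormatException => None
  }
```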

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests in BasicWriteTaskStatsTrackerSuite which use a filter FS to
implement getXAttr on top of LocalFS; this is used to explore the set of
options:
* no XAttr API implementation (existing tests; what callers would see with
  most filesystems)
* no attribute found (HDFS, ABFS without the attribute)
* invalid data of different forms

All of these return Some(0) as file length.

The Hadoop PR verifies XAttr implementation in S3A and that
the commit protocol attaches the header to the files.

External downstream testing has done the full hadoop+spark end
to end operation, with manual review of logs to verify that the
data was successfully collected from the attribute.

Closes #30714 from steveloughran/cdpd/SPARK-33739-magic-commit-tracking-master.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-02-18 08:43:18 -06:00
gengjiaan edccf96cad [SPARK-34394][SQL] Unify output of SHOW FUNCTIONS and pass output attributes properly
### What changes were proposed in this pull request?
The current implementation of some DDL commands does not unify the output and does not pass the output attributes properly to the physical command. For example, the output attributes of `ShowFunctions` are not passed to `ShowFunctionsCommand` properly.

This PR passes the output attributes from `ShowFunctions` to `ShowFunctionsCommand` as part of the query plan.

### Why are the changes needed?
Passing the output attributes keeps the expression IDs unchanged, which avoids bugs when we apply more operators on top of the command's output DataFrame.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31519 from beliefer/SPARK-34394.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-18 12:50:50 +00:00
gengjiaan c925e4d0fd [SPARK-34393][SQL] Unify output of SHOW VIEWS and pass output attributes properly
### What changes were proposed in this pull request?
The current implementation of some DDL commands does not unify the output and does not pass the output attributes properly to the physical command. For example, the output attributes of `ShowViews` are not passed to `ShowViewsCommand` properly.

This PR passes the output attributes from `ShowViews` to `ShowViewsCommand` as part of the query plan.

### Why are the changes needed?
Passing the output attributes keeps the expression IDs unchanged, which avoids bugs when we apply more operators on top of the command's output DataFrame.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31508 from beliefer/SPARK-34393.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-18 12:48:39 +00:00
Kousuke Saruta 5167228172 [SPARK-34449][BUILD] Upgrade Jetty to fix CVE-2020-27218
### What changes were proposed in this pull request?

This PR upgrades Jetty from `9.4.34` to `9.4.36`.

### Why are the changes needed?

CVE-2020-27218 affects currently used Jetty 9.4.34.
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27218

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified an existing test and added a new test to comply with the new version of Jetty.

Closes #31574 from sarutak/upgrade-jetty-9.4.36.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 18:02:34 +09:00
Max Gekk b58f0976a9 [SPARK-34437][SQL][DOCS] Update Spark SQL guide about the rebasing DS options and SQL configs
### What changes were proposed in this pull request?
In the PR, I propose to update the Spark SQL guide about the SQL configs that are related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead

Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode

and Avro options added by #31529:
- datetimeRebaseMode

<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">

### Why are the changes needed?
To inform users about supported DS options and SQL configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31564 from MaxGekk/doc-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 17:48:50 +09:00
Max Gekk 7b549c3e53 [SPARK-34455][SQL] Deprecate spark.sql.legacy.replaceDatabricksSparkAvro.enabled
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` to the list of deprecated configs, `deprecatedSQLConfigs`
2. Update docs for the Avro datasource
<img width="982" alt="Screenshot 2021-02-17 at 21 04 26" src="https://user-images.githubusercontent.com/1580697/108249890-abed7180-7166-11eb-8cb7-0c246d2a34fc.png">

### Why are the changes needed?
The config has existed for long enough. We can deprecate it and recommend that users use `.format("avro")` instead.

### Does this PR introduce _any_ user-facing change?
Should not, except for the warning with the recommendation to use the `avro` format.

### How was this patch tested?
1. By generating docs via:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
2. Manually checking the warning:
```
scala> spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", false)
21/02/17 21:20:18 WARN SQLConf: The SQL config 'spark.sql.legacy.replaceDatabricksSparkAvro.enabled' has been deprecated in Spark v3.2 and may be removed in the future. Use `.format("avro")` in `DataFrameWriter` or `DataFrameReader` instead.
```

Closes #31578 from MaxGekk/deprecate-replaceDatabricksSparkAvro.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 21:54:20 -08:00
“attilapiros” bdcad33d8b [SPARK-34433][DOCS] Lock Jekyll version by Gemfile and Bundler
### What changes were proposed in this pull request?

Improve the documentation and release process by pinning the Jekyll version via Gemfile and Bundler.

Some files and their responsibilities within this PR:
- `docs/.bundle/config` is used to specify the directory "docs/.local_ruby_bundle", which will be used as the destination to install the Ruby packages into, instead of the global one which requires root access
- `docs/Gemfile` specifies the required Jekyll version and other top-level gem versions
- `docs/Gemfile.lock` is generated by "bundle install". This file contains the exact resolved versions of all the gems, including the top-level gems and all their direct and transitive dependencies. When this file is generated it contains a platform-related section "PLATFORMS" (in my case, after the generation, it was "universal-darwin-19"). Still, this file must be under version control, as an error occurs when the version of a gem does not fit the one specified in `Gemfile` (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0, it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19"), first we have to add "ruby" as a platform ([which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby"; then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19".

After this, the correct process to update the Jekyll version is the following:
1. update the version in `Gemfile`
2. run "bundle update" which updates the `Gemfile.lock`
3. commit both files

This version-update process has been tested; for details, please check the testing section.

### Why are the changes needed?

Using different Jekyll versions can generate different output documents.
This PR standardizes the process.

### Does this PR introduce _any_ user-facing change?

No, assuming the release is done via Docker using `do-release-docker.sh`.
In that case there should be no difference at all, as the same Jekyll version is specified in the Gemfile.

### How was this patch tested?

#### Testing document generation

Doc generation step was triggered via  the docker release:

```
$ ./do-release-docker.sh -d ~/working -n -s docs
...
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
Skipping publish step.
```

The docs.log contains the followings:
```
Building Spark docs
Fetching gem metadata from https://rubygems.org/.........
Using bundler 2.2.9
Fetching rb-fsevent 0.10.4
Fetching forwardable-extended 2.6.0
Fetching public_suffix 4.0.6
Fetching colorator 1.1.0
Fetching eventmachine 1.2.7
Fetching http_parser.rb 0.6.0
Fetching ffi 1.14.2
Fetching concurrent-ruby 1.1.8
Installing colorator 1.1.0
Installing forwardable-extended 2.6.0
Installing rb-fsevent 0.10.4
Installing public_suffix 4.0.6
Installing http_parser.rb 0.6.0 with native extensions
Installing eventmachine 1.2.7 with native extensions
Installing concurrent-ruby 1.1.8
Fetching rexml 3.2.4
Fetching liquid 4.0.3
Installing ffi 1.14.2 with native extensions
Installing rexml 3.2.4
Installing liquid 4.0.3
Fetching mercenary 0.4.0
Installing mercenary 0.4.0
Fetching rouge 3.26.0
Installing rouge 3.26.0
Fetching safe_yaml 1.0.5
Installing safe_yaml 1.0.5
Fetching unicode-display_width 1.7.0
Installing unicode-display_width 1.7.0
Fetching webrick 1.7.0
Installing webrick 1.7.0
Fetching pathutil 0.16.2
Fetching kramdown 2.3.0
Fetching terminal-table 2.0.0
Fetching addressable 2.7.0
Fetching i18n 1.8.9
Installing terminal-table 2.0.0
Installing pathutil 0.16.2
Installing i18n 1.8.9
Installing addressable 2.7.0
Installing kramdown 2.3.0
Fetching kramdown-parser-gfm 1.1.0
Installing kramdown-parser-gfm 1.1.0
Fetching rb-inotify 0.10.1
Fetching sassc 2.4.0
Fetching em-websocket 0.5.2
Installing rb-inotify 0.10.1
Installing em-websocket 0.5.2
Installing sassc 2.4.0 with native extensions
Fetching listen 3.4.1
Installing listen 3.4.1
Fetching jekyll-watch 2.2.1
Installing jekyll-watch 2.2.1
Fetching jekyll-sass-converter 2.1.0
Installing jekyll-sass-converter 2.1.0
Fetching jekyll 4.2.0
Installing jekyll 4.2.0
Fetching jekyll-redirect-from 0.16.0
Installing jekyll-redirect-from 0.16.0
Bundle complete! 4 Gemfile dependencies, 30 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
```

#### Testing Jekyll (or other gem) update

First locally I reverted Jekyll to 4.1.0:
```
$ rm Gemfile.lock
$ rm -rf .local_ruby_bundle

# edited Gemfile to use version 4.1.0
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.1.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ bundle install
...
```

Testing Jekyll version before the update:

```
$ bundle exec jekyll --version
jekyll 4.1.0
```

Imitating Jekyll update coming from git by reverting my local changes:

```
$ git checkout Gemfile
Updated 1 path from the index
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"

$ git checkout Gemfile.lock
Updated 1 path from the index
```

Run the install:

```
$ bundle install
...
```

Checking the updated Jekyll version:
```
$ bundle exec jekyll --version
jekyll 4.2.0
```

Closes #31559 from attilapiros/pin-jekyll-version.

Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 12:17:57 +09:00
Anton Okolnychyi 1ad343238c [SPARK-33736][SQL] Handle MERGE in ReplaceNullWithFalseInPredicate
### What changes were proposed in this pull request?

This PR handles merge operations in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

These changes are needed to match what we already do for delete and update operations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends existing tests to cover merge operations.

Closes #31579 from aokolnychyi/spark-33736.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 17:27:21 -08:00
Anton Okolnychyi 44a9aed0d7 [SPARK-34456][SQL] Remove unused write options from BatchWriteHelper
### What changes were proposed in this pull request?

This PR removes dead code from `BatchWriteHelper` after SPARK-33808.

### Why are the changes needed?

These changes simplify `BatchWriteHelper` by removing write options that are no longer needed as we build `Write` earlier.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31581 from aokolnychyi/simplify-batch-write-helper.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 17:25:26 -08:00
Cheng Su a575e805a1 [SPARK-34446][SS][DOCS] Update doc for stream-stream join (full outer + left semi)
### What changes were proposed in this pull request?

Per the discussion in https://issues.apache.org/jira/browse/SPARK-32883?focusedCommentId=17285057&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17285057, we should add documentation for the newly added full outer and left semi joins to the SS programming guide.

* Reworded the section "Outer Joins with Watermarking" so it also covers full outer join, and updated the code snippet to show full outer and left semi joins (see the sketch after this list).
* Added a section "Semi Joins with Watermarking", similar to "Outer Joins with Watermarking".
* Updated "Support matrix for joins in streaming queries" to reflect the latest state of full outer and left semi joins.

### Why are the changes needed?

It is good for users and developers to follow the guide to try out these two new features.

### Does this PR introduce _any_ user-facing change?

Yes. They will see the corresponding updated guide.

### How was this patch tested?

No, just a documentation change. Previewed the markdown file in a browser.
Also attached here is the change to the "Support matrix for joins in streaming queries" table.

<img width="896" alt="Screen Shot 2021-02-16 at 8 12 07 PM" src="https://user-images.githubusercontent.com/4629931/108155275-73c92e80-7093-11eb-9f0b-c8b4bb7321e5.png">

Closes #31572 from c21/ss-doc.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-18 09:34:33 +09:00
“attilapiros” 76e5d75e36 [SPARK-33763] Add metrics for better tracking of dynamic allocation
### What changes were proposed in this pull request?

This PR adds the following metrics to track executor removal reasons during dynamic allocation:
- `numberExecutorsGracefullyDecommissioned`: number of executors that reached the finished decommissioning state and shut themselves down cleanly
- `numberExecutorsDecommissionUnfinished`: executors that were asked to decommission but stopped without reaching the finished decommissioning state
- `numberExecutorsKilledByDriver`: executors killed by the driver (requested to stop)
- `numberExecutorsExitedUnexpectedly`: executors that exited without a driver request

### Why are the changes needed?

To better support monitoring of dynamic allocation with these metrics.

### Does this PR introduce _any_ user-facing change?

Yes. The new metrics will be available for monitoring.

### How was this patch tested?

With unit and integration tests.

Finally, I manually checked the new metrics in JConsole:
<img width="1054" alt="jmx" src="https://user-images.githubusercontent.com/2017933/107458686-de8adf00-6b54-11eb-86f7-41faf2fb638f.png">

Closes #31450 from attilapiros/SPARK-33763-final.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-17 13:44:36 -08:00
Max Gekk 5957bc18a1 [SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs
### What changes were proposed in this pull request?
Move the datetime rebase SQL configs out of the `legacy` namespace by:
1. Renaming the existing rebase configs, e.g. `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`.
2. Adding the legacy configs as alternatives.
3. Deprecating the legacy rebase configs.

### Why are the changes needed?
The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datetime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered legacy configs.

### Does this PR introduce _any_ user-facing change?
Should not. Users will see a warning if they still use one of the legacy configs.

### How was this patch tested?
1. Manually checking new configs:
```scala
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res0: String = EXCEPTION

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.

scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res2: String = LEGACY
```
2. By running a datetime rebasing test suite:
```
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31576 from MaxGekk/rebase-confs-alternatives.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-17 14:04:47 +00:00
Kousuke Saruta dd6383f0a3 [SPARK-34333][SQL] Fix PostgresDialect to handle money types properly
### What changes were proposed in this pull request?

This PR changes the type mapping for `money` and `money[]`  types for PostgreSQL.
Currently, Spark tries to convert those types to `DoubleType` and `ArrayType` of `double`, respectively.
But the JDBC driver seems unable to handle those types properly.

https://github.com/pgjdbc/pgjdbc/issues/100
https://github.com/pgjdbc/pgjdbc/issues/1405

Due to these issues, we can get errors like the following.

money type.
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : 1,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.PgResultSet.getDouble(PgResultSet.java:2432)
[info] 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$5(JdbcUtils.scala:418)
```

money[] type.
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : $2,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.ArrayDecoding$5.parseValue(ArrayDecoding.java:235)
[info] 	at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:122)
[info] 	at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:764)
[info] 	at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:310)
[info] 	at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:171)
[info] 	at org.postgresql.jdbc.PgArray.getArray(PgArray.java:111)
```

For the money type, a known workaround is to treat it as a string, so this PR does that.
For money[], however, there is no reasonable workaround, so this PR removes the support.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

Yes. Once this PR is merged, the money type is mapped to `StringType` rather than `DoubleType`, and support for money[] is dropped.
For the money type, if the value is less than one thousand, `$100.00` for instance, it works without this change, so I also updated the migration guide because it's a behavior change for such small values.
On the other hand, money[] seems not to work with any value, but it is mentioned in the migration guide just in case.

### How was this patch tested?

New test.

Closes #31442 from sarutak/fix-for-money-type.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-17 10:50:06 +09:00
“attilapiros” 5f91245cc2 [SPARK-34426][K8S][TESTS] Add driver and executors POD logs to integration tests log when the test fails
### What changes were proposed in this pull request?

This PR introduces a new protected method in `SparkFunSuite` which is called only when a test fails and can be used to collect logs for the failed test. This PR implements it for the Kubernetes tests in the `KubernetesSuite` class, where it collects all the POD logs and logs them out.

This unfortunately cannot be realized with a simple `after` method, as the test outcome is not available there.

Moreover, this PR removes `appLocator` as a method argument, since `appLocator` is available as a member variable.

### Why are the changes needed?

Currently both the driver and executor logs are lost.

In [developer-tools](https://spark.apache.org/developer-tools.html) there is a hint:
"Getting logs from the pods and containers directly is an exercise left to the reader."

But when the test is executed by Jenkins and a failure happens, we really need the POD logs to analyze the problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By integration testing. I checked what would happen if one test fails; the output would be:

```
21/02/14 11:05:34.261 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite:

===== EXTRA LOGS FOR THE FAILED TEST

21/02/14 11:05:34.278 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver POD log
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/tests/decommissioning.py
21/02/14 10:02:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting decom test
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/02/14 10:02:29 INFO SparkContext: Running Spark version 3.2.0-SNAPSHOT
21/02/14 10:02:29 INFO ResourceUtils: ==============================================================
21/02/14 10:02:29 INFO ResourceUtils: No custom resources configured for spark.driver.
21/02/14 10:02:29 INFO ResourceUtils: ==============================================================
...
21/02/14 10:03:17 INFO ShutdownHookManager: Deleting directory /var/data/spark-fa6961ed-a2c1-444c-bfeb-20e63ba0b5cf/spark-ab4b0287-6e24-4b39-837e-9b0b62c1f26f
21/02/14 10:03:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-d6b11e7d-6a03-4a1d-8559-37cb853319bf

21/02/14 11:05:34.279 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver POD log
```

Closes #31561 from attilapiros/SPARK-34426.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-17 05:49:16 +09:00
Max Gekk 1a11fe5501 [SPARK-33210][SQL][DOCS][FOLLOWUP] Fix descriptions of the SQL configs for the parquet INT96 rebase modes
### What changes were proposed in this pull request?
Fix descriptions of the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInWrite`, and mention `EXCEPTION` as the default value.

### Why are the changes needed?
This fixes incorrect descriptions that can mislead users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #31557 from MaxGekk/int96-exception-by-default-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-16 11:55:53 +09:00
Max Gekk 03161055de [SPARK-34424][SQL][TESTS] Fix failures of HiveOrcHadoopFsRelationSuite
### What changes were proposed in this pull request?
Modify `RandomDataGenerator.forType()` to allow generation of dates/timestamps that are valid in both the Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example `1582-10-06`) which is valid in the Proleptic Gregorian calendar but cannot be saved to ORC files as-is, since the ORC format (the ORC libs, in fact) assumes the Julian calendar. So, Spark shifts `1582-10-06` to the next valid date, `1582-10-15`, while saving it to ORC files. As a consequence, the test fails because it compares the original date `1582-10-06` with the date `1582-10-15` loaded back from the ORC files.

In this PR, I propose to generate valid dates/timestamps in both calendars for ORC datasource till SPARK-34440 is resolved.

### Why are the changes needed?
The changes fix failures of `HiveOrcHadoopFsRelationSuite`. For instance, the test "test all data types" fails with the seed **610710213676**:
```
== Results ==
!== Correct Answer - 20 ==    == Spark Answer - 20 ==
 struct<index:int,col:date>   struct<index:int,col:date>
...
![9,1582-10-06]               [9,1582-10-15]
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite"
```

Closes #31552 from MaxGekk/fix-HiveOrcHadoopFsRelationSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-16 11:53:26 +09:00
herman 4fd3247bca [SPARK-34431][CORE] Only load hive-site.xml once
### What changes were proposed in this pull request?
Lazily load Hive's configuration properties from `hive-site.xml` only once.

### Why are the changes needed?
It is expensive to parse the same file over and over.

### Does this PR introduce _any_ user-facing change?
Should not. The changes can improve performance slightly.

### How was this patch tested?
By existing test suites such as `SparkContextSuite`.

Closes #31556 from MaxGekk/load-hive-site-once.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-15 09:31:51 -08:00
Max Gekk aca6db1868 [SPARK-34434][SQL] Mention DS rebase options in SparkUpgradeException
### What changes were proposed in this pull request?
Mention the DS options introduced by https://github.com/apache/spark/pull/31529 and by https://github.com/apache/spark/pull/31489 in `SparkUpgradeException`.

### Why are the changes needed?
To improve the user experience with Spark SQL. Before the changes, the error message recommends setting SQL configs, but the configs cannot help in some situations (see the PRs for more details).

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the error message is:

_org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. To read the datetime values as it is, set the SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' or the datasource option 'datetimeRebaseMode' to 'CORRECTED'._

### How was this patch tested?
1. By checking coding style: `./dev/scalastyle`
2. By running the related test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31562 from MaxGekk/rebase-upgrade-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-14 17:42:15 -08:00
oraviv 3d39dfa8c3 [SPARK-34416][SQL] Adding support for user provided schema url in Avro
### What changes were proposed in this pull request?

Added option to provide Avro schema by URL.
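
A hedged usage sketch (the `avroSchemaUrl` option name comes from this PR; paths are illustrative):

```scala
// Read Avro data, resolving the (large) schema from a URL instead of inlining its JSON.
val df = spark.read
  .format("avro")
  .option("avroSchemaUrl", "hdfs:///schemas/huge_table.avsc")  // illustrative path
  .load("/data/huge_table")
```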

### Why are the changes needed?
(copied from Jira ticket)

We have a use case in which we read a huge table in Avro format. About 30k columns.

Using the default Hive reader, `AvroGenericRecordReader`, it just hangs forever; after 4 hours not even one task has finished.

We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)` but we failed on:

```

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema

..

at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 53 elided

```

because the files' schema contains duplicate column names (when compared case-insensitively).

So we wanted to provide a user schema with non-duplicated fields, but the schema is huge (a few MBs), so it is not practical to provide it inline as JSON.

So we patched spark-avro to also accept an `avroSchemaUrl` option in addition to `avroSchema`, and it worked perfectly.
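
Based on the option added in this PR, usage would look roughly like this (the URL and data path are hypothetical):
```scala
val df = spark.read
  .format("avro")
  .option("avroSchemaUrl", "hdfs:///schemas/huge_table.avsc")  // schema is fetched from the URL
  .load("/data/huge_table")
```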

### How was this patch tested?
Added a unit test to AvroSuite and tested locally with the patched version.

Closes #31543 from uzadude/avro_schema.

Authored-by: oraviv <oraviv@paypal.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-14 13:43:57 -08:00
Eric Lemmon e3b6e4ad43 [SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs
### What changes were proposed in this pull request?
Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). Then a `:class:` reference to `RuntimeConfig` was added to the `SparkSession.conf` docstring to create a link there as well.

### Why are the changes needed?
No docs were generated for `pyspark.sql.conf.RuntimeConfig`.

### Does this PR introduce _any_ user-facing change?
Yes: a new Configuration section on the Spark SQL page and a `Returns` section in the `SparkSession.conf` docstring, which now links to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch.

### How was this patch tested?
First built the Python docs:
```bash
cd $SPARK_HOME/docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve
```
Then verified all pages and links:
1. Configuration link displayed on the API Reference page, and it clicks through to Spark SQL page:
http://localhost:4000/api/python/reference/index.html
![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png)

2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page:
http://localhost:4000/api/python/reference/pyspark.sql.html#configuration
![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)

3. RuntimeConfig page displayed:
http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html
![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png)

4. SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page:
http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html
![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png)

Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable.

Authored-by: Eric Lemmon <eric@lemmon.cc>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-13 09:32:55 -06:00
William Hyun f2e1468496 [SPARK-34428][BUILD] Update sbt version to 1.4.7
### What changes were proposed in this pull request?

This PR aims to update the sbt version to 1.4.7.

### Why are the changes needed?
This will bring the latest bug fixes and improvements.

- https://github.com/sbt/sbt/releases/tag/v1.4.7

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #31555 from williamhyun/sbt147.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-12 18:38:08 -08:00
Terry Kim 9a566f83a0 [SPARK-34380][SQL] Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES for v2 command
### What changes were proposed in this pull request?

This PR proposes to support the `ifExists` flag for the v2 `ALTER TABLE ... UNSET TBLPROPERTIES` command. Currently, the flag is not respected, and the command behaves as if `ifExists = true`: it always succeeds even when the properties do not exist.

### Why are the changes needed?

To support the `ifExists` flag and align with the v1 command behavior.

### Does this PR introduce _any_ user-facing change?

Yes, now if the property does not exist and `IF EXISTS` is not specified, the command will fail:
```
ALTER TABLE t UNSET TBLPROPERTIES ('unknown') // Fails with "Attempted to unset non-existent property 'unknown'"
ALTER TABLE t UNSET TBLPROPERTIES IF EXISTS ('unknown') // OK
```

### How was this patch tested?

Added a new test.

Closes #31494 from imback82/AlterTableUnsetPropertiesIfExists.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-12 17:42:43 -08:00
Max Gekk 91be583fb8 [SPARK-34418][SQL][TESTS] Check partitions existence after v1 TRUNCATE TABLE
### What changes were proposed in this pull request?
Add a test and modify an existing one to check that partitions still exist after v1 `TRUNCATE TABLE`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *TruncateTableSuite"
```

Closes #31544 from MaxGekk/test-truncate-partitioned-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-12 15:57:42 -08:00
Liang-Chi Hsieh e0053853c9 [SPARK-34420][SQL] Throw exception if non-streaming Deduplicate is not replaced by aggregate
### What changes were proposed in this pull request?

This patch proposes to throw an exception if a non-streaming `Deduplicate` is not replaced by an aggregate in the query planner.

### Why are the changes needed?

We replace some logical operations in the query optimizer, and the query planner throws exceptions if those logical nodes somehow survive unreplaced. But `Deduplicate` is missing from these checks, which opens a possible hole. For code consistency, and to surface unexpected query-planning errors early, we should add a similar exception case to the query planner, as sketched below.
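
A minimal, self-contained sketch of the guard pattern (illustrative node and method names, not Spark's actual planner code):
```scala
object PlannerGuard {
  sealed trait LogicalPlan
  case class Deduplicate(child: LogicalPlan) extends LogicalPlan
  case class Aggregate(child: LogicalPlan) extends LogicalPlan
  case class Scan(table: String) extends LogicalPlan

  // The optimizer is expected to rewrite non-streaming Deduplicate into Aggregate;
  // if one survives to planning, fail fast instead of silently mis-planning it.
  def assertRewritten(plan: LogicalPlan): LogicalPlan = plan match {
    case Deduplicate(_) =>
      throw new IllegalStateException(
        "Deduplicate should have been replaced by an Aggregate in the optimizer")
    case other => other
  }
}
```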

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #31547 from viirya/minor-deduplicate.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-10 22:40:51 -08:00
HyukjinKwon 92a83463c9 [SPARK-34408][PYTHON] Refactor spark.udf.register to share the same path to generate UDF instance
### What changes were proposed in this pull request?

This PR proposes to use `_create_udf` wherever we need to create a `UserDefinedFunction`, to make the code easier to maintain.

### Why are the changes needed?

For better code readability and maintainability.

### Does this PR introduce _any_ user-facing change?

No, refactoring.

### How was this patch tested?

Ran the existing unit tests. CI in this PR should test it out too.

Closes #31537 from HyukjinKwon/SPARK-34408.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-11 10:57:02 +09:00
Chao Sun cd38287ce2 [SPARK-34419][SQL] Move PartitionTransforms.scala to scala directory
### What changes were proposed in this pull request?

Move `PartitionTransforms.scala` from `sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions` to `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions`.

### Why are the changes needed?

We should put Java/Scala files in their corresponding directories.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31546 from sunchao/SPARK-34419.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-10 17:08:50 -08:00
David Li 9b875ceada [SPARK-32953][PYTHON][SQL] Add Arrow self_destruct support to toPandas
### What changes were proposed in this pull request?

Creating a Pandas dataframe via Apache Arrow can currently use twice as much memory as the final result, because both Pandas and Arrow retain a copy of the data during the conversion. Arrow now has a "self-destruct" mode (Arrow >= 0.16) that avoids this by freeing each column after conversion. This PR integrates support for it in toPandas, handling a couple of edge cases:

`self_destruct` has no effect unless the memory is allocated appropriately, which the Arrow serializer handles here. Essentially, the issue is that `self_destruct` frees memory column-wise, but Arrow record batches are allocated row-wise:

```
Record batch 0: allocation 0: column 0 chunk 0, column 1 chunk 0, ...
Record batch 1: allocation 1: column 0 chunk 1, column 1 chunk 1, ...
```

In this scenario, Arrow will drop references to all of column 0's chunks, but no memory will actually be freed, as the chunks were just slices of an underlying allocation. The PR copies each column into its own allocation so that memory is instead arranged like so:

```
Record batch 0: allocation 0 column 0 chunk 0, allocation 1 column 1 chunk 0, ...
Record batch 1: allocation 2 column 0 chunk 1, allocation 3 column 1 chunk 1, ...
```

The optimization is disabled by default, and can be enabled with the Spark SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled" set to "true". We can't always apply this optimization because it's more likely to generate a dataframe with immutable buffers, which Pandas doesn't always handle well, and because it is slower overall (since it only converts one column at a time instead of in parallel).
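
As a hedged usage sketch, enabling the optimization for a session (the conf name is from this PR; `toPandas` itself runs on the PySpark side):
```scala
// After this, DataFrame.toPandas() calls from PySpark free each Arrow column
// as soon as it is converted, reducing peak memory from about 2x the result
// size to about 1.25x.
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")
```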

### Why are the changes needed?

This lets us load larger datasets: with N bytes of memory, we could previously never load a dataset bigger than N/2 bytes; now the limit is more like N/1.25 or so.

### Does this PR introduce _any_ user-facing change?

Yes - it adds a new SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled"

### How was this patch tested?

See the [mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html) - it was tested with Python memory_profiler. Unit tests added to check memory within certain bounds and correctness with the option enabled.

Closes #29818 from lidavidm/spark-32953.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2021-02-10 09:58:46 -08:00
gengjiaan 32a523b56f [SPARK-34234][SQL] Remove TreeNodeException that didn't work
### What changes were proposed in this pull request?
`TreeNodeException` makes the error message unclear, and it did not work well.
Because `TreeNodeException` is redundant, we can remove it.

Here is a case that shows the problem:
```
val df = Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4)).toDF("x", "y")
val hashAggDF = df.groupBy("x").agg(c, sum("y"))
```
The above code will use `HashAggregateExec`. To ensure that an exception is thrown when executing `HashAggregateExec`, I added `throw new RuntimeException("calculate error")` into 72b7f8abfb/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L85)

So, if the above code is executed, `RuntimeException("calculate error")` will be thrown.
Before this PR, the error is:
```
execute, tree:
ShuffleQueryStage 0
+- Exchange hashpartitioning(x#105, 5), ENSURE_REQUIREMENTS, [id=#168]
   +- HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
      +- Project [_1#100 AS x#105, _2#101 AS y#106]
         +- LocalTableScan [_1#100, _2#101]

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
ShuffleQueryStage 0
+- Exchange hashpartitioning(x#105, 5), ENSURE_REQUIREMENTS, [id=#168]
   +- HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
      +- Project [_1#100 AS x#105, _2#101 AS y#106]
         +- LocalTableScan [_1#100, _2#101]

	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.doMaterialize(QueryStageExec.scala:163)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.$anonfun$materialize$1(QueryStageExec.scala:81)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.materialize(QueryStageExec.scala:79)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5(AdaptiveSparkPlanExec.scala:207)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5$adapted(AdaptiveSparkPlanExec.scala:205)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:205)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:289)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3708)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2977)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3699)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3697)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2977)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$3(DataFrameAggregateSuite.scala:665)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
	at org.apache.spark.sql.DataFrameAggregateSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244)
	at org.apache.spark.sql.DataFrameAggregateSuite.withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2(DataFrameAggregateSuite.scala:659)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2$adapted(DataFrameAggregateSuite.scala:655)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.sql.DataFrameAggregateSuite.assertNoExceptions(DataFrameAggregateSuite.scala:655)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126(DataFrameAggregateSuite.scala:695)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126$adapted(DataFrameAggregateSuite.scala:695)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$125(DataFrameAggregateSuite.scala:695)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
	at org.scalatest.Suite.run(Suite.scala:1112)
	at org.scalatest.Suite.run$(Suite.scala:1094)
	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
	at org.scalatest.tools.Runner$.run(Runner.scala:798)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
+- Project [_1#100 AS x#105, _2#101 AS y#106]
   +- LocalTableScan [_1#100, _2#101]

	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:84)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:118)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:118)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:122)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:121)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.$anonfun$doMaterialize$1(QueryStageExec.scala:163)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
	... 91 more
Caused by: java.lang.RuntimeException: calculate error
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$1(HashAggregateExec.scala:85)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
	... 103 more
```

After this PR, the error is:
```
calculate error
java.lang.RuntimeException: calculate error
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:84)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:117)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:117)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:121)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:120)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.doMaterialize(QueryStageExec.scala:161)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.$anonfun$materialize$1(QueryStageExec.scala:80)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.materialize(QueryStageExec.scala:78)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5(AdaptiveSparkPlanExec.scala:207)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5$adapted(AdaptiveSparkPlanExec.scala:205)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:205)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:289)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3708)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2977)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3699)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3697)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2977)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$3(DataFrameAggregateSuite.scala:665)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
	at org.apache.spark.sql.DataFrameAggregateSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244)
	at org.apache.spark.sql.DataFrameAggregateSuite.withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2(DataFrameAggregateSuite.scala:659)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2$adapted(DataFrameAggregateSuite.scala:655)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.sql.DataFrameAggregateSuite.assertNoExceptions(DataFrameAggregateSuite.scala:655)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126(DataFrameAggregateSuite.scala:695)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126$adapted(DataFrameAggregateSuite.scala:695)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$125(DataFrameAggregateSuite.scala:695)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
	at org.scalatest.Suite.run(Suite.scala:1112)
	at org.scalatest.Suite.run$(Suite.scala:1094)
	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
	at org.scalatest.tools.Runner$.run(Runner.scala:798)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
```

### Why are the changes needed?
`TreeNodeException` didn't work well.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins test.

Closes #31337 from beliefer/SPARK-34234.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 06:25:33 +00:00
Max Gekk c082c537de [SPARK-34404][SQL] Add new Avro datasource options to control datetime rebasing in read
### What changes were proposed in this pull request?
In this PR, I propose a new option, `datetimeRebaseMode`, for the Avro datasource. The option influences how ancient date and timestamp column values are loaded from Avro files.

The option supports the same values as the SQL config `spark.sql.legacy.avro.datetimeRebaseModeInRead`, namely:
- `"LEGACY"`: Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`: dates/timestamps are read AS IS from Avro files.
- `"EXCEPTION"`: Spark fails the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars.

### Why are the changes needed?
1. The new option allows loading Avro files from two or more sources with different rebasing modes in the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").format("avro").load(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").format("avro").load(folder2)
df1.join(df2, ...)
```
Before the changes, this is impossible because the SQL config `spark.sql.legacy.avro.datetimeRebaseModeInRead` influences both reads.

2. Mixing Dataset/DataFrame and RDD APIs should become possible. Since SQL configs are not propagated to RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.avro.datetimeRebaseModeInRead", "legacy")
spark.read.format("avro").load(folder).distinct.rdd.collect()
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "test:testOnly *AvroV1Suite"
$ build/sbt "test:testOnly *AvroV2Suite"
```

Closes #31529 from MaxGekk/avro-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 06:23:10 +00:00
Chao Sun 0986f16c8d [SPARK-34347][SQL] CatalogImpl.uncacheTable should invalidate in cascade for temp views
### What changes were proposed in this pull request?

This PR includes the following changes:
1. In `CatalogImpl.uncacheTable`, invalidate caches in cascade when the target table is a temp view and `spark.sql.legacy.storeAnalyzedPlanForView` is false (the default value).
2. Make `SessionCatalog.lookupTempView` public and return the processed temp view plan (i.e., with the `View` op).

### Why are the changes needed?

Following [SPARK-34052](https://issues.apache.org/jira/browse/SPARK-34052) (#31107), we should invalidate in cascade for `CatalogImpl.uncacheTable` when the table is a temp view, so that the behavior is consistent.

### Does this PR introduce _any_ user-facing change?

Yes, now `SQLContext.uncacheTable` will uncache a temp view in cascade by default.
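
A hedged illustration of the new behavior (the view names are hypothetical; `spark` is a `SparkSession`):
```scala
spark.range(10).createOrReplaceTempView("t1")
spark.sql("CACHE TABLE t1")
spark.sql("CACHE TABLE t2 AS SELECT id * 2 AS id FROM t1")

// Before this PR only t1's cache was dropped; now dependent caches,
// such as t2's built on top of t1, are invalidated in cascade as well.
spark.catalog.uncacheTable("t1")
```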

### How was this patch tested?

Added a UT

Closes #31462 from sunchao/SPARK-34347.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-09 20:48:58 -08:00