## What changes were proposed in this pull request?
AggUtils.planStreamingAggregation has some comments about DISTINCT aggregates,
while streaming aggregation does not support DISTINCT.
This seems to have been wrongly copy-pasted over.
## How was this patch tested?
Only a comment change.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#18937 from juliuszsompolski/streaming-agg-doc.
## What changes were proposed in this pull request?
`ArrowEvalPythonExec` and `FlatMapGroupsInPandasExec` are refering config values of `SQLConf` in function for `mapPartitions`/`mapPartitionsInternal`, but we should capture them in Driver.
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#19587 from ueshin/issues/SPARK-22370.
## What changes were proposed in this pull request?
Seems that end users can be confused by the union's behavior on Dataset of typed objects. We can clarity it more in the document of `union` function.
## How was this patch tested?
Only document change.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19570 from viirya/SPARK-22335.
## What changes were proposed in this pull request?
Canonicalized plans are not supposed to be executed. I ran into a case in which there's some code that accidentally calls execute on a canonicalized plan. This patch throws a more explicit exception when that happens.
## How was this patch tested?
Added a test case in SparkPlanSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#18828 from rxin/SPARK-21619.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-22333
In current version, users can use CURRENT_DATE() and CURRENT_TIMESTAMP() without specifying braces.
However, when a table has columns named as "current_date" or "current_timestamp", it will still be parsed as function call.
There are many such cases in our production cluster. We get the wrong answer due to this inappropriate behevior. In general, ColumnReference should get higher priority than timeFunctionCall.
## How was this patch tested?
unit test
manul test
Author: donnyzone <wellfengzhu@gmail.com>
Closes#19559 from DonnyZone/master.
## What changes were proposed in this pull request?
Adds a new optimisation rule 'ReplaceExceptWithNotFilter' that replaces Except logical with Filter operator and schedule it before applying 'ReplaceExceptWithAntiJoin' rule. This way we can avoid expensive join operation if one or both of the datasets of the Except operation are fully derived out of Filters from a same parent.
## How was this patch tested?
The patch is tested locally using spark-shell + unit test.
Author: Sathiya <sathiya.kumar@polytechnique.edu>
Closes#19451 from sathiyapk/SPARK-22181-optimize-exceptWithFilter.
## What changes were proposed in this pull request?
SPARK-18016 introduced `NestedClass` to avoid that the many methods generated by `splitExpressions` contribute to the outer class' constant pool, making it growing too much. Unfortunately, despite their definition is stored in the `NestedClass`, they all are invoked in the outer class and for each method invocation, there are two entries added to the constant pool: a `Methodref` and a `Utf8` entry (you can easily check this compiling a simple sample class with `janinoc` and looking at its Constant Pool). This limits the scalability of the solution with very large methods which are split in a lot of small ones. This means that currently we are generating classes like this one:
```
class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
public UnsafeRow apply(InternalRow i) {
rowWriter.zeroOutNullBytes();
apply_0(i);
apply_1(i);
...
nestedClassInstance.apply_862(i);
nestedClassInstance.apply_863(i);
...
nestedClassInstance1.apply_1612(i);
nestedClassInstance1.apply_1613(i);
...
}
...
private class NestedClass {
private void apply_862(InternalRow i) { ... }
private void apply_863(InternalRow i) { ... }
...
}
private class NestedClass1 {
private void apply_1612(InternalRow i) { ... }
private void apply_1613(InternalRow i) { ... }
...
}
}
```
This PR reduce the Constant Pool size of the outer class by adding a new method to each nested class: in this method we invoke all the small methods generated by `splitExpression` in that nested class. In this way, in the outer class there is only one method invocation per nested class, reducing by orders of magnitude the entries in its constant pool because of method invocations. This means that after the patch the generated code becomes:
```
class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
public UnsafeRow apply(InternalRow i) {
rowWriter.zeroOutNullBytes();
apply_0(i);
apply_1(i);
...
nestedClassInstance.apply(i);
nestedClassInstance1.apply(i);
...
}
...
private class NestedClass {
private void apply_862(InternalRow i) { ... }
private void apply_863(InternalRow i) { ... }
...
private void apply(InternalRow i) {
apply_862(i);
apply_863(i);
...
}
}
private class NestedClass1 {
private void apply_1612(InternalRow i) { ... }
private void apply_1613(InternalRow i) { ... }
...
private void apply(InternalRow i) {
apply_1612(i);
apply_1613(i);
...
}
}
}
```
## How was this patch tested?
Added UT and existing UTs
Author: Marco Gaido <mgaido@hortonworks.com>
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#19480 from mgaido91/SPARK-22226.
## What changes were proposed in this pull request?
This PR is to clean the related codes majorly based on the today's code review on https://github.com/apache/spark/pull/19559
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19585 from gatorsmile/trivialFixes.
## What changes were proposed in this pull request?
Adding date and timestamp support with Arrow for `toPandas()` and `pandas_udf`s. Timestamps are stored in Arrow as UTC and manifested to the user as timezone-naive localized to the Python system timezone.
## How was this patch tested?
Added Scala tests for date and timestamp types under ArrowConverters, ArrowUtils, and ArrowWriter suites. Added Python tests for `toPandas()` and `pandas_udf`s with date and timestamp types.
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#18664 from BryanCutler/arrow-date-timestamp-SPARK-21375.
## What changes were proposed in this pull request?
It's possible that users create a `Dataset`, and call `collect` of this `Dataset` in many threads at the same time. Currently `Dataset#collect` just call `encoder.fromRow` to convert spark rows to objects of type T, and this encoder is per-dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row.
This PR fixes this problem, by creating a new projection when calling `Dataset#collect`, so that we have the re-usable row for each method call, instead of each Dataset.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19577 from cloud-fan/encoder.
## What changes were proposed in this pull request?
This is a regression introduced by #14207. After Spark 2.1, we store the inferred schema when creating the table, to avoid inferring schema again at read path. However, there is one special case: overlapped columns between data and partition. For this case, it breaks the assumption of table schema that there is on ovelap between data and partition schema, and partition columns should be at the end. The result is, for Spark 2.1, the table scan has incorrect schema that puts partition columns at the end. For Spark 2.2, we add a check in CatalogTable to validate table schema, which fails at this case.
To fix this issue, a simple and safe approach is to fallback to old behavior when overlapeed columns detected, i.e. store empty schema in metastore.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19579 from cloud-fan/bug2.
## What changes were proposed in this pull request?
Add a flag "spark.sql.files.ignoreMissingFiles" to parallel the existing flag "spark.sql.files.ignoreCorruptFiles".
## How was this patch tested?
new unit test
Author: Jose Torres <jose@databricks.com>
Closes#19581 from joseph-torres/SPARK-22366.
## What changes were proposed in this pull request?
Support unit tests of external code (i.e., applications that use spark) using scalatest that don't want to use FunSuite. SharedSparkContext already supports this, but SharedSQLContext does not.
I've introduced SharedSparkSession as a parent to SharedSQLContext, written in a way that it does support all scalatest styles.
## How was this patch tested?
There are three new unit test suites added that just test using FunSpec, FlatSpec, and WordSpec.
Author: Nathan Kronenfeld <nicole.oresme@gmail.com>
Closes#19529 from nkronenfeld/alternative-style-tests-2.
## What changes were proposed in this pull request?
Removed one unused method.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19508 from viirya/SPARK-20783-followup.
## What changes were proposed in this pull request?
Scala 2.12's `Future` defines two new methods to implement, `transform` and `transformWith`. These can be implemented naturally in Spark's `FutureAction` extension and subclasses, but, only in terms of the new methods that don't exist in Scala 2.11. To support both at the same time, reflection is used to implement these.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#19561 from srowen/SPARK-22322.
## What changes were proposed in this pull request?
Rewritten error message for clarity. Added extra information in case of attribute name collision, hinting the user to double-check referencing two different tables
## How was this patch tested?
No functional changes, only final message has changed. It has been tested manually against the situation proposed in the JIRA ticket. Automated tests in repository pass.
This PR is original work from me and I license this work to the Spark project
Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
Author: Ruben Berenguel Montoro <ruben@dreamattic.com>
Author: Ruben Berenguel <ruben@mostlymaths.net>
Closes#17100 from rberenguel/SPARK-13947-error-message.
## What changes were proposed in this pull request?
We enable table cache `InMemoryTableScanExec` to provide `ColumnarBatch` now. But the cached batches are retrieved without pruning. In this case, we still need to do partition batch pruning.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19569 from viirya/SPARK-22348.
## What changes were proposed in this pull request?
The current implementation of `ApproxCountDistinctForIntervals` is `ImperativeAggregate`. The number of `aggBufferAttributes` is the number of total words in the hllppHelper array. Each hllppHelper has 52 words by default relativeSD.
Since this aggregate function is used in equi-height histogram generation, and the number of buckets in histogram is usually hundreds, the number of `aggBufferAttributes` can easily reach tens of thousands or even more.
This leads to a huge method in codegen and causes error:
```
org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB.
```
Besides, huge generated methods also result in performance regression.
In this PR, we change its implementation to `TypedImperativeAggregate`. After the fix, `ApproxCountDistinctForIntervals` can deal with more than thousands endpoints without throwing codegen error, and improve performance from `20 sec` to `2 sec` in a test case of 500 endpoints.
## How was this patch tested?
Test by an added test case and existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19506 from wzhfy/change_forIntervals_typedAgg.
TIMESTAMP (-101), BINARY_DOUBLE (101) and BINARY_FLOAT (100) are handled in OracleDialect
## What changes were proposed in this pull request?
When a oracle table contains columns whose type is BINARY_FLOAT or BINARY_DOUBLE, spark sql fails to load a table with SQLException
```
java.sql.SQLException: Unsupported type 101
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:235)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:291)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
```
## How was this patch tested?
I updated a UT which covers type conversion test for types (-101, 100, 101), on top of that I tested this change against actual table with those columns and it was able to read and write to the table.
Author: Kohki Nishio <taroplus@me.com>
Closes#19548 from taroplus/oracle_sql_types_101.
## What changes were proposed in this pull request?
When [SPARK-19261](https://issues.apache.org/jira/browse/SPARK-19261) implements `ALTER TABLE ADD COLUMNS`, ORC data source is omitted due to SPARK-14387, SPARK-16628, and SPARK-18355. Now, those issues are fixed and Spark 2.3 is [using Spark schema to read ORC table instead of ORC file schema](e6e36004af). This PR enables `ALTER TABLE ADD COLUMNS` for ORC data source.
## How was this patch tested?
Pass the updated and added test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#19545 from dongjoon-hyun/SPARK-21929.
## What changes were proposed in this pull request?
Plan equality should be computed by `canonicalized`, so we can remove unnecessary `hashCode` and `equals` methods.
## How was this patch tested?
Existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19539 from wzhfy/remove_equals.
## What changes were proposed in this pull request?
This is a follow-up of #18732.
This pr modifies `GroupedData.apply()` method to convert pandas udf to grouped udf implicitly.
## How was this patch tested?
Exisiting tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#19517 from ueshin/issues/SPARK-20396/fup2.
## What changes were proposed in this pull request?
spark does not support grouping__id, it has grouping_id() instead.
But it is not convenient for hive user to change to spark-sql
so this pr is to replace grouping__id with grouping_id()
hive user need not to alter their scripts
## How was this patch tested?
test with SQLQuerySuite.scala
Author: CenYuhai <yuhai.cen@ele.me>
Closes#18270 from cenyuhai/SPARK-21055.
## What changes were proposed in this pull request?
This is a very trivial PR, simply marking `strategies` in `SparkPlanner` with the `override` keyword for clarity since it is overriding `strategies` in `QueryPlanner` two levels up in the class hierarchy. I was reading through the code to learn a bit and got stuck on this fact for a little while, so I figured this may be helpful so that another developer new to the project doesn't get stuck where I was.
I did not make a JIRA ticket for this because it is so trivial, but I'm happy to do so to adhere to the contribution guidelines if required.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Eric Perry <eric@ericjperry.com>
Closes#19537 from ericjperry/override-strategies.
## What changes were proposed in this pull request?
A working prototype for data source v2 write path.
The writing framework is similar to the reading framework. i.e. `WriteSupport` -> `DataSourceV2Writer` -> `DataWriterFactory` -> `DataWriter`.
Similar to the `FileCommitPotocol`, the writing API has job and task level commit/abort to support the transaction.
## How was this patch tested?
new tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19269 from cloud-fan/data-source-v2-write.
## What changes were proposed in this pull request?
Fix java style issues
## How was this patch tested?
Run `./dev/lint-java` locally since it's not run on Jenkins
Author: Andrew Ash <andrew@andrewash.com>
Closes#19486 from ash211/aash/fix-lint-java.
## What changes were proposed in this pull request?
This PR addresses the comments by gatorsmile on [the previous PR](https://github.com/apache/spark/pull/19494).
## How was this patch tested?
Previous UT and added UT.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#19522 from mgaido91/SPARK-22249_FOLLOWUP.
## What changes were proposed in this pull request?
To let the same aggregate function that appear multiple times in an Aggregate be evaluated only once, we need to deduplicate the aggregate expressions. The original code was trying to use a "distinct" call to get a set of aggregate expressions, but did not work, since the "distinct" did not compare semantic equality. And even if it did, further work should be done in result expression rewriting.
In this PR, I changed the "set" to a map mapping the semantic identity of a aggregate expression to itself. Thus, later on, when rewriting result expressions (i.e., output expressions), the aggregate expression reference can be fixed.
## How was this patch tested?
Added a new test in SQLQuerySuite
Author: maryannxue <maryann.xue@gmail.com>
Closes#19488 from maryannxue/spark-22266.
## What changes were proposed in this pull request?
Complex state-updating and/or timeout-handling logic in mapGroupsWithState functions may require taking decisions based on the current event-time watermark and/or processing time. Currently, you can use the SQL function `current_timestamp` to get the current processing time, but it needs to be passed inserted in every row with a select, and then passed through the encoder, which isn't efficient. Furthermore, there is no way to get the current watermark.
This PR exposes both of them through the GroupState API.
Additionally, it also cleans up some of the GroupState docs.
## How was this patch tested?
New unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#19495 from tdas/SPARK-22278.
## What changes were proposed in this pull request?
In Average.scala, it has
```
override lazy val evaluateExpression = child.dataType match {
case DecimalType.Fixed(p, s) =>
// increase the precision and scale to prevent precision loss
val dt = DecimalType.bounded(p + 14, s + 4)
Cast(Cast(sum, dt) / Cast(count, dt), resultType)
case _ =>
Cast(sum, resultType) / Cast(count, resultType)
}
def setChild (newchild: Expression) = {
child = newchild
}
```
It is possible that Cast(count, dt), resultType) will make the precision of the decimal number bigger than 38, and this causes over flow. Since count is an integer and doesn't need a scale, I will cast it using DecimalType.bounded(38,0)
## How was this patch tested?
In DataFrameSuite, I will add a test case.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#19496 from huaxingao/spark-22271.
## What changes were proposed in this pull request?
Evaluate one-sided conditions early in stream-stream joins.
This is in addition to normal filter pushdown, because integrating it with the join logic allows it to take place in outer join scenarios. This means that rows which can never satisfy the join condition won't clog up the state.
## How was this patch tested?
new unit tests
Author: Jose Torres <jose@databricks.com>
Closes#19452 from joseph-torres/SPARK-22136.
## What changes were proposed in this pull request?
#### before
```scala
scala> val words = spark.read.textFile("README.md").flatMap(_.split(" "))
words: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val grouped = words.groupByKey(identity)
grouped: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = org.apache.spark.sql.KeyValueGroupedDataset65214862
```
#### after
```scala
scala> val words = spark.read.textFile("README.md").flatMap(_.split(" "))
words: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val grouped = words.groupByKey(identity)
grouped: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = [key: [value: string], value: [value: string]]
```
## How was this patch tested?
existing ut
cc gatorsmile cloud-fan
Author: Kent Yao <yaooqinn@hotmail.com>
Closes#19363 from yaooqinn/minor-dataset-tostring.
## What changes were proposed in this pull request?
As pointed out in the JIRA, there is a bug which causes an exception to be thrown if `isin` is called with an empty list on a cached DataFrame. The PR fixes it.
## How was this patch tested?
Added UT.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#19494 from mgaido91/SPARK-22249.
## What changes were proposed in this pull request?
This PR aims to
- Rename `OrcRelation` to `OrcFileFormat` object.
- Replace `OrcRelation.ORC_COMPRESSION` with `org.apache.orc.OrcConf.COMPRESS`. Since [SPARK-21422](https://issues.apache.org/jira/browse/SPARK-21422), we can use `OrcConf.COMPRESS` instead of Hive's.
```scala
// The references of Hive's classes will be minimized.
val ORC_COMPRESSION = "orc.compress"
```
## How was this patch tested?
Pass the Jenkins with the existing and updated test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#19502 from dongjoon-hyun/SPARK-22282.
## What changes were proposed in this pull request?
`ObjectHashAggregateExec` should override `outputPartitioning` in order to avoid unnecessary shuffle.
## How was this patch tested?
Added Jenkins test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19501 from viirya/SPARK-22223.
## What changes were proposed in this pull request?
In EnsureStatefulOpPartitioning, we check that the inputRDD to a SparkPlan has the expected partitioning for Streaming Stateful Operators. The problem is that we are not allowed to access this information during planning.
The reason we added that check was because CoalesceExec could actually create RDDs with 0 partitions. We should fix it such that when CoalesceExec says that there is a SinglePartition, there is in fact an inputRDD of 1 partition instead of 0 partitions.
## How was this patch tested?
Regression test in StreamingQuerySuite
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#19467 from brkyvz/stateful-op.
## What changes were proposed in this pull request?
When fixing schema field names using escape characters with `addReferenceMinorObj()` at [SPARK-18952](https://issues.apache.org/jira/browse/SPARK-18952) (#16361), double-quotes around the names were remained and the names become something like `"((java.lang.String) references[1])"`.
```java
/* 055 */ private int maxSteps = 2;
/* 056 */ private int numRows = 0;
/* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[1])", org.apache.spark.sql.types.DataTypes.StringType);
/* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[2])", org.apache.spark.sql.types.DataTypes.LongType);
/* 059 */ private Object emptyVBase;
```
We should remove the double-quotes to refer the values in `references` properly:
```java
/* 055 */ private int maxSteps = 2;
/* 056 */ private int numRows = 0;
/* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1]), org.apache.spark.sql.types.DataTypes.StringType);
/* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2]), org.apache.spark.sql.types.DataTypes.LongType);
/* 059 */ private Object emptyVBase;
```
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#19491 from ueshin/issues/SPARK-22273.
## What changes were proposed in this pull request?
`BasicWriteTaskStatsTracker.getFileSize()` to catch `FileNotFoundException`, log info and then return 0 as a file size.
This ensures that if a newly created file isn't visible due to the store not always having create consistency, the metric collection doesn't cause the failure.
## How was this patch tested?
New test suite included, `BasicWriteTaskStatsTrackerSuite`. This not only checks the resilience to missing files, but verifies the existing logic as to how file statistics are gathered.
Note that in the current implementation
1. if you call `Tracker..getFinalStats()` more than once, the file size count will increase by size of the last file. This could be fixed by clearing the filename field inside `getFinalStats()` itself.
2. If you pass in an empty or null string to `Tracker.newFile(path)` then IllegalArgumentException is raised, but only in `getFinalStats()`, rather than in `newFile`. There's a test for this behaviour in the new suite, as it verifies that only FNFEs get swallowed.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#18979 from steveloughran/cloud/SPARK-21762-missing-files-in-metrics.
## What changes were proposed in this pull request?
This PR changes `keyWithIndexToNumValues` to `keyWithIndexToValue`.
There will be directories on HDFS named with this `keyWithIndexToNumValues`. So if we ever want to fix this, let's fix it now.
## How was this patch tested?
existing unit test cases.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#19435 from lw-lin/keyWithIndex.
## What changes were proposed in this pull request?
This is an effort to reduce the difference between Hive and Spark. Spark supports case-sensitivity in columns. Especially, for Struct types, with `spark.sql.caseSensitive=true`, the following is supported.
```scala
scala> sql("select named_struct('a', 1, 'A', 2).a").show
+--------------------------+
|named_struct(a, 1, A, 2).a|
+--------------------------+
| 1|
+--------------------------+
scala> sql("select named_struct('a', 1, 'A', 2).A").show
+--------------------------+
|named_struct(a, 1, A, 2).A|
+--------------------------+
| 2|
+--------------------------+
```
And vice versa, with `spark.sql.caseSensitive=false`, the following is supported.
```scala
scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
+--------------------+--------------------+
|named_struct(a, 1).A|named_struct(A, 1).a|
+--------------------+--------------------+
| 1| 1|
+--------------------+--------------------+
```
However, types are considered different. For example, SET operations fail.
```scala
scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<A:int> <> struct<a:int> at the first column of the second table;;
'Union
:- Project [named_struct(a, 1) AS named_struct(a, 1)#57]
: +- OneRowRelation$
+- Project [named_struct(A, 2) AS named_struct(A, 2)#58]
+- OneRowRelation$
```
This PR aims to support case-insensitive type equality. For example, in Set operation, the above operation succeed when `spark.sql.caseSensitive=false`.
```scala
scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
+------------------+
|named_struct(a, 1)|
+------------------+
| [1]|
| [2]|
+------------------+
```
## How was this patch tested?
Pass the Jenkins with a newly add test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#18460 from dongjoon-hyun/SPARK-21247.
## What changes were proposed in this pull request?
Due to optimizer removing some unnecessary aliases, the logical and physical plan may have different output attribute ids. FileFormatWriter should handle this when creating the physical sort node.
## How was this patch tested?
new regression test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19483 from cloud-fan/bug2.
## What changes were proposed in this pull request?
The method `deterministic` is frequently called in optimizer.
Refactor `deterministic` as lazy value, in order to avoid redundant computations.
## How was this patch tested?
Simple benchmark test over TPC-DS queries, run time from query string to optimized plan(continuous 20 runs, and get the average of last 5 results):
Before changes: 12601 ms
After changes: 11993ms
This is 4.8% performance improvement.
Also run test with Unit test.
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19478 from gengliangwang/deterministicAsLazyVal.
## What changes were proposed in this pull request?
`ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`
This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this.
Before a committer which isn't a subclass of `ParquetOutputCommitter`, it checks to see if the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If true, and the committer class isn't a parquet committer, it raises a RuntimeException with an error message.
(It could downgrade, of course, but raising an exception makes it clear there won't be an summary. It also makes the behaviour testable.)
Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch will make the code consistent with the docs, adding tests to verify,
## How was this patch tested?
The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary.
| committer | summary | outcome |
|-----------|---------|---------|
| parquet | true | success |
| parquet | false | success |
| marking | false | success with marker |
| marking | true | exception |
All tests are happy.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#19448 from steveloughran/cloud/SPARK-22217-committer.
## What changes were proposed in this pull request?
Adding the code for setting 'aggregate time' metric to non-codegen path in HashAggregateExec and to ObjectHashAggregateExces.
## How was this patch tested?
Tested manually.
Author: Ala Luszczak <ala@databricks.com>
Closes#19473 from ala/fix-agg-time.
## What changes were proposed in this pull request?
As we discussed in https://github.com/apache/spark/pull/19136#discussion_r137023744 , we should push down operators to data source before planning, so that data source can report statistics more accurate.
This PR also includes some cleanup for the read path.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19424 from cloud-fan/follow.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child, when we do the actual writing, we just pass the physical plan to the writer(`FileFormatWriter.write`).
However this is problematic. In Spark SQL, optimizer and planner are allowed to change the schema names a little bit. e.g. `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from "<A: int>" to `<a: int>`. When it comes to writing, especially for self-description data format like parquet, we may write the wrong schema to the file and cause null values at the read path.
Fortunately, in https://github.com/apache/spark/pull/18450 , we decided to allow nested execution and one query can map to multiple executions in the UI. This releases the major restriction in #18604 , and now we don't have to take the input `query` as child of `InsertIntoXXX` commands.
So the fix is simple, this PR partially revert #18064 and make `InsertIntoXXX` commands leaf nodes again.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19474 from cloud-fan/bug.
## What changes were proposed in this pull request?
Implement StreamingRelation.computeStats to fix explain
## How was this patch tested?
- unit tests: `StreamingRelation.computeStats` and `StreamingExecutionRelation.computeStats`.
- regression tests: `explain join with a normal source` and `explain join with MemoryStream`.
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes#19465 from zsxwing/SPARK-21988.
## What changes were proposed in this pull request?
Currently percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default 1/10000, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.
Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.
## How was this patch tested?
Added a new test case and fix existing test cases.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#19438 from wzhfy/improve_percentile_approx.
This change adds a new SQL config key that is equivalent to SparkContext's
"spark.extraListeners", allowing users to register QueryExecutionListener
instances through the Spark configuration system instead of having to
explicitly do it in code.
The code used by SparkContext to implement the feature was refactored into
a helper method in the Utils class, and SQL's ExecutionListenerManager was
modified to use it to initialize listener declared in the configuration.
Unit tests were added to verify all the new functionality.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#19309 from vanzin/SPARK-19558.