## What changes were proposed in this pull request?
This PR implements bulk-copy for `ColumnVector.Array.to<type>Array()` methods (e.g. `toIntArray()`) in `ColumnVector.Array` by using `System.arrayCopy()` or `Platform.copyMemory()`.
Before this PR, when one of these method is called, the generic method in `ArrayData` is called. It is not fast since element-wise copy is performed.
This PR can improve performance of a benchmark program by 1.9x and 3.2x.
Without this PR
```
OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Int Array Best/Avg Time(ms) Rate(M/s) Per Row(ns)
------------------------------------------------------------------------------------------------
ON_HEAP 586 / 628 14.3 69.9
OFF_HEAP 893 / 902 9.4 106.5
```
With this PR
```
OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Int Array Best/Avg Time(ms) Rate(M/s) Per Row(ns)
------------------------------------------------------------------------------------------------
ON_HEAP 306 / 331 27.4 36.4
OFF_HEAP 282 / 287 29.8 33.6
```
Source program
```
(MemoryMode.ON_HEAP :: MemoryMode.OFF_HEAP :: Nil).foreach { memMode => {
val len = 8 * 1024 * 1024
val column = ColumnVector.allocate(len * 2, new ArrayType(IntegerType, false), memMode)
val data = column.arrayData
var i = 0
while (i < len) {
data.putInt(i, i)
i += 1
}
column.putArray(0, 0, len)
val benchmark = new Benchmark("Int Array", len, minNumIters = 20)
benchmark.addCase(s"$memMode") { iter =>
var i = 0
while (i < 50) {
column.getArray(0).toIntArray
i += 1
}
}
benchmark.run
}}
```
## How was this patch tested?
Added test suite
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#18425 from kiszk/SPARK-21217.
## What changes were proposed in this pull request?
Making EventTimeWatermarkExec explicitly UnaryExecNode
/cc tdas zsxwing
## How was this patch tested?
Local build.
Author: Jacek Laskowski <jacek@japila.pl>
Closes#18509 from jaceklaskowski/EventTimeWatermarkExec-UnaryExecNode.
## What changes were proposed in this pull request?
SparkContext is shared by all sessions, we should not update its conf for only one session.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18536 from cloud-fan/config.
## What changes were proposed in this pull request?
Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.
In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#18485 from tdas/SPARK-21267.
## What changes were proposed in this pull request?
Currently we can't produce a `Dataset` containing `Set` in SparkSQL. This PR tries to support serialization/deserialization of `Set`.
Because there's no corresponding internal data type in SparkSQL for a `Set`, the most proper choice for serializing a set should be an array.
## How was this patch tested?
Added unit tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18416 from viirya/SPARK-21204.
## What changes were proposed in this pull request?
When data type is struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.
## How was this patch tested?
New test in SQLQuerySuite.
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#18455 from bogdanrdc/SPARK-21228.
## What changes were proposed in this pull request?
1. move `StatisticsCollectionTestBase` to a separate file.
2. move some test cases to `StatisticsCollectionSuite` so that `hive/StatisticsSuite` only keeps tests that need hive support.
3. clear up some test cases.
## How was this patch tested?
Existing tests.
Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#18545 from wzhfy/cleanStatSuites.
## What changes were proposed in this pull request?
Right now in the UI, after SPARK-20213, we can show the operations to write data out. However, there is no way to associate metrics with data writes. We should show relative metrics on the operations.
#### Supported commands
This change supports updating metrics for file-based data writing operations, including `InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`.
Supported metrics:
* number of written files
* number of dynamic partitions
* total bytes of written data
* total number of output rows
* average writing data out time (ms)
* (TODO) min/med/max number of output rows per file/partition
* (TODO) min/med/max bytes of written data per file/partition
#### Commands not supported
`InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`:
The two commands uses DataSource APIs to write data out, i.e., the logic of writing data out is delegated to the DataSource implementations, such as `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`. So we can't obtain metrics from delegated methods for now.
`CreateHiveTableAsSelectCommand`, `CreateDataSourceTableAsSelectCommand` :
The two commands invokes other commands to write data out. The invoked commands can even write to non file-based data source. We leave them as future TODO.
#### How to update metrics of writing files out
A `RunnableCommand` which wants to update metrics, needs to override its `metrics` and provide the metrics data structure to `ExecutedCommandExec`.
The metrics are prepared during the execution of `FileFormatWriter`. The callback function passed to `FileFormatWriter` will accept the metrics and update accordingly.
There is a metrics updating function in `RunnableCommand`. In runtime, the function will be bound to the spark context and `metrics` of `ExecutedCommandExec` and pass to `FileFormatWriter`.
## How was this patch tested?
Updated unit tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18159 from viirya/SPARK-20703-2.
## What changes were proposed in this pull request?
Stopping query while it is being initialized can throw interrupt exception, in which case temporary checkpoint directories will not be deleted, and the test will fail.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#18442 from tdas/DatastreamReaderWriterSuite-fix.
## What changes were proposed in this pull request?
Corrects offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have own SparkPlan implementations and cause EXCHANGE in writes.
## How was this patch tested?
Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
Author: Sumedh Wale <swale@snappydata.io>
Closes#18535 from sumwale/SPARK-21312.
### What changes were proposed in this pull request?
This PR removes SQLConf parameters from the optimizer rules
### How was this patch tested?
The existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18533 from gatorsmile/rmSQLConfOptimizer.
## What changes were proposed in this pull request?
This PR uses `runUninterruptibly` to avoid that the clean up codes in StreamExecution is interrupted. It also removes an optimization in `runUninterruptibly` to make sure this method never throw `InterruptedException`.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18461 from zsxwing/SPARK-21248.
### What changes were proposed in this pull request?
This PR is to remove SQLConf parameters from the parser-related classes.
### How was this patch tested?
The existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18531 from gatorsmile/rmSQLConfParser.
## What changes were proposed in this pull request?
Support register Java UDAFs in PySpark so that user can use Java UDAF in PySpark. Besides that I also add api in `UDFRegistration`
## How was this patch tested?
Unit test is added
Author: Jeff Zhang <zjffdu@apache.org>
Closes#17222 from zjffdu/SPARK-19439.
## What changes were proposed in this pull request?
support to create [temporary] function with the keyword 'OR REPLACE' and 'IF NOT EXISTS'
## How was this patch tested?
manual test and added test cases
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
Closes#17681 from ouyangxiaochen/spark-419.
### What changes were proposed in this pull request?
It is strange to see the following error message. Actually, the column is from another table.
```
cannot resolve '`right.a`' given input columns: [a, c, d];
```
After the PR, the error message looks like
```
cannot resolve '`right.a`' given input columns: [left.a, right.c, right.d];
```
### How was this patch tested?
Added a test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18520 from gatorsmile/removeSQLConf.
## What changes were proposed in this pull request?
`SessionState` is designed to be created lazily. However, in reality, it created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
This PR aims to recover the lazy behavior by keeping the options into `initialSessionOptions`. The benefit is like the following. Users can start `spark-shell` and use RDD operations without any problems.
**BEFORE**
```scala
$ bin/spark-shell
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
...
Caused by: org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.security.AccessControlException:
Permission denied: user=spark, access=READ,
inode="/apps/hive/warehouse":hive:hdfs:drwx------
```
As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user.
**AFTER**
```scala
$ bin/spark-shell
...
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.range(0, 10, 1).count()
res0: Long = 10
```
## How was this patch tested?
Manual.
This closes#18512 .
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#18501 from dongjoon-hyun/SPARK-20256.
## What changes were proposed in this pull request?
when creating table like following:
> create table timestamp_test(id int(11), time_stamp timestamp not null default current_timestamp);
The result of Excuting "insert into timestamp_test values (111, null)" is different between Spark and JDBC.
```
mysql> select * from timestamp_test;
+------+---------------------+
| id | time_stamp |
+------+---------------------+
| 111 | 1970-01-01 00:00:00 | -> spark
| 111 | 2017-06-27 19:32:38 | -> mysql
+------+---------------------+
2 rows in set (0.00 sec)
```
Because in such case ```StructField.nullable``` is false, so the generated codes of ```InvokeLike``` and ```BoundReference``` don't check whether the field is null or not. Instead, they directly use ```CodegenContext.INPUT_ROW.getLong(1)```, however, ```UnsafeRow.setNullAt(1)``` will put 0 in the underlying memory.
The PR will ```always``` set ```StructField.nullable``` true after obtaining metadata from jdbc connection, Since we can insert null to not null timestamp column in MySQL. In this way, spark will propagate null to underlying DB engine, and let DB to choose how to process NULL.
## How was this patch tested?
Added tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: YIHAODIAN\wangshuangshuang <wangshuangshuang@yihaodian.com>
Author: Shuangshuang Wang <wsszone@gmail.com>
Closes#18445 from shuangshuangwang/SPARK-19726.
### What changes were proposed in this pull request?
SQLConf is moved to Catalyst. We are adding more and more test cases for verifying the conf-specific behaviors. It is nice to add a helper function to simplify the test cases.
### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18469 from gatorsmile/withSQLConf.
## What changes were proposed in this pull request?
If the created ACTIVE sparkContext is not EXPLICITLY passed through the Builder's API `sparkContext()`, the conf of this sparkContext will also contain the conf set through the API `config()`; otherwise, the conf of this sparkContext will NOT contain the conf set through the API `config()`
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18517 from gatorsmile/fixTestCase2.
## What changes were proposed in this pull request?
Looking at the code in `SessionCatalog.registerFunction`, the parameter `ignoreIfExists` is a wrong name. When `ignoreIfExists` is true, we will override the function if it already exists. So `overrideIfExists` should be the corrected name.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18510 from cloud-fan/minor.
## What changes were proposed in this pull request?
This pr added code to print the same warning messages with `===` cases when using NULL-safe equals (`<=>`).
## How was this patch tested?
Existing tests.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18436 from maropu/SPARK-20073.
### Idea
This PR adds validation to REFRESH sql statements. Currently, users can specify whatever they want as resource path. For example, spark.sql("REFRESH ! $ !") will be executed without any exceptions.
### Implementation
I am not sure that my current implementation is the most optimal, so any feedback is appreciated. My first idea was to make the grammar as strict as possible. Unfortunately, there were some problems. I tried the approach below:
SqlBase.g4
```
...
| REFRESH TABLE tableIdentifier #refreshTable
| REFRESH resourcePath #refreshResource
...
resourcePath
: STRING
| (IDENTIFIER | number | nonReserved | '/' | '-')+ // other symbols can be added if needed
;
```
It is not flexible enough and requires to explicitly mention all possible symbols. Therefore, I came up with the current approach that is implemented in the code.
Let me know your opinion on which one is better.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#18368 from aokolnychyi/spark-21102.
## What changes were proposed in this pull request?
Move `compileValue` method in JDBCRDD to JdbcDialect, and override the `compileValue` method in OracleDialect to rewrite the Oracle-specific timestamp and date literals in where clause.
## How was this patch tested?
An integration test has been added.
Author: Rui Zha <zrdt713@gmail.com>
Author: Zharui <zrdt713@gmail.com>
Closes#18451 from SharpRay/extend-compileValue-to-dialects.
## What changes were proposed in this pull request?
OutputFakerExec was added long ago and is not used anywhere now so we should remove it.
## How was this patch tested?
N/A
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#18473 from jiangxb1987/OutputFakerExec.
## What changes were proposed in this pull request?
We currently implement statistics propagation directly in logical plan. Given we already have two different implementations, it'd make sense to actually decouple the two and add stats propagation using mixin. This would reduce the coupling between logical plan and statistics handling.
This can also be a powerful pattern in the future to add additional properties (e.g. constraints).
## How was this patch tested?
Should be covered by existing test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#18479 from rxin/stats-trait.
## What changes were proposed in this pull request?
Update stats after the following data changing commands:
- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
- AlterTableSetLocationCommand
- AlterTableDropPartitionCommand
## How was this patch tested?
Added new test cases.
Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes#18334 from wzhfy/changeStatsForOperation.
## What changes were proposed in this pull request?
For performance reasons, `UnsafeRow.getString`, `getStruct`, etc. return a "pointer" that points to a memory region of this unsafe row. This makes the unsafe projection a little dangerous, because all of its output rows share one instance.
When we implement SQL operators, we should be careful to not cache the input rows because they may be produced by unsafe projection from child operator and thus its content may change overtime.
However, when we updating values of InternalRow(e.g. in mutable projection and safe projection), we only copy UTF8String, we should also copy InternalRow, ArrayData and MapData. This PR fixes this, and also fixes the copy of vairous InternalRow, ArrayData and MapData implementations.
## How was this patch tested?
new regression tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18483 from cloud-fan/fix-copy.
## What changes were proposed in this pull request?
Remove `numHashCollisions` in `BytesToBytesMap`. And change `getAverageProbesPerLookup()` to `getAverageProbesPerLookup` as suggested.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18480 from viirya/SPARK-21052-followup.
### What changes were proposed in this pull request?
Function argument should not be named expressions. It could cause two issues:
- Misleading error message
- Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser.
```
spark-sql> select count(distinct c1, distinct c2) from t1;
Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26;
'Project [unresolvedalias('count(c1#30, 'distinct), None)]
+- SubqueryAlias t1
+- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31]
```
After the fix, the error message becomes
```
spark-sql> select count(distinct c1, distinct c2) from t1;
Error in query:
extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35)
== SQL ==
select count(distinct c1, distinct c2) from t1
-----------------------------------^^^
```
### How was this patch tested?
Added a test case to parser suite.
Author: Xiao Li <gatorsmile@gmail.com>
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18338 from gatorsmile/parserDistinctAggFunc.
## What changes were proposed in this pull request?
`WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) are reused for other things. This could happen in window functions when it starts spilling to disk. When reading the back the spill files the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenario's. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects point to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909, after this PR Spark spills more eagerly.
This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse at the expensive of a little bit of performance. We will follow-up with a more subtle solution at a later point.
## How was this patch tested?
Added a regression test to `DataFrameWindowFunctionsSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#18470 from hvanhovell/SPARK-21258.
## What changes were proposed in this pull request?
This adds the average hash map probe metrics to join operator such as `BroadcastHashJoin` and `ShuffledHashJoin`.
This PR adds the API to `HashedRelation` to get average hash map probe.
## How was this patch tested?
Related test cases are added.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18301 from viirya/SPARK-21052.
## What changes were proposed in this pull request?
Same with SPARK-20985.
Fix code style for constructing and stopping a `SparkContext`. Assure the context is stopped to avoid other tests complain that there's only one `SparkContext` can exist.
Author: jinxing <jinxing6042@126.com>
Closes#18454 from jinxing64/SPARK-21240.
## What changes were proposed in this pull request?
This is kind of another follow-up for https://github.com/apache/spark/pull/18064 .
In #18064 , we wrap every SQL command with SQL execution, which makes nested SQL execution very likely to happen. #18419 trid to improve it a little bit, by introduing `SQLExecition.ignoreNestedExecutionId`. However, this is not friendly to data source developers, they may need to update their code to use this `ignoreNestedExecutionId` API.
This PR proposes a new solution, to just allow nested execution. The downside is that, we may have multiple executions for one query. We can improve this by updating the data organization in SQLListener, to have 1-n mapping from query to execution, instead of 1-1 mapping. This can be done in a follow-up.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18450 from cloud-fan/execution-id.
## What changes were proposed in this pull request?
Invalidate spark's stats after data changing commands:
- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
- AlterTableSetLocationCommand
- AlterTableDropPartitionCommand
## How was this patch tested?
Added test cases.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#18449 from wzhfy/removeStats.
## What changes were proposed in this pull request?
`QueryPlan.preCanonicalized` is only overridden in a few places, and it does introduce an extra concept to `QueryPlan` which may confuse people.
This PR removes it and override `canonicalized` in these places
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18440 from cloud-fan/minor.
## What changes were proposed in this pull request?
This PR proposes to support a DDL-formetted string as schema as below:
```r
mockLines <- c("{\"name\":\"Michael\"}",
"{\"name\":\"Andy\", \"age\":30}",
"{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)
df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
```
## How was this patch tested?
Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18431 from HyukjinKwon/r-ddl-schema.
## What changes were proposed in this pull request?
If someone creates a HiveSession, the planner in `IncrementalExecution` doesn't take into account the Hive scan strategies. This causes joins of Streaming DataFrame's with Hive tables to fail.
## How was this patch tested?
Regression test
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#18426 from brkyvz/hive-join.
## What changes were proposed in this pull request?
The issue happens in `ExternalMapToCatalyst`. For example, the following codes create `ExternalMapToCatalyst` to convert Scala Map to catalyst map format.
val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
val ds = spark.createDataset(data)
The `valueConverter` in `ExternalMapToCatalyst` looks like:
if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.
Because `ExternalMapToCatalyst_value52` are local variable, when `CreateNamedStruct` splits expressions to individual functions, the local variable can't be accessed anymore.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18418 from viirya/SPARK-19104.
## What changes were proposed in this pull request?
in https://github.com/apache/spark/pull/18064, to work around the nested sql execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution id.
However, this approach has poor expansibility. When we hit other nested execution id cases, we may need to add more internal methods in `Dataset`.
Our goal is to ignore the nested execution id in some cases, and we can have a better approach to achieve this goal, by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place which needs to ignore the nested execution, we can just wrap the action with `SQLExecution.ignoreNestedExecutionId`, and this is more expansible than the previous approach.
The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by rdblue
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#18419 from cloud-fan/follow.
## What changes were proposed in this pull request?
Time windowing in Spark currently performs an Expand + Filter, because there is no way to guarantee the amount of windows a timestamp will fall in, in the general case. However, for tumbling windows, a record is guaranteed to fall into a single bucket. In this case, doubling the number of records with Expand is wasteful, and can be improved by using a simple Projection instead.
Benchmarks show that we get an order of magnitude performance improvement after this patch.
## How was this patch tested?
Existing unit tests. Benchmarked using the following code:
```scala
import org.apache.spark.sql.functions._
spark.time {
spark.range(numRecords)
.select(from_unixtime((current_timestamp().cast("long") * 1000 + 'id / 1000) / 1000) as 'time)
.select(window('time, "10 seconds"))
.count()
}
```
Setup:
- 1 c3.2xlarge worker (8 cores)
![image](https://user-images.githubusercontent.com/5243515/27348748-ed991b84-55a9-11e7-8f8b-6e7abc524417.png)
1 B rows ran in 287 seconds after this optimization. I didn't wait for it to finish without the optimization. Shows about 5x improvement for large number of records.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#18364 from brkyvz/opt-tumble.
## What changes were proposed in this pull request?
Storage URI of a partitioned table may or may not point to a directory under which individual partitions are stored. In fact, individual partitions may be located in totally unrelated directories. Before this change, ANALYZE TABLE table COMPUTE STATISTICS command calculated total size of a table by adding up sizes of files found under table's storage URI. This calculation could produce 0 if partitions are stored elsewhere.
This change uses storage URIs of individual partitions to calculate the sizes of all partitions of a table and adds these up to produce the total size of a table.
CC: wzhfy
## How was this patch tested?
Added unit test.
Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and verified that sizeInBytes is calculated correctly. Before this change, the size would be zero.
Author: Masha Basmanova <mbasmanova@fb.com>
Closes#18309 from mbasmanova/mbasmanova-analyze-part-table.
### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet
INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
SELECT custom_fields.id, custom_fields.value FROM tab1
```
The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast.
### How was this patch tested?
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18412 from gatorsmile/castStruct.
## What changes were proposed in this pull request?
This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830
When merging this PR, please give the credit to gaborfeher
## How was this patch tested?
Added a test case to OracleIntegrationSuite.scala
Author: Gabor Feher <gabor.feher@lynxanalytics.com>
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18408 from gatorsmile/OracleType.
## What changes were proposed in this pull request?
This pr supported a DDL-formatted string in `DataStreamReader.schema`.
This fix could make users easily define a schema without importing the type classes.
For example,
```scala
scala> spark.readStream.schema("col0 INT, col1 DOUBLE").load("/tmp/abc").printSchema()
root
|-- col0: integer (nullable = true)
|-- col1: double (nullable = true)
```
## How was this patch tested?
Added tests in `DataStreamReaderWriterSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18373 from HyukjinKwon/SPARK-20431.
## What changes were proposed in this pull request?
`isTableSample` and `isGenerated ` were introduced for SQL Generation respectively by https://github.com/apache/spark/pull/11148 and https://github.com/apache/spark/pull/11050
Since SQL Generation is removed, we do not need to keep `isTableSample`.
## How was this patch tested?
The existing test cases
Author: Xiao Li <gatorsmile@gmail.com>
Closes#18379 from gatorsmile/CleanSample.
## What changes were proposed in this pull request?
Currently we do a lot of validations for subquery in the Analyzer. We should move them to CheckAnalysis which is the framework to catch and report Analysis errors. This was mentioned as a review comment in SPARK-18874.
## How was this patch tested?
Exists tests + A few tests added to SQLQueryTestSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#17713 from dilipbiswal/subquery_checkanalysis.
## What changes were proposed in this pull request?
* Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`.
* Filled in some missing parentheses
## How was this patch tested?
N/A
Author: Ong Ming Yang <me@ongmingyang.com>
Closes#18398 from ongmingyang/master.