## What changes were proposed in this pull request?
This PR changes the codes to lazily init hive metastore client so that we can create SparkSession without talking to the hive metastore sever.
It's pretty helpful when you set a hive metastore server but it's down. You can still start the Spark shell to debug.
## How was this patch tested?
The new unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18944 from zsxwing/hive-lazy-init.
## What changes were proposed in this pull request?
This PR adds `since` annotation in documentation so that this can be rendered as below:
<img width="290" alt="2017-08-14 6 54 26" src="https://user-images.githubusercontent.com/6477701/29267050-034c1f64-8122-11e7-862b-7dfc38e292bf.png">
## How was this patch tested?
Manually checked the documentation by `cd sql && ./create-docs.sh`.
Also, Jenkins tests are required.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18939 from HyukjinKwon/add-sinces-date-functions.
## What changes were proposed in this pull request?
We put staging path to delete into the deleteOnExit cache of `FileSystem` in case of the path can't be successfully removed. But when we successfully remove the path, we don't remove it from the cache. We should do it to avoid continuing grow the cache size.
## How was this patch tested?
Added a test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18934 from viirya/SPARK-21721.
## What changes were proposed in this pull request?
Directly writing a snapshot file may generate a partial file. This PR changes it to write to a temp file then rename to the target file.
## How was this patch tested?
Jenkins.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18928 from zsxwing/SPARK-21696.
## What changes were proposed in this pull request?
Recently, we have also encountered such NPE issues in our production environment as described in:
https://issues.apache.org/jira/browse/SPARK-19471
This issue can be reproduced by the following examples:
` val df = spark.createDataFrame(Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4))).toDF("x", "y")
//HashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(),sum("y")).show()
//ObjectHashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false
df.groupBy("x").agg(rand(),collect_list("y")).show()
//SortAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false &&SQLConf.USE_OBJECT_HASH_AGG.key=false
df.groupBy("x").agg(rand(),collect_list("y")).show()`
`
This PR is based on PR-16820(https://github.com/apache/spark/pull/16820) with test cases for all aggregation paths. We want to push it forward.
> When AggregationIterator generates result projection, it does not call the initialize method of the Projection class. This will cause a runtime NullPointerException when the projection involves nondeterministic expressions.
## How was this patch tested?
unit test
verified in production environment
Author: donnyzone <wellfengzhu@gmail.com>
Closes#18920 from DonnyZone/Branch-spark-19471.
## What changes were proposed in this pull request?
At present, in test("broadcasted hash outer join operator selection") case, set the testData2 to _CACHE TABLE_, but no _uncache table_ testData2. It can make people confused.
In addition, in the joinsuite test cases, clear the cache table of work by SharedSQLContext _spark.sharedState.cacheManager.clearCache_ to do, so we do not need to uncache table
let's fix it. thanks.
## How was this patch tested?
Existing test cases.
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes#18914 from heary-cao/uncache_table.
## What changes were proposed in this pull request?
While discovering optimization rules and their test coverage, I did not find any tests for `CheckCartesianProducts` in the Catalyst folder. So, I decided to create a new test suite. Once I finished, I found a test in `JoinSuite` for this functionality so feel free to discard this change if it does not make much sense. The proposed test suite covers a few additional use cases.
Author: aokolnychyi <anton.okolnychyi@sap.com>
Closes#18909 from aokolnychyi/check-cartesian-join-tests.
## What changes were proposed in this pull request?
Jira : https://issues.apache.org/jira/browse/SPARK-19122
`leftKeys` and `rightKeys` in `SortMergeJoinExec` are altered based on the ordering of join keys in the child's `outputPartitioning`. This is done everytime `requiredChildDistribution` is invoked during query planning.
## How was this patch tested?
- Added new test case
- Existing tests
Author: Tejas Patil <tejasp@fb.com>
Closes#16985 from tejasapatil/SPARK-19122_join_order_shuffle.
## What changes were proposed in this pull request?
[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre https://github.com/apache/spark/pull/16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers).
Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control.
## How was this patch tested?
Added unit tests
Author: Tejas Patil <tejasp@fb.com>
Closes#18843 from tejasapatil/SPARK-21595.
Add an option to the JDBC data source to initialize the environment of the remote database session
## What changes were proposed in this pull request?
This proposes an option to the JDBC datasource, tentatively called " sessionInitStatement" to implement the functionality of session initialization present for example in the Sqoop connector for Oracle (see https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements ) . After each database session is opened to the remote DB, and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block in the case of Oracle).
See also https://issues.apache.org/jira/browse/SPARK-21519
## How was this patch tested?
Manually tested using Spark SQL data source and Oracle JDBC
Author: LucaCanali <luca.canali@cern.ch>
Closes#18724 from LucaCanali/JDBC_datasource_sessionInitStatement.
## What changes were proposed in this pull request?
This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog. getTableOption.
## How was this patch tested?
Removed the test case.
Author: Reynold Xin <rxin@databricks.com>
Closes#18912 from rxin/remove-getTableOption.
## What changes were proposed in this pull request?
This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a `FileFormatWriter` job and partially reimplements SPARK-20703 in terms of it.
The interface basically consists of 3 traits:
- `WriteTaskStats`: just a tag for classes that represent statistics collected during a `WriteTask`
The only constraint it adds is that the class should be `Serializable`, as instances of it will be collected on the driver from all executors at the end of the `WriteJob`.
- `WriteTaskStatsTracker`: a trait for classes that can actually compute statistics based on tuples that are processed by a given `WriteTask` and eventually produce a `WriteTaskStats` instance.
- `WriteJobStatsTracker`: a trait for classes that act as containers of `Serializable` state that's necessary for instantiating `WriteTaskStatsTracker` on executors and finally process the resulting collection of `WriteTaskStats`, once they're gathered back on the driver.
Potential future use of this interface is e.g. CBO stats maintenance during `INSERT INTO table ... ` operations.
## How was this patch tested?
Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc.
Author: Adrian Ionescu <adrian@databricks.com>
Closes#18884 from adrian-ionescu/write-stats-tracker-api.
## What changes were proposed in this pull request?
Currently `df.na.replace("*", Map[String, String]("NULL" -> null))` will produce exception.
This PR enables passing null/None as value in the replacement map in DataFrame.replace().
Note that the replacement map keys and values should still be the same type, while the values can have a mix of null/None and that type.
This PR enables following operations for example:
`df.na.replace("*", Map[String, String]("NULL" -> null))`(scala)
`df.na.replace("*", Map[Any, Any](60 -> null, 70 -> 80))`(scala)
`df.na.replace('Alice', None)`(python)
`df.na.replace([10, 20])`(python, replacing with None is by default)
One use case could be: I want to replace all the empty strings with null/None because they were incorrectly generated and then drop all null/None data
`df.na.replace("*", Map("" -> null)).na.drop()`(scala)
`df.replace(u'', None).dropna()`(python)
## How was this patch tested?
Scala unit test.
Python doctest and unit test.
Author: bravo-zhang <mzhang1230@gmail.com>
Closes#18820 from bravo-zhang/spark-14932.
## What changes were proposed in this pull request?
Push filter predicates through EventTimeWatermark if they're deterministic and do not reference the watermarked attribute. (This is similar but not identical to the logic for pushing through UnaryNode.)
## How was this patch tested?
unit tests
Author: Jose Torres <joseph-torres@databricks.com>
Closes#18790 from joseph-torres/SPARK-21587.
## What changes were proposed in this pull request?
This PR is to add the spark version info in the table metadata. When creating the table, this value is assigned. It can help users find which version of Spark was used to create the table.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18709 from gatorsmile/addVersion.
## What changes were proposed in this pull request?
Window rangeBetween() API should allow literal boundary, that means, the window range frame can calculate frame of double/date/timestamp.
Example of the use case can be:
```
SELECT
val_timestamp,
cate,
avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING)
FROM testData
```
This PR refactors the Window `rangeBetween` and `rowsBetween` API, while the legacy user code should still be valid.
## How was this patch tested?
Add new test cases both in `DataFrameWindowFunctionsSuite` and in `window.sql`.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#18814 from jiangxb1987/literal-boundary.
## What changes were proposed in this pull request?
When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query doesn't generate a batch metadata file, this behavior will hide it and allow the query continuing to run and finally delete metadata logs and make it hard to debug.
This PR ensures that places calling HDFSMetadataLog.get always check the return value.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18799 from zsxwing/SPARK-21596.
## What changes were proposed in this pull request?
Taking over https://github.com/apache/spark/pull/18789 ; Closes#18789
Update Jackson to 2.6.7 uniformly, and some components to 2.6.7.1, to get some fixes and prep for Scala 2.12
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#18881 from srowen/SPARK-20433.
## What changes were proposed in this pull request?
If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset.
A reproducible case looks like:
object C {
type TwoInt = (Int, Int)
def tupleTypeAlias: TwoInt = (1, 1)
}
Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))
It throws an exception like:
type T1 is not a class
scala.ScalaReflectionException: type T1 is not a class
at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
...
This patch accesses the dealias of type in many places in `ScalaReflection` to fix it.
## How was this patch tested?
Added test case.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18813 from viirya/SPARK-21567.
## What changes were proposed in this pull request?
This commit adds a new argument for IllegalArgumentException message. This recent commit added the argument:
[dcac1d57f0)
## How was this patch tested?
Unit test have been passed
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Marcos P. Sanchez <mpenate@stratio.com>
Closes#18862 from mpenate/feature/exception-errorifexists.
### What changes were proposed in this pull request?
```SQL
CREATE TABLE mytesttable1
USING org.apache.spark.sql.jdbc
OPTIONS (
url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
dbtable 'mytesttable1',
paritionColumn 'state_id',
lowerBound '0',
upperBound '52',
numPartitions '53',
fetchSize '10000'
)
```
The above option name `paritionColumn` is wrong. That mean, users did not provide the value for `partitionColumn`. In such case, users hit a confusing error.
```
AssertionError: assertion failed
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
```
### How was this patch tested?
Added a test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#18864 from gatorsmile/jdbcPartCol.
## What changes were proposed in this pull request?
Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes.
## How was this patch tested?
new unit test, which was verified to fail before the fix
Author: Jose Torres <joseph-torres@databricks.com>
Closes#18840 from joseph-torres/SPARK-21565.
## What changes were proposed in this pull request?
Enhanced some existing documentation
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Mac <maclockard@gmail.com>
Closes#18710 from maclockard/maclockard-patch-1.
### What changes were proposed in this pull request?
author: BoleynSu
closes https://github.com/apache/spark/pull/18836
```Scala
val df = Seq((1, 1)).toDF("i", "j")
df.createOrReplaceTempView("T")
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " +
"cross join T t2 where t2.i = t1.i").explain(true)
}
```
The above code could cause the following exception:
```
SortMergeJoinExec should not take Cross as the JoinType
java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType
at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100)
```
Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue.
### How was this patch tested?
Modified the two existing test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Author: Boleyn Su <boleyn.su@gmail.com>
Closes#18863 from gatorsmile/pr-18836.
## What changes were proposed in this pull request?
**For moudle below:**
common/network-common
streaming
sql/core
sql/catalyst
**tests.jar will install or deploy twice.Like:**
`[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
[INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
[DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged`
**The reason is below:**
`[DEBUG] (f) artifact = org.apache.spark:spark-streaming_2.11🫙2.1.0-mdh2.1.0.1-SNAPSHOT
[DEBUG] (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11🫙tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark
-streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
-mdh2.1.0.1-SNAPSHOT]`
when executing 'mvn deploy' to nexus during release.I will fail since release nexus can not be overrided.
## How was this patch tested?
Execute 'mvn clean install -Pyarn -Phadoop-2.6 -Phadoop-provided -DskipTests'
Author: zhoukang <zhoukang199191@gmail.com>
Closes#18745 from caneGuy/zhoukang/fix-installtwice.
## What changes were proposed in this pull request?
Remove duplicate test-jar:test spark-sql dependency from Hive module; move test-jar dependencies together logically. This generates a big warning at the start of the Maven build otherwise.
## How was this patch tested?
Existing build. No functional changes here.
Author: Sean Owen <sowen@cloudera.com>
Closes#18858 from srowen/DupeSqlTestDep.
I have discovered that "full_outer" name option is working in Spark 2.0, but it is not printed in exception. Please verify.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: BartekH <bartekhamielec@gmail.com>
Closes#17985 from BartekH/patch-1.
## What changes were proposed in this pull request?
This pr (follow-up of #18772) used `UnresolvedSubqueryColumnAliases` for `visitTableName` in `AstBuilder`, which is a new unresolved `LogicalPlan` implemented in #18185.
## How was this patch tested?
Existing tests
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18857 from maropu/SPARK-20963-FOLLOWUP.
## What changes were proposed in this pull request?
Since Spark 2.0.0, SET hive config commands do not pass the values to HiveClient, this PR point out user to set hive config before SparkSession is initialized when they try to set hive config.
## How was this patch tested?
manual tests
<img width="1637" alt="spark-set" src="https://user-images.githubusercontent.com/5399861/29001141-03f943ee-7ab3-11e7-8584-ba5a5e81f6ad.png">
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18769 from wangyum/SPARK-21574.
## What changes were proposed in this pull request?
In SQLContext.get(key,null) for a key that is not defined in the conf, and doesn't have a default value defined, throws a NPE. Int happens only when conf has a value converter
Added null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue)
## How was this patch tested?
Added unit test
Author: vinodkc <vinod.kc.in@gmail.com>
Closes#18852 from vinodkc/br_Fix_SPARK-21588.
## What changes were proposed in this pull request?
This pr added parsing rules to support column aliases for join relations in FROM clause.
This pr is a sub-task of #18079.
## How was this patch tested?
Added tests in `AnalysisSuite`, `PlanParserSuite,` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18772 from maropu/SPARK-20963-2.
## What changes were proposed in this pull request?
When we use `bin/spark-sql` command configuring `--conf spark.hadoop.foo=bar`, the `SparkSQLCliDriver` initializes an instance of hiveconf, it does not add `foo->bar` to it.
this pr gets `spark.hadoop.*` properties from sysProps to this hiveconf
## How was this patch tested?
UT
Author: hzyaoqin <hzyaoqin@corp.netease.com>
Author: Kent Yao <yaooqinn@hotmail.com>
Closes#18668 from yaooqinn/SPARK-21451.
## What changes were proposed in this pull request?
This PR includes the changes to make the string "errorifexists" also valid for ErrorIfExists save mode.
## How was this patch tested?
Unit tests and manual tests
Author: arodriguez <arodriguez@arodriguez.stratio>
Closes#18844 from ardlema/SPARK-21640.
## What changes were proposed in this pull request?
This PR proposes to separate `extended` into `examples` and `arguments` internally so that both can be separately documented and add `since` and `note` for additional information.
For `since`, it looks users sometimes get confused by, up to my knowledge, missing version information. For example, see https://www.mail-archive.com/userspark.apache.org/msg64798.html
For few good examples to check the built documentation, please see both:
`from_json` - https://spark-test.github.io/sparksqldoc/#from_json
`like` - https://spark-test.github.io/sparksqldoc/#like
For `DESCRIBE FUNCTION`, `note` and `since` are added as below:
```
> DESCRIBE FUNCTION EXTENDED rlike;
...
Extended Usage:
Arguments:
...
Examples:
...
Note:
Use LIKE to match with simple string pattern
```
```
> DESCRIBE FUNCTION EXTENDED to_json;
...
Examples:
...
Since: 2.2.0
```
For the complete documentation, see https://spark-test.github.io/sparksqldoc/
## How was this patch tested?
Manual tests and existing tests. Please see https://spark-test.github.io/sparksqldoc
Jenkins tests are needed to double check
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#18749 from HyukjinKwon/followup-sql-doc-gen.
## What changes were proposed in this pull request?
create temporary view data as select * from values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2)
as data(a, b);
`select 3, 4, sum(b) from data group by 1, 2;`
`select 3 as c, 4 as d, sum(b) from data group by c, d;`
When running these two cases, the following exception occurred:
`Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`
The cause of this failure:
If an aggregateExpression is integer, after replaced with this aggregateExpression, the
groupExpression still considered as an ordinal.
The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
## How was this patch tested?
Added unit test case
Author: liuxian <liu.xian3@zte.com.cn>
Closes#18779 from 10110346/groupby.
## What changes were proposed in this pull request?
This PR replaces #18623 to do some clean up.
Closes#18623
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Author: Andrey Taptunov <taptunov@amazon.com>
Closes#18848 from zsxwing/review-pr18623.
## What changes were proposed in this pull request?
OneRowRelation is the only plan that is a case object, which causes some issues with makeCopy using a 0-arg constructor. This patch changes it from a case object to a case class.
This blocks SPARK-21619.
## How was this patch tested?
Should be covered by existing test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#18839 from rxin/SPARK-21634.
## What changes were proposed in this pull request?
Hive `pmod(3.13, 0)`:
```:sql
hive> select pmod(3.13, 0);
OK
NULL
Time taken: 2.514 seconds, Fetched: 1 row(s)
hive>
```
Spark `mod(3.13, 0)`:
```:sql
spark-sql> select mod(3.13, 0);
NULL
spark-sql>
```
But the Spark `pmod(3.13, 0)`:
```:sql
spark-sql> select pmod(3.13, 0);
17/06/25 09:35:58 ERROR SparkSQLDriver: Failed in [select pmod(3.13, 0)]
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.Pmod.pmod(arithmetic.scala:504)
at org.apache.spark.sql.catalyst.expressions.Pmod.nullSafeEval(arithmetic.scala:432)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:323)
...
```
This PR make `pmod(number, 0)` to null.
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#18413 from wangyum/SPARK-21205.
## What changes were proposed in this pull request?
An overflow of the difference of bounds on the partitioning column leads to no data being read. This
patch checks for this overflow.
## How was this patch tested?
New unit test.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#18800 from aray/SPARK-21330.
## What changes were proposed in this pull request?
In case of datasource tables (when they are stored in non-hive compatible way) , the schema information is recorded as table properties in hive meta-store. The alterTableStats method needs to get the schema information from table properties for data source tables before recording the column level statistics. Currently, we don't get the correct schema information and fail with java.util.NoSuchElement exception.
## How was this patch tested?
A new test case is added in StatisticsSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#18804 from dilipbiswal/datasource_stats.
## What changes were proposed in this pull request?
Error class name for log in several classes. such as:
`2017-08-02 16:43:37,695 INFO CompositeService: Operation log root directory is created: /tmp/mr/operation_logs`
`Operation log root directory is created ... ` is in `SessionManager.java` actually.
## How was this patch tested?
manual tests
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes#18816 from zuotingbing/SPARK-21611.
## What changes were proposed in this pull request?
if the object extends Logging, i suggest to remove the var LOG which is useless.
## How was this patch tested?
Exist tests
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Closes#18811 from zuotingbing/SPARK-21604.
## What changes were proposed in this pull request?
When the watermark is not a column of `dropDuplicates`, right now it will crash. This PR fixed this issue.
## How was this patch tested?
The new unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18822 from zsxwing/SPARK-21546.
## What changes were proposed in this pull request?
This PR fixed a potential overflow issue in EventTimeStats.
## How was this patch tested?
The new unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#18803 from zsxwing/avg.
## What changes were proposed in this pull request?
Currently, StructType.merge() only reports data types of conflicting fields when merging two incompatible schemas. It would be nice to also report the field names for easier debugging.
## How was this patch tested?
Unit test in DataTypeSuite.
Print exception message when conflict is triggered.
Author: bravo-zhang <mzhang1230@gmail.com>
Closes#16365 from bravo-zhang/spark-18950.
## What changes were proposed in this pull request?
When BytesToBytesMap spills, its longArray should be released. Otherwise, it may not released until the task complete. This array may take a significant amount of memory, which cannot be used by later operator, such as UnsafeShuffleExternalSorter, resulting in more frequent spill in sorter. This patch release the array as destructive iterator will not use this array anymore.
## How was this patch tested?
Manual test in production
Author: Zhan Zhang <zhanzhang@fb.com>
Closes#17180 from zhzhan/memory.
## What changes were proposed in this pull request?
The format of none should be consistent with other compression codec(\`snappy\`, \`lz4\`) as \`none\`.
## How was this patch tested?
This is a typo.
Author: GuoChenzhao <chenzhao.guo@intel.com>
Closes#18758 from gczsjdy/typo.
## What changes were proposed in this pull request?
This pr added parsing rules to support subquery column aliases in FROM clause.
This pr is a sub-task of #18079.
## How was this patch tested?
Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#18185 from maropu/SPARK-20962.
## What changes were proposed in this pull request?
Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this.
Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.
This PR is mostly based on Herman's previous amazing work: 596f53c339
After this been merged, we can close#16818 .
## How was this patch tested?
Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#18540 from jiangxb1987/rangeFrame.