### What changes were proposed in this pull request?
Fixing identification of timed-out pending pod requests as excess requests to delete when the excess is higher than the newly created timed out requests and there is some non-timed out newly created requests too.
### Why are the changes needed?
After https://github.com/apache/spark/pull/29981 only timed out newly created requests and timed out pending requests are taken as excess request.
But there is small bug when the excess is higher than the newly created timed out requests and there is some non-timed out newly created requests as well. Because all the newly created requests are counted as excess request when items are chosen from the timed out pod pending requests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
There is new unit test added: `SPARK-34334: correctly identify timed out pending pod requests as excess`.
Closes#31445 from attilapiros/SPARK-34334.
Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Increase the rel tol from 0.2 to 0.35.
### Why are the changes needed?
Fix flaky test
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT.
Closes#31536 from WeichenXu123/ES-65815.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When doing https://issues.apache.org/jira/browse/SPARK-34399 based on https://github.com/apache/spark/pull/31471
Found FileBatchWrite will use `FileFormatWrite.processStates()` too. We need log commit duration in other writer too.
In this pr:
1. Extract a commit job method in SparkHadoopWriter
2. address other commit writer
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Closes#31520 from AngersZhuuuu/SPARK-34355-followup.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR aims to fix `KubernetesClusterSchedulerBackend.stop` to wrap `super.stop` with `Utils.tryLogNonFatalError`.
### Why are the changes needed?
[CoarseGrainedSchedulerBackend.stop](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L559) may throw `SparkException` and this causes K8s resource (pod and configmap) leakage.
### Does this PR introduce _any_ user-facing change?
No. This is a bug fix.
### How was this patch tested?
Pass the CI with the newly added test case.
Closes#31533 from dongjoon-hyun/SPARK-34407.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The getMetricsSnapshot method of the PrometheusServlet class has a wrong value, It should be taking the mean value but it's taking the max value.
### Why are the changes needed?
The mean value of timersLabels in the PrometheusServlet class is wrong, You can look at line 105 of this class: L105.
```
sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n")
```
it should be
```
sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMean}\n")
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
![image](https://user-images.githubusercontent.com/5170878/107313576-cc199280-6acd-11eb-9384-b6abf71c0f90.png)
Closes#31532 from 397090770/SPARK-34405.
Authored-by: wyp <wyphao.2007@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently, we pass the default value `EmptyRow` to method `checkEvaluation` in the `StringExpressionsSuite`, but the default value of the 'checkEvaluation' method parameter is the `emptyRow`.
We can clean the parameter for Code Simplifications.
### Why are the changes needed?
for Code Simplifications
**before**:
```
def testConcat(inputs: String*): Unit = {
val expected = if (inputs.contains(null)) null else inputs.mkString
checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected, EmptyRow)
}
```
**after**:
```
def testConcat(inputs: String*): Unit = {
val expected = if (inputs.contains(null)) null else inputs.mkString
checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected)
}
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or Github action.
Closes#31510 from yikf/master.
Authored-by: yikf <13468507104@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The current implement of `SQLQueryTestSuite` cannot run on windows system.
Becasue the code below will fail on windows system:
`assume(TestUtils.testCommandAvailable("/bin/bash"))`
For operation system that cannot support `/bin/bash`, we just skip some tests.
### Why are the changes needed?
SQLQueryTestSuite has a bug on windows system.
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
Jenkins test
Closes#31466 from beliefer/SPARK-34352.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use standard methods to extract `keys` or `values` from a `Map`, it's semantically consistent and use the `DefaultKeySet` and `DefaultValuesIterable` instead of a manual loop.
**Before**
```
map.map(_._1)
map.map(_._2)
```
**After**
```
map.keys
map.values
```
### Why are the changes needed?
Code Simpilefications.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31484 from LuciferYang/keys-and-values.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR is to enable AQE and DPP when the join is broadcast hash join at the beginning, which can benefit the performance improvement from DPP and AQE at the same time. This PR will make use of the result of build side and then insert the DPP filter into the probe side.
### Why are the changes needed?
Improve performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
adding new ut
Closes#31258 from JkSelf/supportDPP1.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to propagate the name used for registering UDFs to `ScalaUDF`, `ScalaUDAF` and `ScaalAggregator`.
Note that `PythonUDF` gets the name correctly: 466c045bfa/python/pyspark/sql/udf.py (L358-L359)
, and same for Hive UDFs:
466c045bfa/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala (L67)
### Why are the changes needed?
This PR can help in the following scenarios:
1) Better EXPLAIN output
2) By adding `def name: String` to `UserDefinedExpression`, we can match an expression by `UserDefinedExpression` and look up the catalog, an use case needed for #31273.
### Does this PR introduce _any_ user-facing change?
The EXPLAIN output involving udfs will be changed to use the name used for UDF registration.
For example, for the following:
```
sql("CREATE TEMPORARY FUNCTION test_udf AS 'org.apache.spark.examples.sql.Spark33084'")
sql("SELECT test_udf(col1) FROM VALUES (1), (2), (3)").explain(true)
```
The output of the optimized plan will change from:
```
Aggregate [spark33084(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330846906be0f, 1, 1) AS spark33084(col1)#237]
+- LocalRelation [col1#223]
```
to
```
Aggregate [test_udf(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330847a62d697, 1, 1, Some(test_udf)) AS test_udf(col1)#237]
+- LocalRelation [col1#223]
```
### How was this patch tested?
Added new tests.
Closes#31500 from imback82/udaf_name.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Adds an assert to `FixedLengthRowBasedKeyValueBatch#appendRow` method to check the incoming vlen and klen by comparing them with the lengths stored as member variables as followup to https://github.com/apache/spark/pull/30788
### Why are the changes needed?
Add assert statement to catch similar bugs in future.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran some tests locally, though not easy to test.
Closes#31447 from yliou/SPARK-33726-Assert.
Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In Spark, `set -v` is defined as "Queries all properties that are defined in the SQLConf of the sparkSession".
But there are other external modules that also define properties and register them to SQLConf. In this case,
it can't be displayed by `set -v` until the conf object is initiated (i.e. calling the object at least once).
In this PR, I propose to eagerly initiate all the objects registered to SQLConf, so that `set -v` will always output
the completed properties.
### Why are the changes needed?
Improve the `set -v` command to produces completed and deterministic results
### Does this PR introduce _any_ user-facing change?
`set -v` command will dump more configs
### How was this patch tested?
existing tests
Closes#30363 from linhongliu-db/set-v.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose new options for the Parquet datasource:
1. `datetimeRebaseMode`
2. `int96RebaseMode`
Both options influence on loading ancient dates and timestamps column values from parquet files. The `datetimeRebaseMode` option impacts on loading values of the `DATE`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` types, `int96RebaseMode` impacts on loading of `INT96` timestamps.
The options support the same values as the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInRead` namely;
- `"LEGACY"`, when an option is set to this value, Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`, dates/timestamps are read AS IS from parquet files.
- `"EXCEPTION"`, when it is set as an option value, Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars.
### Why are the changes needed?
1. New options will allow to load parquet files from at least two sources in different rebasing modes in the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2)
df1.join(df2, ...)
```
Before the changes, it is impossible because the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` influences on both reads.
2. Mixing of Dataset/DataFrame and RDD APIs should become possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy")
spark.read.parquet(folder).distinct.rdd.collect()
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```
Closes#31489 from MaxGekk/parquet-rebase-options.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/31245:
```
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:112:53: value deep is not a member of Array[String]
[error] assert(sql("show tables").schema.fieldNames.deep ==
[error] ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:115:72: value deep is not a member of Array[String]
[error] assert(sql("show table extended like 'tbl'").schema.fieldNames.deep ==
[error] ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:121:55: value deep is not a member of Array[String]
[error] assert(sql("show tables").schema.fieldNames.deep ==
[error] ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:124:74: value deep is not a member of Array[String]
[error] assert(sql("show table extended like 'tbl'").schema.fieldNames.deep ==
[error] ^
```
It broke Scala 2.13 build. This PR works around by using ScalaTests' `===` that can compare `Array`s safely.
### Why are the changes needed?
To fix the build.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
CI in this PR should test it out.
Closes#31526 from HyukjinKwon/SPARK-34157.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add a note about pip installation test in RC for release vote template.
### Why are the changes needed?
To promote PySpark users to test PyPi distribution and pip installation.
### Does this PR introduce _any_ user-facing change?
No. It will be used for release vote.
### How was this patch tested?
N/A
Closes#31527 from HyukjinKwon/minor-update-vote-templ.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove an unnecessary quote in the documentation.
Super trivial.
### Why are the changes needed?
Fix a mistake.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Just doc
Closes#31523 from gengliangwang/removeQuote.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Keep consistence with other `SHOW` command according to https://github.com/apache/spark/pull/31341#issuecomment-774613080
### Why are the changes needed?
Keep consistence
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not need
Closes#31518 from AngersZhuuuu/SPARK-34239-followup.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current implement of some DDL not unify the output and not pass the output properly to physical command.
Such as: The `ShowTables` output attributes `namespace`, but `ShowTablesCommand` output attributes `database`.
As the query plan, this PR pass the output attributes from `ShowTables` to `ShowTablesCommand`, `ShowTableExtended ` to `ShowTablesCommand`.
Take `show tables` and `show table extended like 'tbl'` as example.
The output before this PR:
`show tables`
|database|tableName|isTemporary|
-- | -- | --
| default| tbl| false|
If catalog is v2 session catalog, the output before this PR:
|namespace|tableName|
-- | --
| default| tbl
`show table extended like 'tbl'`
|database|tableName|isTemporary| information|
-- | -- | -- | --
| default| tbl| false|Database: default...|
The output after this PR:
`show tables`
|namespace|tableName|isTemporary|
-- | -- | --
| default| tbl| false|
`show table extended like 'tbl'`
|namespace|tableName|isTemporary| information|
-- | -- | -- | --
| default| tbl| false|Database: default...|
### Why are the changes needed?
This PR have benefits as follows:
First, Unify schema for the output of SHOW TABLES.
Second, pass the output attributes could keep the expr ID unchanged, so that avoid bugs when we apply more operators above the command output dataframe.
### Does this PR introduce _any_ user-facing change?
Yes.
The output schema of `SHOW TABLES` replace `database` by `namespace`.
### How was this patch tested?
Jenkins test.
Closes#31245 from beliefer/SPARK-34157.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add some info log around commit log.
### Why are the changes needed?
Th commit job is a heavy option and we have seen many times Spark block at this code place due to the slow rpc with namenode or other.
It's better to record the time that commit job cost.
### Does this PR introduce _any_ user-facing change?
Yes, more info log.
### How was this patch tested?
Not need.
Closes#31471 from ulysses-you/add-commit-log.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
`Mockito.initMocks(Object)` is a deprecated api, should use `Mockito.openMocks(Object).close()` instead.
### Why are the changes needed?
Cleanup deprecation api usage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31487 from LuciferYang/mockito-api.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up for SPARK-34346 which causes a flakiness due to `core-site.xml` test resource file addition. This PR aims to remove the test resource `core/src/test/resources/core-site.xml` from `core` module.
### Why are the changes needed?
Due to the test resource `core-site.xml`, YARN UT becomes flaky in GitHub Action and Jenkins.
```
$ build/sbt "yarn/testOnly *.YarnClusterSuite -- -z SPARK-16414" -Pyarn
...
[info] YarnClusterSuite:
[info] - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) *** FAILED *** (20 seconds, 209 milliseconds)
[info] FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:210)
```
To isolate more, we may use `SPARK_TEST_HADOOP_CONF_DIR` like `yarn` module's `yarn/Client`, but it seems an overkill in `core` module.
```
// SPARK-23630: during testing, Spark scripts filter out hadoop conf dirs so that user's
// environments do not interfere with tests. This allows a special env variable during
// tests so that custom conf dirs can be used by unit tests.
val confDirs = Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR") ++
(if (Utils.isTesting) Seq("SPARK_TEST_HADOOP_CONF_DIR") else Nil)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#31515 from dongjoon-hyun/SPARK-34346-2.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update the Incorrect URL of the only developer Matei in pom.xml
### Why are the changes needed?
The current link was broken
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
change the link to https://cs.stanford.edu/people/matei/Closes#31512 from pingsutw/SPARK-34158.
Authored-by: Kevin <pingsutw@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
docs/pyspark-migration-guide.md
### Why are the changes needed?
broken link
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually build and check
Closes#31514 from raphaelauv/patch-2.
Authored-by: raphaelauv <raphaelauv@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr format DateLiteral and TimestampLiteral toString. For example:
```sql
SELECT * FROM date_dim WHERE d_date BETWEEN (cast('2000-03-11' AS DATE) - INTERVAL 30 days) AND (cast('2000-03-11' AS DATE) + INTERVAL 30 days)
```
Before this pr:
```
Condition : (((isnotnull(d_date#18) AND (d_date#18 >= 10997)) AND (d_date#18 <= 11057))
```
After this pr:
```
Condition : (((isnotnull(d_date#14) AND (d_date#14 >= 2000-02-10)) AND (d_date#14 <= 2000-04-10))
```
### Why are the changes needed?
Make the plan more readable.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#31455 from wangyum/SPARK-34342.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
With https://github.com/apache/spark/pull/31133 Avro schema evolution is introduce for partitioned hive tables where the schema is given by `avro.schema.literal`.
Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`.
### Why are the changes needed?
Without this PR the problem described in https://github.com/apache/spark/pull/31133 can be reproduced by tables where `avro.schema.url` is used. As in this case always the property value given at partition level is used for the `avro.schema.url`.
So for example when a new column (with a default value) is added to the table then one the following problem happens:
- when the new field is added after the last one the cell values will be null values instead of the default value
- when the schema is extended somewhere before the last field then values will be listed for the wrong column positions
Similar error will happen when one of the field is removed from the schema.
For details please check the attached unit tests where both cases are checked.
### Does this PR introduce _any_ user-facing change?
Fixes the potential value error.
### How was this patch tested?
The existing unit tests for schema evolution is generalized and reused.
New tests:
- `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)`
- `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)`
Closes#31501 from attilapiros/SPARK-34370.
Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Unwrap `SparkUpgradeException` from `ParquetDecodingException` in v2 `FilePartitionReader` in the same way as v1 implementation does: 3a299aa648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala (L180-L183)
### Why are the changes needed?
1. To be compatible with v1 implementation of the Parquet datasource.
2. To improve UX with Spark SQL by making `SparkUpgradeException` more visible.
### Does this PR introduce _any_ user-facing change?
Yes, it can.
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```
Closes#31497 from MaxGekk/parquet-spark-upgrade-exception.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to add ZStandardBenchmark as a base-line.
### Why are the changes needed?
This will prevent any regression when we upgrade Zstandard library in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually.
Closes#31498 from dongjoon-hyun/SPARK-ZSTD-BENCH.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1, clear predictionCol & probabilityCol, use tmp rawPred col, to avoid potential column conflict;
2, use array instead of map, to keep in line with the python side;
3, simplify transform
### Why are the changes needed?
if input dataset has a column whose name is `predictionCol`,`probabilityCol`,`RawPredictionCol`, transfrom will fail.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuite
Closes#31472 from zhengruifeng/ovr_submodel_skip_pred_prob.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Rewrote one `ExtractGenerator` case such that it would not rely on a side effect of the flatmap function.
### Why are the changes needed?
With the dataframe api it is possible to have a lazy sequence as the `output` of a `LogicalPlan`. When exploding a column on this dataframe using the `withColumn("newName", explode(col("name")))` method, the `ExtractGenerator` does not extract the generator and `CheckAnalysis` would throw an exception.
### Does this PR introduce _any_ user-facing change?
Bugfix
Before this, the work around was to put `.select("*")` before the explode.
### How was this patch tested?
UT
Closes#31213 from tanelk/SPARK-34141_extract_generator.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
`hashDistance` optimization: if two vectors in a pair are the same, directly return 0.0
### Why are the changes needed?
it should be faster than existing impl, because of short-circuit
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#31394 from zhengruifeng/min_hash_distance_opt.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Before this PR for a partitioned Avro Hive table when the SerDe is configured to read the partition data
the table level properties were overwritten by the partition level properties.
This PR changes this ordering by giving table level properties higher precedence thus when a new evolved schema
is set for the table this new schema will be used to read the partition data and not the original schema which was used for writing the data.
This new behavior is consistent with Apache Hive.
See the example used in the unit test `SPARK-26836: support Avro schema evolution`, in Hive this results in:
```
0: jdbc:hive2://<IP>:10000> select * from t;
INFO : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds
INFO : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds
INFO : OK
+---------------+-------------+-------------+
| t.col1 | t.col2 | t.ds |
+---------------+-------------+-------------+
| col1_default | col2_value | 1981-01-07 |
| col1_value | col2_value | 1983-04-27 |
+---------------+-------------+-------------+
2 rows selected (0.159 seconds)
```
### Why are the changes needed?
Without this change the old schema would be used. This can use a correctness issue when the new schema introduces
a new field with a default value (following the rules of schema evolution) before an existing field.
In this case the rows coming from the partition where the old schema was used will **contain values in wrong column positions**.
For example check the attached unit test `SPARK-26836: support Avro schema evolution`
Without this fix the result of the select on the table would be:
```
+----------+----------+----------+
| col1| col2| ds|
+----------+----------+----------+
|col2_value| null|1981-01-07|
|col1_value|col2_value|1983-04-27|
+----------+----------+----------+
```
With this fix:
```
+------------+----------+----------+
| col1| col2| ds|
+------------+----------+----------+
|col1_default|col2_value|1981-01-07|
| col1_value|col2_value|1983-04-27|
+------------+----------+----------+
```
### Does this PR introduce _any_ user-facing change?
Just fixes the value errors.
When a new column is introduced even to the last position then instead of 'null' the given default will be used.
### How was this patch tested?
This was tested with the unit tested included to the PR.
And manually on Apache Spark / Hive.
Closes#31133 from attilapiros/SPARK-26836.
Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR replaces `withTimeZone` defined and used in `OracleIntegrationSuite` with `DateTimeTestUtils.withDefaultTimeZone` which is defined as a utility method.
### Why are the changes needed?
Both methods are semantically the same so it might be better to use the utility one.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`OracleIntegrationSuite` passes.
Closes#31465 from sarutak/oracle-timezone-util.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
As a followup from discussion in https://github.com/apache/spark/pull/29804#discussion_r493100510 . Currently in data source v1 file scan `FileSourceScanExec`, [bucket filter pruning will only take effect with bucket table scan](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L542 ). However this is unnecessary, as bucket filter pruning can also happen if we disable bucketed table scan. Read files with bucket hash partitioning, and bucket filter pruning are two orthogonal features, and do not need to couple together.
### Why are the changes needed?
This help query leverage the benefit from bucket filter pruning to save CPU/IO to not read unnecessary bucket files, and do not bound by bucket table scan when the parallelism of tasks is a concern.
In addition, this also resolves the issue to reduce number of tasks launched for simple query with bucket column filter - SPARK-33207, because with bucket scan, we launch # of tasks to equal to # of buckets, and this is unnecessary.
### Does this PR introduce _any_ user-facing change?
Users will notice query to start pruning irrelevant files for reading bucketed table, when disabling bucketing. If the input data does not follow spark data source bucketing convention, by default exception will be thrown and query will be failed. The exception can be bypassed with setting config `spark.sql.files.ignoreCorruptFiles` to true.
### How was this patch tested?
Added unit test in `BucketedReadSuite.scala` to make all existing unit tests for bucket filter work with this PR.
Closes#31413 from c21/bucket-pruning.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/28027https://github.com/apache/spark/pull/28027 added a DS v2 API that allows data sources to produce metadata/hidden columns that can only be seen when it's explicitly selected. The way we integrate this API into Spark is:
1. The v2 relation gets normal output and metadata output from the data source, and the metadata output is excluded from the plan output by default.
2. column resolution can resolve `UnresolvedAttribute` with metadata columns, even if the child plan doesn't output metadata columns.
3. An analyzer rule searches the query plan, trying to find a node that has missing inputs. If such node is found, transform the sub-plan of this node, and update the v2 relation to include the metadata output.
The analyzer rule in step 3 brings a perf regression, for queries that do not read v2 tables at all. This rule will calculate `QueryPlan.inputSet` (which builds an `AttributeSet` from outputs of all children) and `QueryPlan.missingInput` (which does a set exclusion and creates a new `AttributeSet`) for every plan node in the query plan. In our benchmark, the TPCDS query compilation time gets increased by more than 10%
This PR proposes a simple way to improve it: we add a special metadata entry to the metadata attribute, which allows us to quickly check if a plan needs to add metadata columns: we just check all the references of this plan, and see if the attribute contains the special metadata entry, instead of calculating `QueryPlan.missingInput`.
This PR also fixes one bug: we should not change the final output schema of the plan, if we only use metadata columns in operators like filter, sort, etc.
### Why are the changes needed?
Fix perf regression in SQL query compilation, and fix a bug.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run `org.apache.spark.sql.TPCDSQuerySuite`, before this PR, `AddMetadataColumns` is the top 4 rule ranked by running time
```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 407641
Total time: 47.257239779 seconds
Rule Effective Time / Total Time Effective Runs / Total Runs
OptimizeSubqueries 4157690003 / 8485444626 49 / 2778
Analyzer$ResolveAggregateFunctions 1238968711 / 3369351761 49 / 2141
ColumnPruning 660038236 / 2924755292 338 / 6391
Analyzer$AddMetadataColumns 0 / 2918352992 0 / 2151
```
after this PR:
```
Analyzer$AddMetadataColumns 0 / 122885629 0 / 2151
```
This rule is 20 times faster and is negligible to the total compilation time.
This PR also add new tests to verify the bug fix.
Closes#31440 from cloud-fan/metadata-col.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Extract the date/timestamps rebasing tests from `ParquetIOSuite` to `ParquetRebaseDatetimeSuite` to run them for both DSv1 and DSv2 implementations of Parquet datasource.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetIOSuite"
```
Closes#31478 from MaxGekk/rebase-tests-dsv1-and-dsv2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR added tests for some non-array types in PostgreSQL.
PostgreSQL supports wide range of types (https://www.postgresql.org/docs/13/datatype.html) and `PostgresIntegrationSuite` contains tests for some types but ones for the following types are missing.
* bit varying
* point
* line
* lseg
* box
* path
* polygon
* circle
* pg_lsn
* macaddr
* macaddr8
* numeric
* pg_snapshot
* real
* time
* timestamp
* tsquery
* tsvector
* txid_snapshot
* xml
NOTE: Handling money types can be buggy so this PR doesn't add tests for those types.
### Why are the changes needed?
To ensure those types work with Spark well.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Extended `PostgresIntegrationSuite`.
Closes#31456 from sarutak/test-for-some-types-postgresql.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/26006
In #26006 , we merged the v1 and v2 SHOW DATABASES/NAMESPACES commands, but we missed a behavior change that the output schema of SHOW DATABASES becomes different.
This PR adds a legacy config to restore the old schema, with a migration guide item to mention this behavior change.
### Why are the changes needed?
Improve backward compatibility
### Does this PR introduce _any_ user-facing change?
No (the legacy config is false by default)
### How was this patch tested?
a new test
Closes#31474 from cloud-fan/command-schema.
Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Param Validation throw `IllegalArgumentException`
### Why are the changes needed?
Param Validation should throw `IllegalArgumentException` instead of `IllegalStateException`
### Does this PR introduce _any_ user-facing change?
Yes, the type of exception changed
### How was this patch tested?
existing testsuites
Closes#31469 from zhengruifeng/mllib_exceptions.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
In many real-world cases, when interacting with hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size(65536)` to reset `io.file.buffer.size(4096)`. But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`.
1. The configuration priority for setting Hadoop and Hive config here is not right, while literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`
2. This breaks `spark.buffer.size` congfig's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml
### Why are the changes needed?
bugfix for configuration behavior and fix performance regression by that behavior change
### Does this PR introduce _any_ user-facing change?
this pr restores silent user face change
### How was this patch tested?
new tests
Closes#31460 from yaooqinn/SPARK-34346.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to expose the number of total paths in Utils.buildLocationMetadata(), with relaxing space usage a bit (around 10+ chars).
Suppose the first 2 of 5 paths are only fit to the threshold, the outputs between the twos are below:
* before the change: `[path1, path2]`
* after the change: `(5 paths)[path1, path2, ...]`
### Why are the changes needed?
SPARK-31793 silently truncates the paths hence end users can't indicate how many paths are truncated, and even more, whether paths are truncated or not.
### Does this PR introduce _any_ user-facing change?
Yes, the location metadata will also show how many paths are truncated (not shown), instead of silently truncated.
### How was this patch tested?
Modified UTs
Closes#31464 from HeartSaVioR/SPARK-34339.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Due to user-experience (confusing to Spark users - java.sql.Time using milliseconds vs Spark using microseconds; and user losing useful functions like hour(), minute(), etc on the column), we have decided to revert back to use TimestampType but this time we will enforce the hour to be consistently across system timezone (via offset manipulation) and date part fixed to zero epoch.
Full Discussion with Wenchen Fan Wenchen Fan regarding this ticket is here https://github.com/apache/spark/pull/30902#discussion_r569186823
### Why are the changes needed?
Revert and improvement to sql.Time handling
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests and integration tests
Closes#31473 from saikocat/SPARK-34357.
Authored-by: Hoa <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/30701. It uses properties of `hadoop-client-api.artifact`, `hadoop-client-runtime.artifact` and `hadoop-client-minicluster.artifact` explicitly to set the dependencies and versions.
Otherwise, it is logically incorrect. For example, if you build with Hadoop 2, this dependency becomes `hadoop-client-api:2.7.4` internally, which does not exist in Hadoop 2 (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api).
### Why are the changes needed?
- To fix the logical incorrectness.
- It fixes a potential issue: this actually caused an issue when `generate-sources` plugin is used together with Hadoop 2 by default, which attempts to pull 2.7.4 of `hadoop-client-api`, `hadoop-client-runtime` and `hadoop-client-minicluster` for whatever reason.
### Does this PR introduce _any_ user-facing change?
No for users and dev. It's more a cleanup.
### How was this patch tested?
Manually checked the dependencies are correctly placed.
Closes#31467 from HyukjinKwon/SPARK-33212.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Skip the zinc related installation operations on aarch64 platform.
### Why are the changes needed?
The standalone zinc is not supported well on aarch64, so that the error ouput, `cannot execute binary file: Exec format error` dumped after build/mvn is called.
This patch try to skip the zinc installation and related operations on aarch64 to make sure the error output doesn't print again on aarch64.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
simple cmd: `build/mvn -v`, see no error ouput again in aarch64, and nothing changed on x86
- on AArch64 Ubuntu
```
rootyikun-arm:~/dev/spark# uname -a
Linux yikun-arm 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:10 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
rootyikun-arm:~/dev/spark# uname -m
aarch64
rootyikun-arm:~/dev/spark# build/mvn -v
Using `mvn` from path: /root/dev/spark/build/apache-maven-3.6.3/bin/mvn
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /root/dev/spark/build/apache-maven-3.6.3
Java version: 1.8.0_222, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-arm64/jre
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-70-generic", arch: "aarch64", family: "unix"
```
- on x86 Mac OS
```
# uname -a
Darwin MacBook.local 19.6.0 Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64 x86_64
# uname -m
x86_64
# build/mvn -v
Using `mvn` from path: /Users/jiangyikun/huawei/apache-maven-3.6.3/bin/mvn
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /Users/jiangyikun/huawei/apache-maven-3.6.3
Java version: 1.8.0_221, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "mac os x", version: "10.15.7", arch: "x86_64", family: "mac"
```
- on x86 Ubuntu
```
rootyikun-x86:~/spark# uname -a
Linux yikun-x86 5.4.0-58-generic #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
rootyikun-x86:~/spark# uname -m
x86_64
rootyikun-x86:~/spark# ./build//mvn -v
Using `mvn` from path: /root/spark/build/apache-maven-3.6.3/bin/mvn
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /root/spark/build/apache-maven-3.6.3
Java version: 1.8.0_275, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-58-generic", arch: "amd64", family: "unix"
```
Closes#31454 from Yikun/zinc_skip_aarch64.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix the UTs being added in SPARK-31793, so that all things contributing the length limit are properly accounted.
### Why are the changes needed?
The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified UTs explain the missing points, which also do the test.
Closes#31449 from HeartSaVioR/SPARK-34326-v2.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR aims two goals.
1. Support ZSTD JNI BufferPool feature by adding a new configuration, `spark.io.compression.zstd.bufferPool.enabled`, for Apache Spark 3.2.0.
2. Make Spark independent from ZSTD JNI library's default buffer pool policy change.
### Why are the changes needed?
ZSTD JNI library has different behaviors across its versions.
| Version | Description | Commit |
| ---------- | --------------- | ----------- |
| v1.4.5-7 | `BufferPool` was added and used it by default | 4f55c89172 |
| v1.4.5-8 | `RecyclingBufferPool` was added and `BufferPool` became an interface to allow custom BufferPool implementation | dd2588edd3 |
| v1.4.7+ | `NoPool` is used by default and user should specify buffer pool explicitly | f7c8279bc1 |
### Does this PR introduce _any_ user-facing change?
No, the default value (`false`) is consistent with the AS-IS ZSTD-JNI library's default buffer pool.
### How was this patch tested?
Pass the CIs with the updated UT.
Closes#31453 from dongjoon-hyun/SPARK-34340.
Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow up to #31424, and proposes to use `UnresolvedTable.relationTypeMismatchHint` when `UnresolvedTable` is resolved to a temp view.
### Why are the changes needed?
This change utilizes the type mismatch hint when a relation is resolved to a temp view when a table is expected.
For example, `ALTER TABLE tmpView SET TBLPROPERTIES ('p' = 'an')` will now include `Please use ALTER VIEW instead.` in the exception message: `tmpView is a temp view. 'ALTER TABLE ... SET TBLPROPERTIES' expects a table. Please use ALTER VIEW instead.`
### Does this PR introduce _any_ user-facing change?
Yes, adds the hint in the exception message.
### How was this patch tested?
Update existing tests to include the hint.
Closes#31452 from imback82/followup_SPARK-34317.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove the check `verifyAlterTableType()` from the following v1 commands:
- AlterTableAddPartitionCommand
- AlterTableDropPartitionCommand
- AlterTableRenamePartitionCommand
- AlterTableRecoverPartitionsCommand
- AlterTableSerDePropertiesCommand
- AlterTableSetLocationCommand
The check is not needed any more after migration on new resolution framework, see SPARK-29900.
Also new tests were added to:
- AlterTableAddPartitionSuiteBase
- AlterTableDropPartitionSuiteBase
- AlterTableRenamePartitionSuiteBase
- v1/AlterTableRecoverPartitionsSuite
and removed duplicate tests from `SQLViewSuite` and `HiveDDLSuite`.
The tests for `AlterTableSerDePropertiesCommand`/`AlterTableSetLocationCommand` exist in SQLViewSuite` and `HiveDDLSuite`, and they can be ported to unified tests after SPARK-34305 and SPARK-34332.
The `ALTER TABLE .. CHANGE COLUMN` command accepts only tables too, so, the check can be removed after migration on new resolution framework, SPARK-34302.
### Why are the changes needed?
To improve code maintenance by removing dead code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
1. Added new tests to unified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
```
2. Run the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SQLViewSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveDDLSuite"
```
Closes#31405 from MaxGekk/remove-view-check-in-alter-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>