Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon.
Specifically:
- Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated.
- Added pydoc, including an example, for the use case where the user supplies a cogrouping function that includes a key.
- Added the boilerplate for doctests to cogroup.py. Note that cogroup.py only contains the apply() function, which has doctests disabled, as with the other Pandas UDFs.
- Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package.
- Some minor formatting tweaks.
This was tested by running the appropriate unit tests. I'm unsure as to how to check that my change will cause the documentation to be generated correctly, but if someone can describe how I can do this I'd be happy to check.
Closes#25939 from d80tb7/SPARK-27463-fixes.
Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Please refer [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) to see rationalization of this patch.
This patch adds the functionality to detect possible correctness issues when using multiple stateful operations in a single streaming query, and logs a warning message to inform end users.
This patch also documents caveats of using multiple stateful operations in a single query, and provides one known alternative.
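As an illustration of the kind of detection described above, a minimal sketch is shown below; the plan classes are real Catalyst operators, but the function itself is illustrative and not the actual UnsupportedOperationChecker code.
```scala
import org.apache.spark.sql.catalyst.plans.logical._

// Hedged sketch: collect stateful operators from a streaming logical plan and
// warn when more than one is present (Join is stateful only for stream-stream
// joins; that distinction is simplified away here).
def warnOnMultipleStatefulOps(plan: LogicalPlan): Unit = {
  val statefulOps = plan.collect {
    case a: Aggregate => a
    case d: Deduplicate => d
    case f: FlatMapGroupsWithState => f
    case j: Join => j
  }
  if (statefulOps.length > 1) {
    println("WARN: multiple stateful operators detected in a single streaming query; " +
      "the query may produce incorrect results.")
  }
}
```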
## How was this patch tested?
Added new UTs in UnsupportedOperationsSuite to test various combinations of stateful operators in a streaming query.
Closes#24890 from HeartSaVioR/SPARK-28074.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
The current code uses org.apache.zookeeper:zookeeper:jar:3.4.6, which has known security vulnerabilities. Some security info is available at https://www.tenable.com/cve/CVE-2019-0201
That reference recommends upgrading `zookeeper` to 3.4.14 or later.
### Why are the changes needed?
This PR fixes the security vulnerabilities.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#25933 from beliefer/upgrade-zookeeper.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Update scalatest, scalacheck, scopt, clapper, and scala-parser-combinators to the latest maintenance releases that are also cross-published for Scala 2.13.
### Why are the changes needed?
To allow building for Scala 2.13 in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#25967 from srowen/SPARK-29289.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR regenerates the benchmark results in the `core` and `mllib` modules in order to compare JDK8/JDK11 results.
### Why are the changes needed?
According to the results, JDK11 is slightly faster for `PropertiesCloneBenchmark` and `UDTSerializationBenchmark`. In general, there is no regression in JDK11.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only PR. Manually run the benchmark.
Closes#25969 from dongjoon-hyun/SPARK-29297.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch adds AliasIdentifier to the list of classes that should be converted to Json in TreeNode.shouldConvertToJson.
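For illustration, a minimal sketch of this kind of whitelist change is shown below; the actual method lives in TreeNode.scala and its case list is longer, so treat this as an approximation rather than the real code.
```scala
import org.apache.spark.sql.catalyst.{AliasIdentifier, FunctionIdentifier, TableIdentifier}
import org.apache.spark.sql.catalyst.expressions.ExprId
import org.apache.spark.sql.types.StructField

// Hedged sketch: decide which Product values are serialized into the JSON plan.
def shouldConvertToJson(product: Product): Boolean = product match {
  case _: ExprId => true
  case _: StructField => true
  case _: TableIdentifier => true
  case _: FunctionIdentifier => true
  case _: AliasIdentifier => true // newly added so SubqueryAlias names are serialized
  case _ => false
}
```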
### Why are the changes needed?
When requesting prettyJson for an analyzed query plan that contains a SubqueryAlias, the name field of the SubqueryAlias is "null", like:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : null,
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
It seems the alias name was included in the JSON before SPARK-19602.
It is fixed by this patch:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : {
"product-class" : "org.apache.spark.sql.catalyst.AliasIdentifier",
"identifier" : "t1"
},
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
### Does this PR introduce any user-facing change?
Yes. This patch changes the null value of the name field of SubqueryAlias in the JSON string to the alias identifier.
### How was this patch tested?
Added unit test.
Closes#25959 from viirya/SPARK-29186.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Some of the columns of the JDBC/ODBC server tab in the Web UI are hard to understand.
We documented them in SPARK-28373, but I think it is better to have tooltips in the SQL statistics table to explain the columns.
![image](https://user-images.githubusercontent.com/12819544/64489775-38e48980-d257-11e9-868a-5f5f6a0f1e46.png)
The columns with new tooltips are finish time, close time, execution time, and duration.
![image](https://user-images.githubusercontent.com/12819544/64489858-1141f100-d258-11e9-9e4e-fae3299da465.png)
The improvements in UIUtils can be reused by other tables in the Web UI to add tooltips.
### Why are the changes needed?
It improves the understanding of the Web UI.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests were added, plus manual testing.
Closes#25723 from planga82/feature/SPARK-29019_tooltipjdbcServer.
Lead-authored-by: Unknown <soypab@gmail.com>
Co-authored-by: Pablo <soypab@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to specify the JDK8 default configurations `-XX:+UseParallelGC -XX:-UseDynamicNumberOfGCThreads` explicitly. As we see in this PR [here](https://github.com/apache/spark/pull/25966/files#diff-12b89b7ee67c63c2254b749c8f8d0694R10), this will make the comparison between JDK8 and JDK11 easier by removing a misleading regression.
**NOTE THAT THESE JVM CONFS ARE ONLY FOR BENCHMARK COMPARISON, NOT FOR A PRODUCTION**
### Why are the changes needed?
There are many JVM-level changes between JDK8 and JDK11. For example, the following are notable changes, and it turns out that (1) and (2) in particular show a misleading regression in our micro-benchmark environment because our micro-benchmarks use a small VM heap.
1. [JEP 248: Make G1 the Default Garbage Collector](https://bugs.openjdk.java.net/browse/JDK-8073273) **JDK9+**
2. [Enable UseDynamicNumberOfGCThreads by default](https://bugs.openjdk.java.net/browse/JDK-8198547) **JDK11+**
3. [Change default value of HeapSizePerGCThread](https://bugs.openjdk.java.net/browse/JDK-8200417) **JDK11+**
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only JVM configuration change. Manually, run the benchmark.
Closes#25966 from dongjoon-hyun/SPARK-29282.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
HiveClientImpl may log sensitive information, e.g. URLs, secrets, and tokens:
```scala
logDebug(
s"""
|Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
|$k=${if (k.toLowerCase(Locale.ROOT).contains("password")) "xxx" else v}
""".stripMargin)
```
So redact it using `SQLConf.get.redactOptions`.
I added a new overload to handle redacting a single key/value pair.
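A minimal sketch of that redaction is shown below, assuming the existing `SQLConf.get.redactOptions(Map[...])` API; the single key/value overload added by this patch is approximated here as a thin wrapper, so treat the helper name as illustrative.
```scala
import org.apache.spark.sql.internal.SQLConf

// Hedged sketch: redact a single key/value pair using the session's redaction rules.
def redactKV(k: String, v: String): String = {
  val redacted = SQLConf.get.redactOptions(Map(k -> v))(k)
  s"$k=$redacted"
}
```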
### Why are the changes needed?
Redact sensitive information when construct HiveClientImpl
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test.
Run the command
`sbin/start-thriftserver.sh`
In the log we can see
```
19/09/28 08:27:02 main DEBUG HiveClientImpl:
Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
hive.druid.metadata.password=*********(redacted)
```
Closes#25954 from AngersZhuuuu/SPARK-29247.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Support the syntax `ALTER (DATABASE|SCHEMA) database_name SET LOCATION path`. Please note that only Hive 3.x metastores support this syntax.
Ref:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://issues.apache.org/jira/browse/HIVE-8472
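A hedged usage sketch (database name and path are illustrative, and a Hive 3.x metastore is required as noted above):
```scala
spark.sql("ALTER DATABASE db_name SET LOCATION 'hdfs://namenode:8020/warehouse/db_name.db'")
```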
### Why are the changes needed?
Support more syntax.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#25883 from wangyum/SPARK-28476.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Added try/catch exception handling.
### Why are the changes needed?
The behavior during exception handling differs depending on which explain command is run. I think it should be unified.
[ >spark.sql("explain cost select * from hoge").show(false) ]
![cost](https://user-images.githubusercontent.com/55128575/65225389-09a80500-db00-11e9-9246-0f1a3a881595.png)
[ >spark.sql("explain extended select * from hoge").show(false) ]
![extemded](https://user-images.githubusercontent.com/55128575/65225430-188eb780-db00-11e9-99bf-ff550b2ffd12.png)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
tested manually
Closes#25848 from TomokoKomiyama/fix-explain.
Authored-by: TomokoKomiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR moves the Hive test jars (`hive-contrib-*.jar` and `hive-hcatalog-core-*.jar`) from Maven dependencies to local files.
### Why are the changes needed?
`--jars` can't be tested since `hive-contrib-*.jar` and `hive-hcatalog-core-*.jar` are already on the classpath.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test
Closes#25690 from wangyum/SPARK-27831-revert.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
`SET` commands do not contain the `_FUNC_` pattern a priori. In this PR, I propose to filter out such commands in the `using _FUNC_ instead of function names in examples` test.
### Why are the changes needed?
After the merge of https://github.com/apache/spark/pull/25942, examples will require particular settings. Currently, the whole expression example has to be ignored, which is too much. It makes sense to ignore only the `SET` commands in expression examples.
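A minimal sketch of that filtering, with illustrative example strings rather than the exact test code:
```scala
import java.util.Locale

// Hedged sketch: drop SET commands before asserting that every remaining example
// query references _FUNC_ instead of a concrete function name.
val exampleQueries = Seq(
  "SET spark.sql.parser.escapedStringLiterals=true;",
  "SELECT _FUNC_('Spark SQL', 5);")
val toCheck = exampleQueries.filterNot(_.trim.toUpperCase(Locale.ROOT).startsWith("SET "))
assert(toCheck.forall(_.contains("_FUNC_")))
```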
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the `using _FUNC_ instead of function names in examples` test.
Closes#25958 from MaxGekk/dont-check-_FUNC_-in-set.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch fixes the examples of Like/RLike so that they test the original intention correctly. The examples don't consider the default value of spark.sql.parser.escapedStringLiterals, which is false.
Please take a look at current example of Like:
d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)
If spark.sql.parser.escapedStringLiterals=false (the default), then the example should fail since there's a `\U` in the pattern, but it doesn't fail.
```
The escape character is '\'. If an escape character precedes a special symbol or another
escape character, the following character is matched literally. It is invalid to escape
any other character.
```
For the query
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%';
```
The SQL parser removes the single `\` (not sure whether that is intended), so the expressions for Like are constructed as follows (I've printed out the left and right expressions for Like/RLike):
> LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%`
which no longer reflects the original intention (see the left side).
The query below tests the original intention:
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';
```
> LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%`
Note that `\\\\` is needed in pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent normal character of `\`.
Same for RLIKE:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
which is OK, but
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
```
> RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.*`
which no longer reflects the original intention.
The query below tests the original intention:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
### Why are the changes needed?
Because the examples don't test the original intention. Spark now runs automated tests from these examples, so this is not only a documentation issue but also a test issue.
### Does this PR introduce any user-facing change?
No, as it only corrects documentation.
### How was this patch tested?
Added debug log (like above) and ran queries from `spark-sql`.
Closes#25957 from HeartSaVioR/SPARK-29281.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In this PR, I propose to clone the Spark session for each expression example. Examples can modify SQL settings and can influence each other if they run in the same Spark session in parallel.
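A minimal sketch of the isolation, using the public `newSession()` API for illustration (the actual change clones the session within the sql package so each example keeps its own SQLConf); assumes an active `spark` session:
```scala
// Hedged sketch: run an example against its own session so SET commands can't leak.
val isolated = spark.newSession()
isolated.sql("SET spark.sql.parser.escapedStringLiterals=true")
// the Like/RLike example is then executed only via `isolated.sql(...)`
```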
### Why are the changes needed?
This should fix test failures like [this](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/478/testReport/junit/org.apache.spark.sql/SQLQuerySuite/check_outputs_of_expression_examples/) in the check of the `Like` example:
```
org.apache.spark.sql.AnalysisException: the pattern '\%SystemDrive\%\Users%' is invalid, the escape character is not allowed to precede 'U';
at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:48)
at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:57)
at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:108)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `check outputs of expression examples` in `org.apache.spark.sql.SQLQuerySuite`
Closes#25956 from MaxGekk/fix-expr-examples-checks.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
SPARK-27210 enabled ManifestFileCommitProtocol to clean up incomplete output files at the task level if a task is aborted.
This patch extends that cleanup: it proposes that ManifestFileCommitProtocol also clean up complete but invalid output files at the job level if the job aborts. Please note that this works as a 'best effort', not a guarantee, just as in HadoopMapReduceCommitProtocol.
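A simplified, best-effort cleanup sketch of the idea (not the exact ManifestFileCommitProtocol code; the helper name and parameters are illustrative):
```scala
import scala.util.control.NonFatal
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hedged sketch: delete the files reported by committed tasks when the job aborts,
// swallowing failures so the abort itself never fails.
def bestEffortCleanup(conf: Configuration, committedFiles: Seq[Path]): Unit = {
  committedFiles.foreach { p =>
    try p.getFileSystem(conf).delete(p, false)
    catch { case NonFatal(e) => println(s"WARN: best-effort cleanup of $p failed: $e") }
  }
}
```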
## How was this patch tested?
Added UT.
Closes#24186 from HeartSaVioR/SPARK-27254.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
Log the full spark-submit command in SparkSubmit#launchApplication
Adding .python-version (pyenv file) to RAT exclusion list
### What changes were proposed in this pull request?
Original motivation [here](http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-obtain-the-full-command-to-be-invoked-by-SparkLauncher-td35144.html), expanded in the [Jira](https://issues.apache.org/jira/browse/SPARK-29070). In essence, we want to be able to log the full `spark-submit` command being constructed by `SparkLauncher`.
### Why are the changes needed?
Currently, it is not possible to directly obtain this information from the `SparkLauncher` instance, which makes debugging and customer support more difficult.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
`core` `sbt` tests were executed. The `SparkLauncherSuite` (where I added assertions to an existing test) was also checked. Within that, `testSparkLauncherGetError` is failing, but that appears not to have been caused by this change (failing for me even on the parent commit of c18f849d76).
Closes#25777 from jeff303/SPARK-29070.
Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
availableSlots is computed before the for loop that iterates over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as slots are taken in each iteration. The number of available slots checked by a barrier task set therefore has to be recomputed in every iteration from availableCpus.
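A minimal sketch of the recomputation (simplified; not the exact TaskSchedulerImpl code):
```scala
// Hedged sketch: derive the slot count from the remaining CPUs on every iteration
// instead of reusing a value computed once before the loop.
def slotsFrom(availableCpus: Array[Int], cpusPerTask: Int): Int =
  availableCpus.map(_ / cpusPerTask).sum

// inside the loop over task sets, a barrier stage is only launched when
// slotsFrom(availableCpus, cpusPerTask) >= taskSet.numTasks
```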
### Why are the changes needed?
Bugfix.
This could make resourceOffers attempt to start a barrier task set even though not enough slots are available. That would then be caught by the `require` in line 519, which throws an exception; the exception gets caught and ignored by the Dispatcher's MessageLoop, so nothing terrible happens, but it prevents resourceOffers from considering further TaskSets.
Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and can still be rolled back by the exception thrown in this `require`. Handling it more gracefully remains a TODO in SPARK-24818, but this fix should at least resolve the situation where the stage cannot launch because of insufficient slots.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added UT
Closes#23375
Closes#25946 from juliuszsompolski/SPARK-29263.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
This PR makes `element_at` in PySpark able to take PySpark `Column` instances.
### Why are the changes needed?
To match the Scala side. It seems this was intended but not working correctly, which makes it a bug.
### Does this PR introduce any user-facing change?
Yes. See below:
```python
from pyspark.sql import functions as F
x = spark.createDataFrame([([1,2,3],1),([4,5,6],2),([7,8,9],3)],['list','num'])
x.withColumn('aa',F.element_at('list',x.num.cast('int'))).show()
```
Before:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/functions.py", line 2059, in element_at
return Column(sc._jvm.functions.element_at(_to_java_column(col), extraction))
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
File "/.../forked/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
```
After:
```
+---------+---+---+
| list|num| aa|
+---------+---+---+
|[1, 2, 3]| 1| 1|
|[4, 5, 6]| 2| 5|
|[7, 8, 9]| 3| 9|
+---------+---+---+
```
### How was this patch tested?
Manually tested against literal, Python native types, and PySpark column.
Closes#25950 from HyukjinKwon/SPARK-29240.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Updated the SQL migration guide regarding the recently supported special date and timestamp values; see https://github.com/apache/spark/pull/25716 and https://github.com/apache/spark/pull/25708.
Closes#25834
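For illustration, a hedged example of the special values covered by the linked PRs (exact output depends on the session time zone and current time):
```scala
// Hedged sketch: special date/timestamp strings recognized in Spark 3.0.
spark.sql("SELECT DATE 'today', DATE 'tomorrow', TIMESTAMP 'now', TIMESTAMP 'epoch'").show()
```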
### Why are the changes needed?
To let users know about the new feature in Spark 3.0.
### Does this PR introduce any user-facing change?
No
Closes#25948 from MaxGekk/special-values-migration-guide.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Since HIVE-11878, Hive 2.3 sets a new UDFClassLoader to hiveConf.classLoader when initializing SessionState, and
1. ADDJarCommand will add jars to clientLoader.classLoader.
2. jars passed via --jars will be added to clientLoader.classLoader
3. jars passed via the hive conf `hive.aux.jars` [SPARK-28954](https://github.com/apache/spark/pull/25653) [SPARK-28840](https://github.com/apache/spark/pull/25542) will be added to clientLoader.classLoader too
For these reasons we cannot load the jars added by ADDJarCommand, because the class loader got changed. We reset it to clientLoader.classLoader here.
### Why are the changes needed?
Support for JDK 11.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
UT
```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt -Phive-thriftserver -Phadoop-3.2
hive/test-only *HiveSparkSubmitSuite -- -z "SPARK-8368: includes jars passed in through --jars"
hive-thriftserver/test-only *HiveThriftBinaryServerSuite -- -z "test add jar"
```
Closes#25775 from AngersZhuuuu/SPARK-29015-STS-JDK11.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
A new test compares the outputs of expression examples in comments with the results of `hiveResultString()`. I also fixed existing examples where the actual and expected outputs differ.
### Why are the changes needed?
This prevents mistakes in expression examples and fixes existing mistakes in the comments.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new test to `SQLQuerySuite`.
Closes#25942 from MaxGekk/run-expr-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Expose `BinaryClassificationMetrics.numBins` in `BinaryClassificationEvaluator`
2. Expose `RegressionMetrics.throughOrigin` in `RegressionEvaluator`
3. Add the metric `explainedVariance` in `RegressionEvaluator` (see the usage sketch below)
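A hedged usage sketch of the newly exposed params; the setter and metric names follow the description above and may differ slightly from the final API:
```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}

val binEval = new BinaryClassificationEvaluator().setNumBins(1000)   // expert param
val regEval = new RegressionEvaluator()
  .setThroughOrigin(true)                                            // expert param
  .setMetricName("var")                                              // explainedVariance
```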
### Why are the changes needed?
Existing functionality in mllib.metrics should also be exposed in ml.
### Does this PR introduce any user-facing change?
Yes, this PR adds two expert params and one metric option.
### How was this patch tested?
existing and added tests
Closes#25940 from zhengruifeng/evaluator_add_param.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Added a new config "spark.sql.additionalRemoteRepositories", a comma-delimited string config of the optional additional remote maven mirror.
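A hedged usage sketch of the new config (the mirror URLs are illustrative placeholders):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.additionalRemoteRepositories",
    "https://repo.example.com/maven2/,https://mirror.example.org/maven2/")
  .enableHiveSupport()
  .getOrCreate()
```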
### Why are the changes needed?
We need to connect to Maven repositories in IsolatedClientLoader for downloading Hive jars; end users can set this config if the default Maven Central repo is unreachable.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#25849 from xuanyuanking/SPARK-29175.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove unnecessary imports in `core` module.
### Why are the changes needed?
Clean code for Apache Spark 3.0.0.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Local test.
Closes#25927 from sev7e0/dev_0925.
Authored-by: sev7e0 <sev7e0@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.
### Why are the changes needed?
When saving a dataframe into Hadoop, Spark first checks whether the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is not actually used in the case of SaveMode.Append. In some file systems, the exists call can be expensive, so this PR makes that call only when necessary.
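A simplified sketch of that idea (not the exact InsertIntoHadoopFsRelationCommand code): the existence check is only issued for the modes that need it, so Append never pays for fs.exists.
```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Hedged sketch: decide whether to insert, calling fs.exists only when necessary.
def shouldInsert(fs: FileSystem, outputPath: Path, mode: SaveMode): Boolean = mode match {
  case SaveMode.Append => true                    // no existence check required
  case SaveMode.Overwrite => true                 // existing data is replaced anyway
  case SaveMode.Ignore => !fs.exists(outputPath)  // skip when data already exists
  case SaveMode.ErrorIfExists =>
    if (fs.exists(outputPath)) {
      throw new IllegalStateException(s"path $outputPath already exists")
    }
    true
}
```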
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests should cover it since this doesn't change the behavior.
Closes#25928 from rahij/rr/exists-upstream.
Authored-by: Rahij Ramsharan <rramsharan@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`--driver-java-options` is not passed to the driver process if the user runs the application in **YARN client** mode.
Run the command below:
```
./bin/spark-sql --master yarn \
--driver-java-options="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555"
```
**In Spark 2.4.4**
```
java ... -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555
org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555 ...
```
**In Spark 3.0**
```
java ...
org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5556 ...
```
This issue is caused by [SPARK-28980](https://github.com/apache/spark/pull/25684/files#diff-75e0f814aa3717db995fa701883dc4e1R395)
### Why are the changes needed?
Corrected the `isClientMode` API implementation
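A hedged, illustrative sketch of what such a check needs to take into account (not the actual SparkSubmit/launcher code): deploy mode can come from `--deploy-mode` or `spark.submit.deployMode` and defaults to client.
```scala
// Hedged sketch: the helper name mirrors the description; the logic is illustrative.
def isClientMode(conf: Map[String, String], deployModeArg: Option[String]): Boolean = {
  val deployMode = deployModeArg
    .orElse(conf.get("spark.submit.deployMode"))
    .getOrElse("client")
  deployMode == "client"
}
```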
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually,
![image](https://user-images.githubusercontent.com/35216143/65383114-c92dce80-dd2d-11e9-86c1-60e6d7e09f1e.png)
Closes#25889 from sandeep-katta/yarnmode.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the order of elements when logging tokens. Header columns are printed as
```
"TOKENID", "HMAC", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```
whereas the code prints out actual information as
```
"HMAC"(redacted), "TOKENID", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```
This patch fixes this.
### Why are the changes needed?
Not critical, but it doesn't line up with the header columns.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A, as it's only logged at debug level and it's obvious what/where the problem is and how it can be fixed.
Closes#25935 from HeartSaVioR/SPARK-27748-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Changed 'Phive-thriftserver' to ' -Phive-thriftserver'.
### Why are the changes needed?
Typo
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Manually tested.
Closes#25937 from TomokoKomiyama/fix-build-doc.
Authored-by: Tomoko Komiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:
1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.
### Why are the changes needed?
Unify the external database dialect with one configuration, instead of small flags.
### Does this PR introduce any user-facing change?
A new configuration `spark.sql.dialect` for choosing a database dialect.
### How was this patch tested?
Existing tests.
Closes#25697 from gengliangwang/dialect.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename the package pgSQL to postgreSQL
### Why are the changes needed?
To address the comment in https://github.com/apache/spark/pull/25697#discussion_r328431070 . The official full name seems more reasonable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#25936 from gengliangwang/renamePGSQL.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
It is very confusing that the default save mode differs between the internal implementations of a data source. The reason we had to have saveModeForDSV2 was that there was no easy way to check the existence of a table in DataSource v2. Now we have catalogs for that, so we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table and therefore create one. That will come in a future PR.
### Why are the changes needed?
Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.
### Does this PR introduce any user-facing change?
It changes the default save mode for V2 Tables in the DataFrameWriter APIs
### How was this patch tested?
Existing tests
Closes#25876 from brkyvz/removeSM.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to skip PlanExpression when doing subexpression elimination on executors.
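A minimal sketch of that skip (illustrative, not the exact equivalence-tracking code in this patch):
```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression}

// Hedged sketch: any expression tree containing a PlanExpression (e.g. ScalarSubquery)
// is excluded from subexpression elimination on executors.
def eligibleForElimination(e: Expression): Boolean =
  e.find(_.isInstanceOf[PlanExpression[_]]).isEmpty
```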
### Why are the changes needed?
Subexpression elimination can cause an NPE when applied to execution subquery expressions like ScalarSubquery on executors. This is because PlanExpression wraps a query plan. Comparing query plans on executors while eliminating subexpressions can cause unexpected errors, like an NPE when accessing transient fields.
The NPE looks like:
```
[info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID 3447, 10.0.0.196, executor driver): java.lang.NullPointerException
[info] at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548)
[info] at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36)
[info] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added unit test.
Closes#25925 from viirya/SPARK-29239.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Don't allow calling append, overwrite, or overwritePartitions after tableProperty is used in DataFrameWriterV2 because table properties are not set as part of operations on existing tables. Only tables that are created or replaced can set table properties.
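A hedged usage sketch of the resulting API shape, assuming `df` is an existing DataFrame and the table name/property are illustrative:
```scala
// Allowed: tableProperty composes with create/replace.
df.writeTo("catalog.db.events")
  .tableProperty("write.format", "parquet")
  .createOrReplace()

// No longer allowed: tableProperty followed by append/overwrite/overwritePartitions
// does not compile after this change.
// df.writeTo("catalog.db.events").tableProperty("k", "v").append()
```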
### Why are the changes needed?
The properties are discarded otherwise, so this avoids confusing behavior.
### Does this PR introduce any user-facing change?
Yes, but to a new API, DataFrameWriterV2.
### How was this patch tested?
Removed test cases that used this method and the append, etc. methods because they no longer compile.
Closes#25931 from rdblue/fix-dfw-v2-table-properties.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace function names in some expression examples by `_FUNC_`, and add a test to check that `_FUNC_` always present in all examples.
### Why are the changes needed?
Binding of a function name to an expression is performed in `FunctionRegistry`, which is the single source of truth. Expression examples should avoid using function names directly because that can make the examples invalid in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new test to `SQLQuerySuite` which analyzes expression examples and checks for the presence of `_FUNC_`.
Closes#25924 from MaxGekk/fix-func-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch changes the ReceiverSuite "receiver_life_cycle" test to record actual calls with timestamps in FakeReceiver/FakeReceiverSupervisor, so that it doesn't rely on the timing of stopping and starting the receiver while restarting. This lets us use a generously large timeout when verifying the restart, as we can verify stopping and starting together.
### Why are the changes needed?
The test is flaky without this patch. We increased the timeout to fix the flakiness of this test (15adcc8273), but even with the longer timeout it has still been failing intermittently.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
I reproduced the test failure artificially via the diff below:
```
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
index faf6db82d5..d8977543c0 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
@@ -191,9 +191,11 @@ private[streaming] abstract class ReceiverSupervisor(
// thread pool.
logWarning("Restarting receiver with delay " + delay + " ms: " + message,
error.getOrElse(null))
+ Thread.sleep(1000)
stopReceiver("Restarting receiver with delay " + delay + "ms: " + message, error)
logDebug("Sleeping for " + delay)
Thread.sleep(delay)
+ Thread.sleep(1000)
logInfo("Starting receiver again")
startReceiver()
logInfo("Receiver started again")
```
and confirmed this patch doesn't fail with the change.
Closes#25862 from HeartSaVioR/SPARK-23197-v2.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Protected `executorDataMap` with a lock when accessing it outside `DriverEndpoint`'s methods.
### Why are the changes needed?
Just as the comments:
>
// Accessing `executorDataMap` in `DriverEndpoint.receive/receiveAndReply` doesn't need any
// protection. But accessing `executorDataMap` out of `DriverEndpoint.receive/receiveAndReply`
// must be protected by `CoarseGrainedSchedulerBackend.this`. Besides, `executorDataMap` should
// only be modified in `DriverEndpoint.receive/receiveAndReply` with protection by
// `CoarseGrainedSchedulerBackend.this`.
`executorDataMap` is not thread-safe; it should be protected by a lock when accessed outside `DriverEndpoint`.
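A simplified sketch of the locking pattern (not the actual CoarseGrainedSchedulerBackend code; the class and method names are illustrative):
```scala
import scala.collection.mutable

// Hedged sketch: a map written only inside the endpoint's message loop must be
// read under the same lock when accessed from other threads.
class Backend {
  private val executorDataMap = new mutable.HashMap[String, String]

  // called from DriverEndpoint.receive, already under the backend lock in Spark
  def register(id: String, host: String): Unit = synchronized { executorDataMap(id) = host }

  // called from arbitrary threads, so it must take the same lock
  def isExecutorActive(id: String): Boolean = synchronized { executorDataMap.contains(id) }
}
```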
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes#25922 from ConeyLiu/executorDataMap.
Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When the current catalog is the session catalog, get/set the current namespace from/to the `SessionCatalog`.
### Why are the changes needed?
It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`.
Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog.
Thus, we can't use `CatalogManager` to track the current namespace of the session catalog, because it changes when the current catalog is changed. To keep a single source of truth, we should only track the current namespace of the session catalog in `SessionCatalog`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Newly added and updated test cases.
Closes#25903 from cloud-fan/current.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar"
### Why are the changes needed?
Provides a Spark-side config entry for Hive configurations.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT.
Closes#25661 from WeichenXu123/add_hive_conf.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, which has known security vulnerabilities. Some security info is available at https://www.tenable.com/cve/CVE-2019-16335 and https://www.tenable.com/cve/CVE-2019-14540
Those references recommend upgrading `jackson-databind` to 2.9.10 or later.
This PR also upgrades the version of jackson to 2.9.10.
### Why are the changes needed?
This PR fixes the security vulnerabilities.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#25912 from beliefer/upgrade-jackson.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When I use `ProcfsMetricsGetterSuite` for testing, it always throws `java.lang.NullPointerException`. I think there is a problem with where `new ProcfsMetricsGetter` is instantiated, which leads to `SparkEnv` not being initialized in time. This causes a `java.lang.NullPointerException` when the method is executed.
### Why are the changes needed?
For testing.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Local testing
Closes#25918 from sev7e0/dev_0924.
Authored-by: sev7e0 <sev7e0@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This PR is an enhanced version of https://github.com/apache/spark/pull/25805, so I've kept the original text. The problem with the original PR can be found in the comments.
This situation can happen when an external system (e.g. Oozie) generates delegation tokens for a Spark application. The Spark driver will then run against secured services and have proper credentials (the tokens), but no Kerberos credentials, so trying to do things that require a Kerberos credential fails.
Instead, if no Kerberos credentials are detected, just skip the whole delegation token code.
Tested with an application that simulates Oozie; fails before the fix, passes with the fix. Also ran other DT-related tests to make sure other functionality keeps working.
Closes#25901 from gaborgsomogyi/SPARK-29082.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Change the remote repo used in IsolatedClientLoader from datanucleus to the Google mirror of Maven Central.
### Why are the changes needed?
We need to connect to Maven repositories in IsolatedClientLoader for downloading Hive jars. The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is [no longer maintained](http://www.datanucleus.org:15080/downloads/maven2/README.txt). This causes download failures and makes Hive test cases flaky while the Jenkins host is blocked by the Maven Central repo.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#25915 from xuanyuanking/SPARK-29229.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add common methods to support extracting weighted instances.
### Why are the changes needed?
Today more and more ML algorithms support weighting; adding this method will make implementations simpler.
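A hedged sketch of such a shared helper (column names are illustrative; not the exact signature added by this PR):
```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hedged sketch: turn (label, weight, features) rows into weighted Instance objects
// that algorithm implementations can consume directly.
def extractInstances(df: DataFrame): RDD[Instance] =
  df.select("label", "weight", "features").rdd.map { row =>
    Instance(row.getDouble(0), row.getDouble(1), row.getAs[Vector](2))
  }
```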
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes#25802 from zhengruifeng/add_extractInstances.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Move the rule `RemoveAllHints` after the batch `Resolution`.
### Why are the changes needed?
User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test case
Closes#25746 from gatorsmile/moveRemoveAllHints.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added "array indices start at 1" in annotation to make it clear for the usage of function slice, in R Scala Python component
### Why are the changes needed?
It will throw an exception if the start value is 0, but array indices start at 0 most of the time in other contexts.
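For illustration, a hedged example of the documented behavior (assumes an active `spark` session, e.g. in spark-shell):
```scala
import org.apache.spark.sql.functions.{col, slice}
import spark.implicits._

val df = Seq(Seq(1, 2, 3, 4)).toDF("xs")
df.select(slice(col("xs"), 2, 2)).show()   // -> [2, 3]; passing start = 0 raises an error
```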
### Does this PR introduce any user-facing change?
Yes, more info is provided to the user.
### How was this patch tested?
No tests added, only doc change.
Closes#25704 from sheepstop/master.
Authored-by: sheepstop <yangting617@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>