Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon.
Specifically:
- Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated.
- Added pydoc, including an example, for the use case where the user supplies a cogrouping function that includes a key.
- Added the boilerplate for doctests to cogroup.py. Note that cogroup.py only contains the apply() function, which has doctests disabled, as with the other Pandas UDFs.
- Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package.
- Some minor formatting tweaks.
This was tested by running the appropriate unit tests. I'm unsure as to how to check that my change will cause the documentation to be generated correctly, but if someone can describe how I can do this I'd be happy to check.
Closes#25939 from d80tb7/SPARK-27463-fixes.
Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Please refer [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) to see rationalization of this patch.
This patch adds the functionality to detect possible correctness issues when using multiple stateful operations in a single streaming query, and logs a warning message to inform end users.
This patch also documents caveats of using multiple stateful operations in a single query, and provides one known alternative.
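As an illustration of the kind of detection described above, a minimal sketch is shown below; the plan classes are real Catalyst operators, but the function itself is illustrative and not the actual UnsupportedOperationChecker code.
```scala
import org.apache.spark.sql.catalyst.plans.logical._

// Hedged sketch: collect stateful operators from a streaming logical plan and
// warn when more than one is present (Join is stateful only for stream-stream
// joins; that distinction is simplified away here).
def warnOnMultipleStatefulOps(plan: LogicalPlan): Unit = {
  val statefulOps = plan.collect {
    case a: Aggregate => a
    case d: Deduplicate => d
    case f: FlatMapGroupsWithState => f
    case j: Join => j
  }
  if (statefulOps.length > 1) {
    println("WARN: multiple stateful operators detected in a single streaming query; " +
      "the query may produce incorrect results.")
  }
}
```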
## How was this patch tested?
Added new UTs in UnsupportedOperationsSuite to test various combinations of stateful operators in a streaming query.
Closes#24890 from HeartSaVioR/SPARK-28074.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
The current code uses org.apache.zookeeper:zookeeper:jar:3.4.6, which has known security vulnerabilities. Some security info is available at https://www.tenable.com/cve/CVE-2019-0201
That reference recommends upgrading `zookeeper` to 3.4.14 or later.
### Why are the changes needed?
This PR fixes the security vulnerabilities.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#25933 from beliefer/upgrade-zookeeper.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Update scalatest, scalacheck, scopt, clapper, and scala-parser-combinators to the latest maintenance releases that are also cross-published for Scala 2.13.
### Why are the changes needed?
To allow building for Scala 2.13 in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#25967 from srowen/SPARK-29289.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR regenerates the benchmark results in the `core` and `mllib` modules in order to compare JDK8/JDK11 results.
### Why are the changes needed?
According to the results, JDK11 is slightly faster for `PropertiesCloneBenchmark` and `UDTSerializationBenchmark`. In general, there is no regression in JDK11.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only PR. Manually run the benchmark.
Closes#25969 from dongjoon-hyun/SPARK-29297.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch adds AliasIdentifier to the list of classes that should be converted to Json in TreeNode.shouldConvertToJson.
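For illustration, a minimal sketch of this kind of whitelist change is shown below; the actual method lives in TreeNode.scala and its case list is longer, so treat this as an approximation rather than the real code.
```scala
import org.apache.spark.sql.catalyst.{AliasIdentifier, FunctionIdentifier, TableIdentifier}
import org.apache.spark.sql.catalyst.expressions.ExprId
import org.apache.spark.sql.types.StructField

// Hedged sketch: decide which Product values are serialized into the JSON plan.
def shouldConvertToJson(product: Product): Boolean = product match {
  case _: ExprId => true
  case _: StructField => true
  case _: TableIdentifier => true
  case _: FunctionIdentifier => true
  case _: AliasIdentifier => true // newly added so SubqueryAlias names are serialized
  case _ => false
}
```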
### Why are the changes needed?
When requesting prettyJson for an analyzed query plan that contains a SubqueryAlias, the name field of the SubqueryAlias is "null", like:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : null,
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
It seems the alias name was included in the JSON before SPARK-19602.
It is fixed by this patch:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : {
"product-class" : "org.apache.spark.sql.catalyst.AliasIdentifier",
"identifier" : "t1"
},
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
### Does this PR introduce any user-facing change?
Yes. This patch changes the null value of the name field of SubqueryAlias in the JSON string to the alias identifier.
### How was this patch tested?
Added unit test.
Closes#25959 from viirya/SPARK-29186.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Some of the columns of the JDBC/ODBC server tab in the Web UI are hard to understand.
We documented them in SPARK-28373, but I think it is better to have tooltips in the SQL statistics table to explain the columns.
![image](https://user-images.githubusercontent.com/12819544/64489775-38e48980-d257-11e9-868a-5f5f6a0f1e46.png)
The columns with new tooltips are finish time, close time, execution time, and duration.
![image](https://user-images.githubusercontent.com/12819544/64489858-1141f100-d258-11e9-9e4e-fae3299da465.png)
The improvements in UIUtils can be reused by other tables in the Web UI to add tooltips.
### Why are the changes needed?
It improves the understanding of the Web UI.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests were added, plus manual testing.
Closes#25723 from planga82/feature/SPARK-29019_tooltipjdbcServer.
Lead-authored-by: Unknown <soypab@gmail.com>
Co-authored-by: Pablo <soypab@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to specify the JDK8 default configurations `-XX:+UseParallelGC -XX:-UseDynamicNumberOfGCThreads` explicitly. As we see in this PR [here](https://github.com/apache/spark/pull/25966/files#diff-12b89b7ee67c63c2254b749c8f8d0694R10), this will make the comparison between JDK8 and JDK11 easier by removing a misleading regression.
**NOTE THAT THESE JVM CONFS ARE ONLY FOR BENCHMARK COMPARISON, NOT FOR A PRODUCTION**
### Why are the changes needed?
There are many JVM-level changes between JDK8 and JDK11. For example, the following are notable changes, and it turns out that (1) and (2) in particular show a misleading regression in our micro-benchmark environment because our micro-benchmarks use a small VM heap.
1. [JEP 248: Make G1 the Default Garbage Collector](https://bugs.openjdk.java.net/browse/JDK-8073273) **JDK9+**
2. [Enable UseDynamicNumberOfGCThreads by default](https://bugs.openjdk.java.net/browse/JDK-8198547) **JDK11+**
3. [Change default value of HeapSizePerGCThread](https://bugs.openjdk.java.net/browse/JDK-8200417) **JDK11+**
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only JVM configuration change. Manually, run the benchmark.
Closes#25966 from dongjoon-hyun/SPARK-29282.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
HiveClientImpl may log sensitive information, e.g. URLs, secrets, and tokens:
```scala
logDebug(
s"""
|Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
|$k=${if (k.toLowerCase(Locale.ROOT).contains("password")) "xxx" else v}
""".stripMargin)
```
So redact it using `SQLConf.get.redactOptions`.
I added a new overload to handle redacting a single key/value pair.
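A minimal sketch of that redaction is shown below, assuming the existing `SQLConf.get.redactOptions(Map[...])` API; the single key/value overload added by this patch is approximated here as a thin wrapper, so treat the helper name as illustrative.
```scala
import org.apache.spark.sql.internal.SQLConf

// Hedged sketch: redact a single key/value pair using the session's redaction rules.
def redactKV(k: String, v: String): String = {
  val redacted = SQLConf.get.redactOptions(Map(k -> v))(k)
  s"$k=$redacted"
}
```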
### Why are the changes needed?
Redact sensitive information when construct HiveClientImpl
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test.
Run the command
`sbin/start-thriftserver.sh`
In the log we can see
```
19/09/28 08:27:02 main DEBUG HiveClientImpl:
Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
hive.druid.metadata.password=*********(redacted)
```
Closes#25954 from AngersZhuuuu/SPARK-29247.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Support the syntax `ALTER (DATABASE|SCHEMA) database_name SET LOCATION path`. Please note that only Hive 3.x metastores support this syntax.
Ref:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://issues.apache.org/jira/browse/HIVE-8472
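A hedged usage sketch (database name and path are illustrative, and a Hive 3.x metastore is required as noted above):
```scala
spark.sql("ALTER DATABASE db_name SET LOCATION 'hdfs://namenode:8020/warehouse/db_name.db'")
```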
### Why are the changes needed?
Support more syntax.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#25883 from wangyum/SPARK-28476.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Added try/catch exception handling.
### Why are the changes needed?
The behavior during exception handling differs depending on which explain command is run. I think it should be unified.
[ >spark.sql("explain cost select * from hoge").show(false) ]
![cost](https://user-images.githubusercontent.com/55128575/65225389-09a80500-db00-11e9-9246-0f1a3a881595.png)
[ >spark.sql("explain extended select * from hoge").show(false) ]
![extemded](https://user-images.githubusercontent.com/55128575/65225430-188eb780-db00-11e9-99bf-ff550b2ffd12.png)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
tested manually
Closes#25848 from TomokoKomiyama/fix-explain.
Authored-by: TomokoKomiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR moves the Hive test jars (`hive-contrib-*.jar` and `hive-hcatalog-core-*.jar`) from Maven dependencies to local files.
### Why are the changes needed?
`--jars` can't be tested since `hive-contrib-*.jar` and `hive-hcatalog-core-*.jar` are already on the classpath.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test
Closes#25690 from wangyum/SPARK-27831-revert.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
`SET` commands do not contain the `_FUNC_` pattern a priori. In this PR, I propose to filter out such commands in the `using _FUNC_ instead of function names in examples` test.
### Why are the changes needed?
After the merge of https://github.com/apache/spark/pull/25942, examples will require particular settings. Currently, the whole expression example has to be ignored, which is too much. It makes sense to ignore only the `SET` commands in expression examples.
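A minimal sketch of that filtering, with illustrative example strings rather than the exact test code:
```scala
import java.util.Locale

// Hedged sketch: drop SET commands before asserting that every remaining example
// query references _FUNC_ instead of a concrete function name.
val exampleQueries = Seq(
  "SET spark.sql.parser.escapedStringLiterals=true;",
  "SELECT _FUNC_('Spark SQL', 5);")
val toCheck = exampleQueries.filterNot(_.trim.toUpperCase(Locale.ROOT).startsWith("SET "))
assert(toCheck.forall(_.contains("_FUNC_")))
```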
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the `using _FUNC_ instead of function names in examples` test.
Closes#25958 from MaxGekk/dont-check-_FUNC_-in-set.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch fixes the examples of Like/RLike so that they test the original intention correctly. The examples don't consider the default value of spark.sql.parser.escapedStringLiterals, which is false.
Please take a look at current example of Like:
d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)
If spark.sql.parser.escapedStringLiterals=false (the default), then the example should fail since there's a `\U` in the pattern, but it doesn't fail.
```
The escape character is '\'. If an escape character precedes a special symbol or another
escape character, the following character is matched literally. It is invalid to escape
any other character.
```
For the query
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%';
```
The SQL parser removes the single `\` (not sure whether that is intended), so the expressions for Like are constructed as follows (I've printed out the left and right expressions for Like/RLike):
> LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%`
which no longer reflects the original intention (see the left side).
The query below tests the original intention:
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';
```
> LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%`
Note that `\\\\` is needed in pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent normal character of `\`.
Same for RLIKE:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
which is OK, but
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
```
> RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.*`
which no longer reflects the original intention.
The query below tests the original intention:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
### Why are the changes needed?
Because the examples don't test the original intention. Spark now runs automated tests from these examples, so this is not only a documentation issue but also a test issue.
### Does this PR introduce any user-facing change?
No, as it only corrects documentation.
### How was this patch tested?
Added debug log (like above) and ran queries from `spark-sql`.
Closes#25957 from HeartSaVioR/SPARK-29281.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In this PR, I propose to clone the Spark session for each expression example. Examples can modify SQL settings and can influence each other if they run in the same Spark session in parallel.
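A minimal sketch of the isolation, using the public `newSession()` API for illustration (the actual change clones the session within the sql package so each example keeps its own SQLConf); assumes an active `spark` session:
```scala
// Hedged sketch: run an example against its own session so SET commands can't leak.
val isolated = spark.newSession()
isolated.sql("SET spark.sql.parser.escapedStringLiterals=true")
// the Like/RLike example is then executed only via `isolated.sql(...)`
```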
### Why are the changes needed?
This should fix test failures like [this](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/478/testReport/junit/org.apache.spark.sql/SQLQuerySuite/check_outputs_of_expression_examples/) in the check of the `Like` example:
```
org.apache.spark.sql.AnalysisException: the pattern '\%SystemDrive\%\Users%' is invalid, the escape character is not allowed to precede 'U';
at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:48)
at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:57)
at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:108)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `check outputs of expression examples` in `org.apache.spark.sql.SQLQuerySuite`
Closes#25956 from MaxGekk/fix-expr-examples-checks.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
SPARK-27210 enabled ManifestFileCommitProtocol to clean up incomplete output files at the task level if a task is aborted.
This patch extends that cleanup: it proposes that ManifestFileCommitProtocol also clean up complete but invalid output files at the job level if the job aborts. Please note that this works as a 'best effort', not a guarantee, just as in HadoopMapReduceCommitProtocol.
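A simplified, best-effort cleanup sketch of the idea (not the exact ManifestFileCommitProtocol code; the helper name and parameters are illustrative):
```scala
import scala.util.control.NonFatal
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hedged sketch: delete the files reported by committed tasks when the job aborts,
// swallowing failures so the abort itself never fails.
def bestEffortCleanup(conf: Configuration, committedFiles: Seq[Path]): Unit = {
  committedFiles.foreach { p =>
    try p.getFileSystem(conf).delete(p, false)
    catch { case NonFatal(e) => println(s"WARN: best-effort cleanup of $p failed: $e") }
  }
}
```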
## How was this patch tested?
Added UT.
Closes#24186 from HeartSaVioR/SPARK-27254.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
Log the full spark-submit command in SparkSubmit#launchApplication
Adding .python-version (pyenv file) to RAT exclusion list
### What changes were proposed in this pull request?
Original motivation [here](http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-obtain-the-full-command-to-be-invoked-by-SparkLauncher-td35144.html), expanded in the [Jira](https://issues.apache.org/jira/browse/SPARK-29070). In essence, we want to be able to log the full `spark-submit` command being constructed by `SparkLauncher`.
### Why are the changes needed?
Currently, it is not possible to directly obtain this information from the `SparkLauncher` instance, which makes debugging and customer support more difficult.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
`core` `sbt` tests were executed. The `SparkLauncherSuite` (where I added assertions to an existing test) was also checked. Within that, `testSparkLauncherGetError` is failing, but that appears not to have been caused by this change (failing for me even on the parent commit of c18f849d76).
Closes#25777 from jeff303/SPARK-29070.
Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
availableSlots is computed before the for loop that iterates over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as slots are taken in each iteration. The number of available slots checked by a barrier task set therefore has to be recomputed in every iteration from availableCpus.
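A minimal sketch of the recomputation (simplified; not the exact TaskSchedulerImpl code):
```scala
// Hedged sketch: derive the slot count from the remaining CPUs on every iteration
// instead of reusing a value computed once before the loop.
def slotsFrom(availableCpus: Array[Int], cpusPerTask: Int): Int =
  availableCpus.map(_ / cpusPerTask).sum

// inside the loop over task sets, a barrier stage is only launched when
// slotsFrom(availableCpus, cpusPerTask) >= taskSet.numTasks
```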
### Why are the changes needed?
Bugfix.
This could make resourceOffers attempt to start a barrier task set even though not enough slots are available. That would then be caught by the `require` in line 519, which throws an exception; the exception gets caught and ignored by the Dispatcher's MessageLoop, so nothing terrible happens, but it prevents resourceOffers from considering further TaskSets.
Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and can still be rolled back by the exception thrown in this `require`. Handling it more gracefully remains a TODO in SPARK-24818, but this fix should at least resolve the situation where the stage cannot launch because of insufficient slots.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added UT
Closes#23375
Closes#25946 from juliuszsompolski/SPARK-29263.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
This PR makes `element_at` in PySpark able to take PySpark `Column` instances.
### Why are the changes needed?
To match the Scala side. It seems this was intended but not working correctly, which makes it a bug.
### Does this PR introduce any user-facing change?
Yes. See below:
```python
from pyspark.sql import functions as F
x = spark.createDataFrame([([1,2,3],1),([4,5,6],2),([7,8,9],3)],['list','num'])
x.withColumn('aa',F.element_at('list',x.num.cast('int'))).show()
```
Before:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/functions.py", line 2059, in element_at
return Column(sc._jvm.functions.element_at(_to_java_column(col), extraction))
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
File "/.../forked/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
```
After:
```
+---------+---+---+
| list|num| aa|
+---------+---+---+
|[1, 2, 3]| 1| 1|
|[4, 5, 6]| 2| 5|
|[7, 8, 9]| 3| 9|
+---------+---+---+
```
### How was this patch tested?
Manually tested against literal, Python native types, and PySpark column.
Closes#25950 from HyukjinKwon/SPARK-29240.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Updated the SQL migration guide regarding the recently supported special date and timestamp values; see https://github.com/apache/spark/pull/25716 and https://github.com/apache/spark/pull/25708.
Closes#25834
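For illustration, a hedged example of the special values covered by the linked PRs (exact output depends on the session time zone and current time):
```scala
// Hedged sketch: special date/timestamp strings recognized in Spark 3.0.
spark.sql("SELECT DATE 'today', DATE 'tomorrow', TIMESTAMP 'now', TIMESTAMP 'epoch'").show()
```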
### Why are the changes needed?
To let users know about the new feature in Spark 3.0.
### Does this PR introduce any user-facing change?
No
Closes#25948 from MaxGekk/special-values-migration-guide.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Since HIVE-11878, Hive 2.3 sets a new UDFClassLoader to hiveConf.classLoader when initializing SessionState, and
1. ADDJarCommand will add jars to clientLoader.classLoader.
2. jars passed via --jars will be added to clientLoader.classLoader
3. jars passed via the hive conf `hive.aux.jars` [SPARK-28954](https://github.com/apache/spark/pull/25653) [SPARK-28840](https://github.com/apache/spark/pull/25542) will be added to clientLoader.classLoader too
For these reasons we cannot load the jars added by ADDJarCommand, because the class loader got changed. We reset it to clientLoader.classLoader here.
### Why are the changes needed?
Support for JDK 11.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
UT
```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt -Phive-thriftserver -Phadoop-3.2
hive/test-only *HiveSparkSubmitSuite -- -z "SPARK-8368: includes jars passed in through --jars"
hive-thriftserver/test-only *HiveThriftBinaryServerSuite -- -z "test add jar"
```
Closes#25775 from AngersZhuuuu/SPARK-29015-STS-JDK11.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
A new test compares the outputs of expression examples in comments with the results of `hiveResultString()`. I also fixed existing examples where the actual and expected outputs differ.
### Why are the changes needed?
This prevents mistakes in expression examples and fixes existing mistakes in the comments.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new test to `SQLQuerySuite`.
Closes#25942 from MaxGekk/run-expr-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Expose `BinaryClassificationMetrics.numBins` in `BinaryClassificationEvaluator`
2. Expose `RegressionMetrics.throughOrigin` in `RegressionEvaluator`
3. Add the metric `explainedVariance` in `RegressionEvaluator` (see the usage sketch below)
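A hedged usage sketch of the newly exposed params; the setter and metric names follow the description above and may differ slightly from the final API:
```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}

val binEval = new BinaryClassificationEvaluator().setNumBins(1000)   // expert param
val regEval = new RegressionEvaluator()
  .setThroughOrigin(true)                                            // expert param
  .setMetricName("var")                                              // explainedVariance
```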
### Why are the changes needed?
Existing functionality in mllib.metrics should also be exposed in ml.
### Does this PR introduce any user-facing change?
Yes, this PR adds two expert params and one metric option.
### How was this patch tested?
existing and added tests
Closes#25940 from zhengruifeng/evaluator_add_param.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Added a new config "spark.sql.additionalRemoteRepositories", a comma-delimited string config of the optional additional remote maven mirror.
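A hedged usage sketch of the new config (the mirror URLs are illustrative placeholders):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.additionalRemoteRepositories",
    "https://repo.example.com/maven2/,https://mirror.example.org/maven2/")
  .enableHiveSupport()
  .getOrCreate()
```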
### Why are the changes needed?
We need to connect to Maven repositories in IsolatedClientLoader for downloading Hive jars; end users can set this config if the default Maven Central repo is unreachable.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#25849 from xuanyuanking/SPARK-29175.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove unnecessary imports in `core` module.
### Why are the changes needed?
Clean code for Apache Spark 3.0.0.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Local test.
Closes#25927 from sev7e0/dev_0925.
Authored-by: sev7e0 <sev7e0@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.
### Why are the changes needed?
When saving a dataframe into Hadoop, Spark first checks whether the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is not actually used in the case of SaveMode.Append. In some file systems, the exists call can be expensive, so this PR makes that call only when necessary.
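A simplified sketch of that idea (not the exact InsertIntoHadoopFsRelationCommand code): the existence check is only issued for the modes that need it, so Append never pays for fs.exists.
```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Hedged sketch: decide whether to insert, calling fs.exists only when necessary.
def shouldInsert(fs: FileSystem, outputPath: Path, mode: SaveMode): Boolean = mode match {
  case SaveMode.Append => true                    // no existence check required
  case SaveMode.Overwrite => true                 // existing data is replaced anyway
  case SaveMode.Ignore => !fs.exists(outputPath)  // skip when data already exists
  case SaveMode.ErrorIfExists =>
    if (fs.exists(outputPath)) {
      throw new IllegalStateException(s"path $outputPath already exists")
    }
    true
}
```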
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests should cover it since this doesn't change the behavior.
Closes#25928 from rahij/rr/exists-upstream.
Authored-by: Rahij Ramsharan <rramsharan@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`--driver-java-options` is not passed to the driver process if the user runs the application in **YARN client** mode.
Run the command below:
```
./bin/spark-sql --master yarn \
--driver-java-options="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555"
```
**In Spark 2.4.4**
```
java ... -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555
org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555 ...
```
**In Spark 3.0**
```
java ...
org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5556 ...
```
This issue is caused by [SPARK-28980](https://github.com/apache/spark/pull/25684/files#diff-75e0f814aa3717db995fa701883dc4e1R395)
### Why are the changes needed?
Corrected the `isClientMode` API implementation
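A hedged, illustrative sketch of what such a check needs to take into account (not the actual SparkSubmit/launcher code): deploy mode can come from `--deploy-mode` or `spark.submit.deployMode` and defaults to client.
```scala
// Hedged sketch: the helper name mirrors the description; the logic is illustrative.
def isClientMode(conf: Map[String, String], deployModeArg: Option[String]): Boolean = {
  val deployMode = deployModeArg
    .orElse(conf.get("spark.submit.deployMode"))
    .getOrElse("client")
  deployMode == "client"
}
```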
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually,
![image](https://user-images.githubusercontent.com/35216143/65383114-c92dce80-dd2d-11e9-86c1-60e6d7e09f1e.png)
Closes#25889 from sandeep-katta/yarnmode.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the order of elements when logging tokens. Header columns are printed as
```
"TOKENID", "HMAC", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```
whereas the code prints out actual information as
```
"HMAC"(redacted), "TOKENID", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```
This patch fixes this.
### Why are the changes needed?
Not critical, but it doesn't line up with the header columns.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A, as it's only logged at debug level and it's obvious what/where the problem is and how it can be fixed.
Closes#25935 from HeartSaVioR/SPARK-27748-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Changed 'Phive-thriftserver' to ' -Phive-thriftserver'.
### Why are the changes needed?
Typo
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Manually tested.
Closes#25937 from TomokoKomiyama/fix-build-doc.
Authored-by: Tomoko Komiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:
1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.
### Why are the changes needed?
Unify the external database dialect with one configuration, instead of small flags.
### Does this PR introduce any user-facing change?
A new configuration `spark.sql.dialect` for choosing a database dialect.
### How was this patch tested?
Existing tests.
Closes#25697 from gengliangwang/dialect.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename the package pgSQL to postgreSQL
### Why are the changes needed?
To address the comment in https://github.com/apache/spark/pull/25697#discussion_r328431070 . The official full name seems more reasonable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#25936 from gengliangwang/renamePGSQL.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
It is very confusing that the default save mode differs between the internal implementations of a data source. The reason we had to have saveModeForDSV2 was that there was no easy way to check the existence of a table in DataSource v2. Now we have catalogs for that, so we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table and therefore create one. That will come in a future PR.
### Why are the changes needed?
Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.
### Does this PR introduce any user-facing change?
It changes the default save mode for V2 Tables in the DataFrameWriter APIs
### How was this patch tested?
Existing tests
Closes#25876 from brkyvz/removeSM.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to skip PlanExpression when doing subexpression elimination on executors.
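A minimal sketch of that skip (illustrative, not the exact equivalence-tracking code in this patch):
```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression}

// Hedged sketch: any expression tree containing a PlanExpression (e.g. ScalarSubquery)
// is excluded from subexpression elimination on executors.
def eligibleForElimination(e: Expression): Boolean =
  e.find(_.isInstanceOf[PlanExpression[_]]).isEmpty
```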
### Why are the changes needed?
Subexpression elimination can cause an NPE when applied to execution subquery expressions like ScalarSubquery on executors. This is because PlanExpression wraps a query plan. Comparing query plans on executors while eliminating subexpressions can cause unexpected errors, like an NPE when accessing transient fields.
The NPE looks like:
```
[info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID 3447, 10.0.0.196, executor driver): java.lang.NullPointerException
[info] at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548)
[info] at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36)
[info] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added unit test.
Closes#25925 from viirya/SPARK-29239.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Don't allow calling append, overwrite, or overwritePartitions after tableProperty is used in DataFrameWriterV2 because table properties are not set as part of operations on existing tables. Only tables that are created or replaced can set table properties.
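A hedged usage sketch of the resulting API shape, assuming `df` is an existing DataFrame and the table name/property are illustrative:
```scala
// Allowed: tableProperty composes with create/replace.
df.writeTo("catalog.db.events")
  .tableProperty("write.format", "parquet")
  .createOrReplace()

// No longer allowed: tableProperty followed by append/overwrite/overwritePartitions
// does not compile after this change.
// df.writeTo("catalog.db.events").tableProperty("k", "v").append()
```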
### Why are the changes needed?
The properties are discarded otherwise, so this avoids confusing behavior.
### Does this PR introduce any user-facing change?
Yes, but to a new API, DataFrameWriterV2.
### How was this patch tested?
Removed test cases that used this method and the append, etc. methods because they no longer compile.
Closes#25931 from rdblue/fix-dfw-v2-table-properties.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace function names in some expression examples by `_FUNC_`, and add a test to check that `_FUNC_` always present in all examples.
### Why are the changes needed?
Binding of a function name to an expression is performed in `FunctionRegistry`, which is the single source of truth. Expression examples should avoid using function names directly because that can make the examples invalid in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new test to `SQLQuerySuite` which analyzes expression examples and checks for the presence of `_FUNC_`.
Closes#25924 from MaxGekk/fix-func-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch changes the ReceiverSuite "receiver_life_cycle" test to record actual calls with timestamps in FakeReceiver/FakeReceiverSupervisor, so that it doesn't rely on the timing of stopping and starting the receiver while restarting. This lets us use a generously large timeout when verifying the restart, as we can verify stopping and starting together.
### Why are the changes needed?
The test is flaky without this patch. We increased the timeout to fix the flakiness of this test (15adcc8273), but even with the longer timeout it has still been failing intermittently.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
I reproduced the test failure artificially via the diff below:
```
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
index faf6db82d5..d8977543c0 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
@@ -191,9 +191,11 @@ private[streaming] abstract class ReceiverSupervisor(
// thread pool.
logWarning("Restarting receiver with delay " + delay + " ms: " + message,
error.getOrElse(null))
+ Thread.sleep(1000)
stopReceiver("Restarting receiver with delay " + delay + "ms: " + message, error)
logDebug("Sleeping for " + delay)
Thread.sleep(delay)
+ Thread.sleep(1000)
logInfo("Starting receiver again")
startReceiver()
logInfo("Receiver started again")
```
and confirmed this patch doesn't fail with the change.
Closes#25862 from HeartSaVioR/SPARK-23197-v2.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Protected `executorDataMap` with a lock when accessing it outside `DriverEndpoint`'s methods.
### Why are the changes needed?
Just as the comments:
>
// Accessing `executorDataMap` in `DriverEndpoint.receive/receiveAndReply` doesn't need any
// protection. But accessing `executorDataMap` out of `DriverEndpoint.receive/receiveAndReply`
// must be protected by `CoarseGrainedSchedulerBackend.this`. Besides, `executorDataMap` should
// only be modified in `DriverEndpoint.receive/receiveAndReply` with protection by
// `CoarseGrainedSchedulerBackend.this`.
`executorDataMap` is not thread-safe; it should be protected by a lock when accessed outside `DriverEndpoint`.
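A simplified sketch of the locking pattern (not the actual CoarseGrainedSchedulerBackend code; the class and method names are illustrative):
```scala
import scala.collection.mutable

// Hedged sketch: a map written only inside the endpoint's message loop must be
// read under the same lock when accessed from other threads.
class Backend {
  private val executorDataMap = new mutable.HashMap[String, String]

  // called from DriverEndpoint.receive, already under the backend lock in Spark
  def register(id: String, host: String): Unit = synchronized { executorDataMap(id) = host }

  // called from arbitrary threads, so it must take the same lock
  def isExecutorActive(id: String): Boolean = synchronized { executorDataMap.contains(id) }
}
```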
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes#25922 from ConeyLiu/executorDataMap.
Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When the current catalog is the session catalog, get/set the current namespace from/to the `SessionCatalog`.
### Why are the changes needed?
It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`.
Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog.
Thus, we can't use `CatalogManager` to track the current namespace of the session catalog, because it changes when the current catalog is changed. To keep a single source of truth, we should only track the current namespace of the session catalog in `SessionCatalog`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Newly added and updated test cases.
Closes#25903 from cloud-fan/current.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar"
### Why are the changes needed?
Provides a Spark-side config entry for Hive configurations.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT.
Closes#25661 from WeichenXu123/add_hive_conf.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, which has known security vulnerabilities. Some security info is available at https://www.tenable.com/cve/CVE-2019-16335 and https://www.tenable.com/cve/CVE-2019-14540
Those references recommend upgrading `jackson-databind` to 2.9.10 or later.
This PR also upgrades the version of jackson to 2.9.10.
### Why are the changes needed?
This PR fixes the security vulnerabilities.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#25912 from beliefer/upgrade-jackson.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When I use `ProcfsMetricsGetterSuite` for testing, it always throws `java.lang.NullPointerException`. I think there is a problem with where `new ProcfsMetricsGetter` is instantiated, which leads to `SparkEnv` not being initialized in time. This causes a `java.lang.NullPointerException` when the method is executed.
### Why are the changes needed?
For testing.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Local testing
Closes#25918 from sev7e0/dev_0924.
Authored-by: sev7e0 <sev7e0@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This PR is an enhanced version of https://github.com/apache/spark/pull/25805, so I've kept the original text. The problem with the original PR can be found in the comments.
This situation can happen when an external system (e.g. Oozie) generates delegation tokens for a Spark application. The Spark driver will then run against secured services and have proper credentials (the tokens), but no Kerberos credentials, so trying to do things that require a Kerberos credential fails.
Instead, if no Kerberos credentials are detected, just skip the whole delegation token code.
Tested with an application that simulates Oozie; fails before the fix, passes with the fix. Also ran other DT-related tests to make sure other functionality keeps working.
Closes#25901 from gaborgsomogyi/SPARK-29082.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Change the remote repo used in IsolatedClientLoader from datanucleus to the Google mirror of Maven Central.
### Why are the changes needed?
We need to connect to Maven repositories in IsolatedClientLoader for downloading Hive jars. The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is [no longer maintained](http://www.datanucleus.org:15080/downloads/maven2/README.txt). This causes download failures and makes Hive test cases flaky while the Jenkins host is blocked by the Maven Central repo.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#25915 from xuanyuanking/SPARK-29229.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add common methods to support extracting weighted instances.
### Why are the changes needed?
Today more and more ML algorithms support weighting; adding this method will make implementations simpler.
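A hedged sketch of such a shared helper (column names are illustrative; not the exact signature added by this PR):
```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hedged sketch: turn (label, weight, features) rows into weighted Instance objects
// that algorithm implementations can consume directly.
def extractInstances(df: DataFrame): RDD[Instance] =
  df.select("label", "weight", "features").rdd.map { row =>
    Instance(row.getDouble(0), row.getDouble(1), row.getAs[Vector](2))
  }
```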
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes#25802 from zhengruifeng/add_extractInstances.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Move the rule `RemoveAllHints` after the batch `Resolution`.
### Why are the changes needed?
User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test case
Closes#25746 from gatorsmile/moveRemoveAllHints.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added "array indices start at 1" in annotation to make it clear for the usage of function slice, in R Scala Python component
### Why are the changes needed?
It will throw an exception if the start value is 0, but array indices start at 0 most of the time in other contexts.
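For illustration, a hedged example of the documented behavior (assumes an active `spark` session, e.g. in spark-shell):
```scala
import org.apache.spark.sql.functions.{col, slice}
import spark.implicits._

val df = Seq(Seq(1, 2, 3, 4)).toDF("xs")
df.select(slice(col("xs"), 2, 2)).show()   // -> [2, 3]; passing start = 0 raises an error
```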
### Does this PR introduce any user-facing change?
Yes, more info is provided to the user.
### How was this patch tested?
No tests added, only doc change.
Closes#25704 from sheepstop/master.
Authored-by: sheepstop <yangting617@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>