ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Thomas Graves	b425f8ee65	[SPARK-27492][DOC][YARN][K8S][CORE] Resource scheduling high level user docs ### What changes were proposed in this pull request? Document the resource scheduling feature - https://issues.apache.org/jira/browse/SPARK-24615 Add general docs, yarn, kubernetes, and standalone cluster specific ones. ### Why are the changes needed? Help users understand the feature ### Does this PR introduce any user-facing change? docs ### How was this patch tested? N/A Closes #25698 from tgravescs/SPARK-27492-gpu-sched-docs. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-09-11 08:22:36 -05:00
Kazuaki Ishizaki	8d1b5ba766	[SPARK-28906][BUILD] Fix incorrect information in bin/spark-submit --version ### What changes were proposed in this pull request? This PR allows `bin/spark-submit --version` to show the correct information while the previous versions, which were created by `dev/create-release/do-release-docker.sh`, show incorrect information. There are two root causes to show incorrect information: 1. Did not pass `USER` environment variable to the docker container 1. Did not keep `.git` directory in the work directory ### Why are the changes needed? The information is missing while the previous versions show the correct information. ### Does this PR introduce any user-facing change? Yes, the following is the console output in branch-2.3 ``` $ bin/spark-submit --version Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.4 /_/ Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212 Branch HEAD Compiled by user ishizaki on 2019-09-02T02:18:10Z Revision `8c6f8150f3` Url https://gitbox.apache.org/repos/asf/spark.git Type --help for more information. ``` Without this PR, the console output is as follows ``` $ spark-submit --version Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.4 /_/ Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_212 Branch Compiled by user on 2019-08-26T08:29:39Z Revision Url Type --help for more information. ``` ### How was this patch tested? After building the package, I manually executed `bin/spark-submit --version` Closes #25655 from kiszk/SPARK-28906. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-11 08:12:44 -05:00
mcheah	7f36cd2aa5	[SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API ## What changes were proposed in this pull request? Uses the APIs introduced in SPARK-28209 in the UnsafeShuffleWriter. ## How was this patch tested? Since this is just a refactor, existing unit tests should cover the relevant code paths. Micro-benchmarks from the original fork where this code was built show no degradation in performance. Closes #25304 from mccheah/shuffle-writer-refactor-unsafe-writer. Lead-authored-by: mcheah <mcheah@palantir.com> Co-authored-by: mccheah <mcheah@palantir.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-09-10 17:30:02 -07:00
Mick Jermsurawong	fa75db2059	[SPARK-29026][SQL] Improve error message in `schemaFor` in trait without companion object constructor ### What changes were proposed in this pull request? - For trait without companion object constructor, currently the method to get constructor parameters `constructParams` in `ScalaReflection` will throw exception. ``` scala.ScalaReflectionException: <none> is not a term at scala.reflect.api.Symbols$SymbolApi.asTerm(Symbols.scala:211) at scala.reflect.api.Symbols$SymbolApi.asTerm$(Symbols.scala:211) at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:106) at org.apache.spark.sql.catalyst.ScalaReflection.getCompanionConstructor(ScalaReflection.scala:909) at org.apache.spark.sql.catalyst.ScalaReflection.constructParams(ScalaReflection.scala:914) at org.apache.spark.sql.catalyst.ScalaReflection.constructParams$(ScalaReflection.scala:912) at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:47) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:890) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:886) at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:47) ``` - Instead this PR would throw exception: ``` Unable to find constructor for type [XXX]. This could happen if [XXX] is an interface or a trait without companion object constructor UnsupportedOperationException: ``` In the normal usage of ExpressionEncoder, this can happen if the type is interface extending `scala.Product`. Also, since this is a protected method, this could have been other arbitrary types without constructor. ### Why are the changes needed? - The error message `<none> is not a term` isn't helpful for users to understand the problem. ### Does this PR introduce any user-facing change? 
- The exception would be thrown instead of runtime exception from the `scala.ScalaReflectionException`. ### How was this patch tested? - Added a unit test to illustrate the `type` where expression encoder will fail and trigger the proposed error message. Closes #25736 from mickjermsurawong-stripe/SPARK-29026. Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-11 08:43:40 +09:00
angerszhu	54d3f6e7ec	[SPARK-28982][SQL] Implementation Spark's own GetTypeInfoOperation ### What changes were proposed in this pull request? Currently the TypeInfo returned by the Spark Thrift Server includes 1. INTERVAL_YEAR_MONTH 2. INTERVAL_DAY_TIME 3. UNION 4. USER_DEFINED Spark doesn't support INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME, or UNION, and won't return the USER_DEFINED type. This PR overrides GetTypeInfoOperation with SparkGetTypeInfoOperation to exclude the types we don't need. In hive-1.2.1 the Type class is `org.apache.hive.service.cli.Type` In hive-2.3.x the Type class is `org.apache.hadoop.hive.serde2.thrift.Type` ThriftserverShimUtils is used to handle the version difference and exclude the types we don't need ### Why are the changes needed? We should return type info for Spark's own types ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual test & added UT Closes #25694 from AngersZhuuuu/SPARK-28982. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-10 09:22:50 -07:00
Dilip Biswal	7309e021ec	[SPARK-29028][DOCS] Add links to IBM Cloud Object Storage connector in cloud-integration.md ### What changes were proposed in this pull request? Add links to IBM Cloud Storage connector in cloud-integration.md ### Why are the changes needed? This page mentions the connectors to cloud providers. Currently the connector to IBM cloud storage is not listed. This PR adds the necessary links for completeness. ### Does this PR introduce any user-facing change? Yes. Before: <img width="1234" alt="Screen Shot 2019-09-09 at 3 52 44 PM" src="https://user-images.githubusercontent.com/14225158/64571863-11a2c080-d31a-11e9-82e3-78c02675adb9.png"> After: <img width="1234" alt="Screen Shot 2019-09-10 at 8 16 49 AM" src="https://user-images.githubusercontent.com/14225158/64626857-663e4e00-d3a3-11e9-8fa3-15ebf52ea832.png"> ### How was this patch tested? Tested using jekyll build --serve Closes #25737 from dilipbiswal/ibm-cloud-storage. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-10 11:19:55 -05:00
Terry Kim	bf43541c92	[SPARK-28856][SQL] Implement SHOW DATABASES for Data Source V2 Tables ### What changes were proposed in this pull request? Implement the SHOW DATABASES logical and physical plans for data source v2 tables. ### Why are the changes needed? To support `SHOW DATABASES` SQL commands for v2 tables. ### Does this PR introduce any user-facing change? `spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set: ``` +---------------+ | namespace| +---------------+ | ns1| | ns1.ns1_1| |ns1.ns1_1.ns1_2| +---------------+ ``` ### How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25601 from imback82/show_databases. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-10 21:23:57 +08:00
Marco Gaido	ca6f693ef1	[SPARK-28939][SQL][FOLLOWUP] Avoid useless Properties ### What changes were proposed in this pull request? Removes the useless `Properties` instance, following hvanhovell's suggestion. ### Why are the changes needed? Avoid useless code. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs Closes #25742 from mgaido91/SPARK-28939_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-10 20:47:55 +09:00
sychen	962e330955	[SPARK-26598][SQL] Fix HiveThriftServer2 cannot be modified hiveconf/hivevar variables ### What changes were proposed in this pull request? The --hiveconf/--hivevar parameters are only meant to supply initial values, so setting them once in ```SparkSQLSessionManager#openSession``` is sufficient; re-applying them in every ```SparkExecuteStatementOperation``` prevents the variables from being modified. ### Why are the changes needed? It is wrong to set the --hivevar/--hiveconf variable in every ```SparkExecuteStatementOperation```, which prevents variable updates. ### Does this PR introduce any user-facing change? ``` cat <<EOF > test.sql select '\${a}', '\${b}'; set b=bvalue_MOD_VALUE; set b; EOF beeline -u jdbc:hive2://localhost:10000 --hiveconf a=avalue --hivevar b=bvalue -f test.sql ``` Current result: ``` +-----------------+-----------------+--+ | avalue | bvalue | +-----------------+-----------------+--+ | avalue | bvalue | +-----------------+-----------------+--+ +-----------------+-----------------+--+ | key | value | +-----------------+-----------------+--+ | b | bvalue | +-----------------+-----------------+--+ 1 row selected (0.022 seconds) ``` After modification: ``` +-----------------+-----------------+--+ | avalue | bvalue | +-----------------+-----------------+--+ | avalue | bvalue | +-----------------+-----------------+--+ +-----------------+-----------------+--+ | key | value | +-----------------+-----------------+--+ | b | bvalue_MOD_VALUE| +-----------------+-----------------+--+ 1 row selected (0.022 seconds) ``` ### How was this patch tested? Modified the existing unit test Closes #25722 from cxzl25/fix_SPARK-26598. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-09 22:06:19 -07:00
Dongjoon Hyun	580c6266fb	[SPARK-28939][SQL][FOLLOWUP] Fix JDK11 compilation due to ambiguous reference ### What changes were proposed in this pull request? This PR aims to recover the JDK11 compilation with a workaround. For now, the master branch is broken like the following due to a [Scala bug](https://github.com/scala/bug/issues/10418) which is fixed in `2.13.0-RC2`. ``` [ERROR] [Error] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala:42: ambiguous reference to overloaded definition, both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit match argument types (java.util.Map[String,String]) ``` - https://github.com/apache/spark/actions (JDK11 build monitoring) ### Why are the changes needed? This workaround recovers JDK11 compilation. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual build with JDK11 because this is JDK11 compilation fix. - Jenkins builds with JDK8 and tests with JDK11. - GitHub action will verify this after merging. Closes #25738 from dongjoon-hyun/SPARK-28939. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-09 20:30:49 -07:00
Wenchen Fan	c2d8ee9c54	[SPARK-28878][SQL][FOLLOWUP] Remove extra project for DSv2 streaming scan ### What changes were proposed in this pull request? Remove the project node if the streaming scan is columnar ### Why are the changes needed? This is a followup of https://github.com/apache/spark/pull/25586. Batch and streaming share the same DS v2 read API so both can support columnar reads. We should apply #25586 to streaming scan as well. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #25727 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-10 11:01:57 +08:00
LantaoJin	86fc890d8c	[SPARK-28988][SQL][TESTS] Fix invalid tests in CliSuite ### What changes were proposed in this pull request? `1f056eb313/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala (L221)` is not strong enough. It succeeds even if the class is not found. `1f056eb313/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala (L305)` is also incorrect. Whatever the right-hand side value is, it always succeeds. ### Why are the changes needed? Unit tests should fail if the class is not found. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs Closes #25724 from LantaoJin/SPARK-28988. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-10 11:22:06 +09:00
Huaxin Gao	aa805eca54	[SPARK-23265][ML] Update multi-column error handling logic in QuantileDiscretizer ## What changes were proposed in this pull request? SPARK-22799 added more comprehensive error logic for Bucketizer. This PR updates QuantileDiscretizer to match the new error logic in Bucketizer. ## How was this patch tested? Added a new unit test. Closes #20442 from huaxingao/spark-23265. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-09-09 19:11:18 -07:00
gengjiaan	aafce7ebff	[SPARK-28412][SQL] ANSI SQL: OVERLAY function support byte array ## What changes were proposed in this pull request? This is an ANSI SQL feature with feature id `T312` ``` <binary overlay function> ::= OVERLAY <left paren> <binary value expression> PLACING <binary value expression> FROM <start position> [ FOR <string length> ] <right paren> ``` This PR is related to https://github.com/apache/spark/pull/24918 and adds support for byte arrays. ref: https://www.postgresql.org/docs/11/functions-binarystring.html ## How was this patch tested? New UT. Below are some results of this PR from my production environment. ``` spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6); Spark_SQL Time taken: 0.285 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7); Spark CORE Time taken: 0.202 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0); Spark ANSI SQL Time taken: 0.165 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4); Structured SQL Time taken: 0.141 s ``` Closes #25172 from beliefer/ansi-overlay-byte-array. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-10 08:16:18 +09:00
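The four `spark-sql` results quoted in SPARK-28412 can be reproduced with a small pure-Python model of the binary OVERLAY semantics. This is only a sketch, not Spark's implementation: the function name `overlay` and its signature are hypothetical, it assumes a 1-based start position of at least 1, and it assumes that omitting `FOR` replaces as many bytes as the replacement contains.

```python
def overlay(source: bytes, replacement: bytes, start: int, length: int = -1) -> bytes:
    """Model of OVERLAY(source PLACING replacement FROM start [FOR length])."""
    # Assumption: without FOR, the replaced span is as long as the replacement.
    if length < 0:
        length = len(replacement)
    head = source[: start - 1]           # bytes before the 1-based start position
    tail = source[start - 1 + length:]   # bytes after the replaced span
    return head + replacement + tail


# The four queries from the commit message, as (replacement, FROM, FOR) triples.
queries = [
    (b"_", 6, -1),
    (b"CORE", 7, -1),
    (b"ANSI ", 7, 0),
    (b"tructured", 2, 4),
]
for replacement, start, length in queries:
    print(overlay(b"Spark SQL", replacement, start, length).decode("utf-8"))
```

With `FOR 0` nothing is removed, so the replacement is inserted at the start position; with `FOR 4` four bytes are dropped regardless of the replacement's length.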
hongdd	bdc1598a43	[SPARK-28657][CORE] Fix currentContext Instance failed sometimes ## What changes were proposed in this pull request? Running Spark on YARN, I got ``` java.lang.ClassCastException: org.apache.hadoop.ipc.CallerContext$Builder cannot be cast to scala.runtime.Nothing$ ``` Utils.classForName returns Class[Nothing]; I think it should be defined as Class[_] to resolve this issue. ## How was this patch tested? Not needed Closes #25389 from hddong/SPARK-28657-fix-currentContext-Instance-failed. Lead-authored-by: hongdd <jn_hdd@163.com> Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-09 18:02:52 -05:00
Gabor Somogyi	e516f7e09e	[SPARK-28928][SS] Use Kafka delegation token protocol on sources/sinks ### What changes were proposed in this pull request? At the moment there are 3 places where the communication protocol with the Kafka cluster has to be set when a delegation token is used: * On delegation token * On source * On sink Most of the time users use the same protocol in all these places (within one Kafka cluster). It would be better to declare it in one place (delegation token side) so that Kafka sources/sinks can take this config over. In this PR I've modified the code so that Kafka sources/sinks take over the delegation token side `security.protocol` configuration when the token and the source/sink match in `bootstrap.servers` configuration. This default configuration can be overridden on each source/sink independently by using the `kafka.security.protocol` configuration. ### Why are the changes needed? The current default behavior covers only a minority of the use cases and is inconvenient. ### Does this PR introduce any user-facing change? Yes, with this change users need to provide fewer configuration parameters by default. ### How was this patch tested? Existing + additional unit tests. Closes #25631 from gaborgsomogyi/SPARK-28928. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-09-09 15:41:51 -07:00
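The precedence described in SPARK-28928 can be sketched as a small lookup. This is an illustration under stated assumptions, not Spark's code: the function `resolve_security_protocol` and the two plain dictionaries are hypothetical, and the "match" on `bootstrap.servers` is simplified to string equality.

```python
from typing import Mapping, Optional


def resolve_security_protocol(
    source_options: Mapping[str, str],
    token_conf: Optional[Mapping[str, str]],
) -> Optional[str]:
    """Pick the protocol a source/sink would use (hypothetical helper)."""
    # 1. An explicit per-source/sink setting always wins.
    explicit = source_options.get("kafka.security.protocol")
    if explicit is not None:
        return explicit
    # 2. Otherwise take over the delegation token side setting, but only when
    #    both sides point at the same cluster (assumption: exact string match).
    if token_conf is not None and (
        source_options.get("kafka.bootstrap.servers")
        == token_conf.get("bootstrap.servers")
    ):
        return token_conf.get("security.protocol")
    # 3. Nothing to take over; the caller falls back to its own default.
    return None


token = {"bootstrap.servers": "broker1:9093", "security.protocol": "SASL_SSL"}
print(resolve_security_protocol({"kafka.bootstrap.servers": "broker1:9093"}, token))
```

The point of the ordering is that users who configure the protocol once on the delegation token get it on every matching source and sink, while an individual source or sink can still opt out.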
Jungtaek Lim (HeartSaVioR)	8018ded217	[SPARK-28214][STREAMING][TESTS] CheckpointSuite: wait for batch to be fully processed before accessing DStreamCheckpointData ### What changes were proposed in this pull request? This patch fixes the bug regarding accessing `DStreamCheckpointData.currentCheckpointFiles` without guarding which makes the test `basic rdd checkpoints + dstream graph checkpoint recovery` being flaky. There're two possible points to make test failing: 1. checkpoint logic is too slow so that checkpoint cannot be handled within real delay 2. There's multithreads-unsafe point in `DStreamCheckpointData.update`: it clears `currentCheckpointFiles` and adds new checkpointFiles. Race condition can happen between main thread for test and JobGenerator's event loop thread. `lastProcessedBatch` guarantees that all events for given time are processed, as commented: `// last batch whose completion,checkpointing and metadata cleanup has been completed`. That means, if we wait for time for exactly same amount as advanced the time in test (multiply of checkpoint interval as well as batch duration) we can expect nothing will happen in DStreamCheckpointData afterwards unless we advance the clock. This patch applies the observation above. ### Why are the changes needed? The test is reported as flaky as [SPARK-28214](https://issues.apache.org/jira/browse/SPARK-28214), and the test code seems unsafe. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modified UT. I've added some debug messages and confirmed no method in DStreamCheckpointData is being called between "after waiting lastProcessedBatch" and "advancing clock" even I added huge amount of sleep between twos, which avoids race-condition. 
I was also able to make existing test artificially failing (not 100% consistently but high likely) via adding sleep between `currentCheckpointFiles.clear()` and `currentCheckpointFiles ++= checkpointFiles` in `DStreamCheckpointData.update`, and confirmed modified test doesn't fail the test multiple times. Closes #25731 from HeartSaVioR/SPARK-28214. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-09-09 15:36:36 -07:00
Huaxin Gao	125af78d32	[SPARK-28831][DOC][SQL] Document CLEAR CACHE statement in SQL Reference ### What changes were proposed in this pull request? Document CLEAR CACHE statement in SQL Reference ### Why are the changes needed? To complete SQL Reference ### Does this PR introduce any user-facing change? Yes After change: ![image](https://user-images.githubusercontent.com/13592258/64565512-caf89a80-d308-11e9-99ea-88e966d1b1a1.png) ### How was this patch tested? Tested using jekyll build --serve Closes #25541 from huaxingao/spark-28831-n. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-09 14:28:55 -07:00
Dilip Biswal	c839d09789	[SPARK-28773][DOC][SQL] Handling of NULL data in Spark SQL ### What changes were proposed in this pull request? Document ```NULL``` semantics in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on how `NULL` data is handled in various expressions and operators. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After. <img width="1234" alt="Screen Shot 2019-09-08 at 11 24 41 PM" src="https://user-images.githubusercontent.com/14225158/64507782-83362c80-d290-11e9-8295-70de412ea1f4.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 24 56 PM" src="https://user-images.githubusercontent.com/14225158/64507784-83362c80-d290-11e9-8f85-fbaf6116905f.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 25 08 PM" src="https://user-images.githubusercontent.com/14225158/64507785-83362c80-d290-11e9-9f9a-1dbafbc33bba.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 25 24 PM" src="https://user-images.githubusercontent.com/14225158/64507787-83362c80-d290-11e9-99b0-fcaa4a1f9a2d.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 25 34 PM" src="https://user-images.githubusercontent.com/14225158/64507789-83cec300-d290-11e9-94e7-feb8cf65d7ce.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 25 49 PM" src="https://user-images.githubusercontent.com/14225158/64507790-83cec300-d290-11e9-8c68-d745e7e9e4ca.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 26 00 PM" src="https://user-images.githubusercontent.com/14225158/64507791-83cec300-d290-11e9-9590-1e4c7ae28dac.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 26 09 PM" src="https://user-images.githubusercontent.com/14225158/64507792-83cec300-d290-11e9-885a-58752633ee71.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 26 20 PM" 
src="https://user-images.githubusercontent.com/14225158/64507793-83cec300-d290-11e9-8af8-9ef17034accb.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 26 32 PM" src="https://user-images.githubusercontent.com/14225158/64507794-83cec300-d290-11e9-874b-0d419cadbf75.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 26 47 PM" src="https://user-images.githubusercontent.com/14225158/64507795-84675980-d290-11e9-9ce6-870b46b060bc.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 26 59 PM" src="https://user-images.githubusercontent.com/14225158/64507796-84675980-d290-11e9-91cc-d6ffc5e3374d.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 27 10 PM" src="https://user-images.githubusercontent.com/14225158/64507797-84675980-d290-11e9-9d36-dcc6b1e75f38.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 27 18 PM" src="https://user-images.githubusercontent.com/14225158/64507798-84675980-d290-11e9-842c-8d57877b4389.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 27 27 PM" src="https://user-images.githubusercontent.com/14225158/64507799-84675980-d290-11e9-881d-16a24c6f5acd.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 27 37 PM" src="https://user-images.githubusercontent.com/14225158/64507801-84675980-d290-11e9-8f52-875a7a3c92c1.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 27 48 PM" src="https://user-images.githubusercontent.com/14225158/64507802-84675980-d290-11e9-9586-1d66fc07c069.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 27 59 PM" src="https://user-images.githubusercontent.com/14225158/64507804-84fff000-d290-11e9-8378-2d1a6cfa76d2.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 28 08 PM" src="https://user-images.githubusercontent.com/14225158/64507805-84fff000-d290-11e9-81ec-abeec2842922.png"> <img width="1234" alt="Screen Shot 2019-09-08 at 11 28 20 PM" src="https://user-images.githubusercontent.com/14225158/64507806-84fff000-d290-11e9-900f-1debb28f8f93.png"> ### How was this 
patch tested? Tested using jekyll build --serve Closes #25726 from dilipbiswal/sql-ref-null-data. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-09 13:41:17 -07:00
Sean Owen	6378d4bc06	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 ### What changes were proposed in this pull request? - Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods - Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport` - Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0 - Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0 - Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD - Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0 - Remove deprecated ChiSqSelector isSorted protected method - Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc Notes: - I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset. - Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was. - I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird. 
- I kept LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated. ### Why are the changes needed? Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old. ### Does this PR introduce any user-facing change? Yes, in that deprecated items are removed from some public APIs. ### How was this patch tested? Existing tests. Closes #25684 from srowen/SPARK-28980. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-09 10:19:40 -05:00
Marco Gaido	3d6b33a49a	[SPARK-28939][SQL] Propagate SQLConf for plans executed by toRdd ### What changes were proposed in this pull request? The PR proposes to create a custom `RDD` which propagates `SQLConf` also in cases not tracked by SQL execution, as happens when a `Dataset` is converted to an RDD using either `.rdd` or `.queryExecution.toRdd` and actions are then invoked on the returned RDD. In this way, SQL configs are effective in these cases too, while earlier they were ignored. ### Why are the changes needed? Without this patch, whenever `.rdd` or `.queryExecution.toRdd` is used, all the SQL configs set are ignored. An example of a reproducer: ``` withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") { val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*) df.createOrReplaceTempView("spark64kb") val data = spark.sql("select * from spark64kb limit 10") // Subexpression elimination is used here, even though it should have been disabled data.describe() } ``` ### Does this PR introduce any user-facing change? When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the execute. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy. ### How was this patch tested? Added UT Closes #25643 from mgaido91/SPARK-28939. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 21:20:34 +08:00
Wenchen Fan	abec6d7763	[SPARK-28341][SQL] create a public API for V2SessionCatalog ## What changes were proposed in this pull request? The `V2SessionCatalog` has 2 functionalities: 1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`. 2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog). To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage. This PR does 2 things: 1. refine the document of the config `spark.sql.catalog.session`. 2. add a public abstract class `CatalogExtension` for users to write implementations. TODOs for followup PRs: 1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one. 2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names. ## How was this patch tested? existing tests Closes #25104 from cloud-fan/session-catalog. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 21:14:37 +08:00
colinma	dadb72028a	[SPARK-28340][CORE] Noisy exceptions when tasks are killed: "DiskBloc… ### What changes were proposed in this pull request? If a Spark task is killed due to intentional job kills, automated killing of redundant speculative tasks, etc., a ClosedByInterruptException occurs if the task has an unfinished I/O operation on an AbstractInterruptibleChannel. A single cancelled task can result in hundreds of ClosedByInterruptException stack traces being logged. With this PR, the stack trace of ClosedByInterruptException is no longer logged, just as Executor.run does for InterruptedException. ### Why are the changes needed? Large numbers of spurious exceptions are confusing to users when they are inspecting Spark logs to diagnose other issues. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25674 from colinmjj/spark-28340. Authored-by: colinma <colinma@tencent.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-09 05:27:53 -05:00
Yuming Wang	4a3a6b66be	[SPARK-28637][SQL] Thriftserver support interval type ## What changes were proposed in this pull request? `bin/spark-shell` supports querying interval values: ```scala scala> spark.sql("SELECT interval 3 months 1 hours AS i").show(false) +-------------------------+ |i | +-------------------------+ |interval 3 months 1 hours| +-------------------------+ ``` But `sbin/start-thriftserver.sh` does not support querying interval values: ```sql 0: jdbc:hive2://localhost:10000/default> SELECT interval 3 months 1 hours AS i; Error: java.lang.IllegalArgumentException: Unrecognized type name: interval (state=,code=0) ``` This PR maps `CalendarIntervalType` to `StringType` for `TableSchema` so that Thriftserver can return interval values, because we do not support the `INTERVAL_YEAR_MONTH` and `INTERVAL_DAY_TIME` types: `02c33694c8/sql/hive-thriftserver/v1.2.1/src/main/java/org/apache/hive/service/cli/Type.java (L73-L78)` [SPARK-27791](https://issues.apache.org/jira/browse/SPARK-27791): Support SQL year-month INTERVAL type [SPARK-27793](https://issues.apache.org/jira/browse/SPARK-27793): Support SQL day-time INTERVAL type ## How was this patch tested? Unit tests Closes #25277 from wangyum/Thriftserver-support-interval-type. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-08 23:20:27 -07:00
turbofei	d4eca7c99d	[SPARK-29000][SQL] Decimal precision overflow when don't allow precision loss ### What changes were proposed in this pull request? When we set spark.sql.decimalOperations.allowPrecisionLoss to false, the result of the SQL below overflows and returns null. Case a: `select case when 1=2 then 1 else 1.000000000000000000000001 end * 1` The division operation is similar: the SQL below loses precision. Case b: `select case when 1=2 then 1 else 1.000000000000000000000001 end / 1` Let us check the code of TypeCoercion.scala. `a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L864-L875)`. For a binaryOperator, if the two operands have different data types, the rule ImplicitTypeCasts finds a common type and casts both operands to it. So, for the cases mentioned, the left operand is Decimal(34, 24) and the right operand is a Literal. Their common type is Decimal(34,24), and Literal(1) is cast to Decimal(34,24). Then both operands are decimal type and they are processed by the decimalAndDecimal method of the DecimalPrecision class. Let's check the relevant code. `a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (L123-L153)` When we don't allow precision loss, the result type of the multiply operation in case a is Decimal(38, 38), and that of the division operation in case b is Decimal(38, 20). Then the multiply operation in case a overflows and the division operation in case b loses precision. In this PR, we skip handling the binaryOperator if DecimalType operands are involved, so that the rule `DecimalPrecision` handles it. ### Why are the changes needed? Data will be corrupted without this change. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #25701 from turboFei/SPARK-29000. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 13:50:17 +08:00
Marco Gaido	c411579355	[SPARK-28916][SQL] Split subexpression elimination functions code for Generate[Mutable|Unsafe]Projection ### What changes were proposed in this pull request? The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable|Unsafe]Projection`. ### Why are the changes needed? Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? added UT Closes #25642 from mgaido91/SPARK-28916. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 13:30:56 +08:00
Holden Karau	0ed9fae457	[SPARK-28886][K8S] Fix the DepsTestsSuite with minikube 1.3.1 ### What changes were proposed in this pull request? Matches the response from minikube service against a regex to extract the URL ### Why are the changes needed? minikube 1.3.1 on OSX has different formatting than expected ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran the existing integration test run on OSX with minikube 1.3.1 Closes #25599 from holdenk/SPARK-28886-fix-deps-tests-with-minikube-1.3.1. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-08 20:04:16 -05:00
Kengo Seki	1f056eb313	[SPARK-27420][DSTREAMS][KINESIS] KinesisInputDStream should expose a way to configure CloudWatch metrics ## What changes were proposed in this pull request? KinesisInputDStream currently does not provide a way to disable CloudWatch metrics push. Its default level is "DETAILED" which pushes 10s of metrics every 10 seconds. When dealing with multiple streaming jobs this adds up pretty quickly, leading to thousands of dollars in cost. To address this problem, this PR adds interfaces for accessing KinesisClientLibConfiguration's `withMetrics` and `withMetricsEnabledDimensions` methods to KinesisInputDStream so that users can configure KCL's metrics levels and dimensions. ## How was this patch tested? By running updated unit tests in KinesisInputDStreamBuilderSuite. In addition, I ran a Streaming job with MetricsLevel.NONE and confirmed: * there's no data point for the "Operation", "Operation, ShardId" and "WorkerIdentifier" dimensions on the AWS management console * there's no DEBUG level message from Amazon KCL, such as "Successfully published xx datums." Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24651 from sekikn/SPARK-27420. Authored-by: Kengo Seki <sekikn@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-08 19:48:53 -05:00
shivusondur	cb488ecf41	[SPARK-28942][WEBUI] Spark in local mode hostname display localhost in the Host Column of Task Summary Page ### What changes were proposed in this pull request? In spark-shell local mode, the task page shows the host name as localhost. This PR changes it to show the machine IP, as shown for "spark.driver.host" in the environment page. ### Why are the changes needed? To show the proper IP in the task page host column ### Does this PR introduce any user-facing change? It updates the SPARK UI->Task page->Host Column ### How was this patch tested? Verified in the Spark UI ![image](https://user-images.githubusercontent.com/7912929/64079045-253d9e00-cd00-11e9-8092-26caec4e21dc.png) Closes #25645 from shivusondur/localhost1. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-08 19:45:19 -05:00
Yuming Wang	a75467432e	[SPARK-28000][SQL][TEST] Port comments.sql ## What changes were proposed in this pull request? This PR is to port comments.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/comments.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/expected/comments.out When porting the test cases, found one PostgreSQL specific features that do not exist in Spark SQL: [SPARK-28880](https://issues.apache.org/jira/browse/SPARK-28880): ANSI SQL: Bracketed comments ## How was this patch tested? N/A Closes #25588 from wangyum/SPARK-28000. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-08 10:32:08 +09:00
avk	723faadf80	[SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles() ### What changes were proposed in this pull request? This change fixes issue SPARK-28912. ### Why are the changes needed? If checkpoint directory is set to name which matches regex pattern used for checkpoint files then logs are flooded with MatchError exceptions and old checkpoint files are not removed. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually. 1. Start Hadoop in a pseudo-distributed mode. 2. In another terminal run command nc -lk 9999 3. In the Spark shell execute the following statements: ```scala val ssc = new StreamingContext(sc, Seconds(30)) ssc.checkpoint("hdfs://localhost:9000/checkpoint-01") val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() ``` Closes #25654 from avkgh/SPARK-28912. Authored-by: avk <nullp7r@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-06 17:55:09 -07:00
Nicholas Marion	6fb5ef108e	[SPARK-29011][BUILD] Update netty-all from 4.1.30-Final to 4.1.39-Final ### What changes were proposed in this pull request? Upgrade netty-all to latest in the 4.1.x line which is 4.1.39-Final. ### Why are the changes needed? Currency of dependencies. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit-tests against master branch. Closes #25712 from n-marion/master. Authored-by: Nicholas Marion <nmarion@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-06 17:48:53 -07:00
Liang-Chi Hsieh	89aba69378	[SPARK-28935][SQL][DOCS] Document SQL metrics for Details for Query Plan ### What changes were proposed in this pull request? This patch adds the description of common SQL metrics in web ui document. ### Why are the changes needed? The current web ui document describes query plan but does not describe the meaning SQL metrics. For end users, they might not understand the meaning of the metrics. ### Does this PR introduce any user-facing change? No. This is just documentation change. ### How was this patch tested? Built the docs locally. ![image](https://user-images.githubusercontent.com/11567269/64463485-1583d800-d0b9-11e9-9916-141f5c09f009.png) Closes #25658 from viirya/SPARK-28935. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-06 15:56:50 -07:00
Takeshi Yamamuro	ff5fa5873e	[SPARK-21870][SQL][FOLLOW-UP] Clean up string template formats for generated code in HashAggregateExec ### What changes were proposed in this pull request? This pr cleans up string template formats for generated code in HashAggregateExec. This changes comes from rednaxelafx comment: https://github.com/apache/spark/pull/20965#discussion_r316418729 ### Why are the changes needed? To improve code-readability. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25714 from maropu/SPARK-21870-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-07 07:16:36 +09:00
maryannxue	b2f06608b7	[SPARK-29002][SQL] Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions ### What changes were proposed in this pull request? This PR aims to avoid AQE regressions by avoiding changing a sort merge join to a broadcast hash join when the expected build plan has a high ratio of empty partitions, in which case sort merge join can actually perform faster. This PR achieves this by adding an internal join hint in order to let the planner know which side has this high ratio of empty partitions and it should avoid planning it as a build plan of a BHJ. Still, it won't affect the other side if the other side qualifies for a build plan of a BHJ. ### Why are the changes needed? It is a performance improvement for AQE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #25703 from maryannxue/aqe-demote-bhj. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-06 12:46:54 -07:00
Maxim Gekk	67b4329fb0	[SPARK-28690][SQL] Add `date_part` function for timestamps/dates ## What changes were proposed in this pull request? In the PR, I propose new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`: ``` date_part('field', source) ``` and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT). The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are: - `millennium` - the current millennium for given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_. - `century` - the current century for given date (or timestamp). The first century starts at 0001-01-01 AD. - `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10. - `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January. - `year`, `month`, `day`, `hour`, `minute`, `second` - `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year. - `quarter` - the quarter of the year (1 - 4) - `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday) - `dow` - the day of the week as Sunday (0) to Saturday (6) - `isodow` - the day of the week as Monday (1) to Sunday (7) - `doy` - the day of the year (1 - 365/366) - `milliseconds` - the seconds field including fractional parts multiplied by 1,000. - `microseconds` - the seconds field including fractional parts multiplied by 1,000,000. - `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision. 
Here are examples: ```sql spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456'); 2019 spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456'); 33 spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456'); 224 ``` I changed implementation of `extract` to re-use `date_part()` internally. ## How was this patch tested? Added `date_part.sql` and regenerated results of `extract.sql`. Closes #25410 from MaxGekk/date_part. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-06 23:36:00 +09:00
Jungtaek Lim (HeartSaVioR)	905b7f7fc7	[SPARK-28967][CORE] Include cloned version of "properties" to avoid ConcurrentModificationException ### What changes were proposed in this pull request? This patch fixes the bug which throws ConcurrentModificationException when job with 0 partition is submitted via DAGScheduler. ### Why are the changes needed? Without this patch, structured streaming query throws ConcurrentModificationException, like below stack trace: ``` 19/09/04 09:48:49 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.util.ConcurrentModificationException at java.util.Hashtable$Enumerator.next(Hashtable.java:1387) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:237) at scala.collection.TraversableLike.map$(TraversableLike.scala:230) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:514) at org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:520) at scala.Option.map(Option.scala:163) at org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:519) at org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:155) at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:149) at 
org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:217) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:99) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:84) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:102) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:102) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:97) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:93) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:93) ``` Please refer https://issues.apache.org/jira/browse/SPARK-28967 for detailed reproducer. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Newly added UT. Also manually tested via running simple structured streaming query in spark-shell. Closes #25672 from HeartSaVioR/SPARK-28967. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-06 09:06:39 -05:00
zhengruifeng	4664a082c2	[SPARK-28968][ML] Add HasNumFeatures in the scala side ### What changes were proposed in this pull request? Add HasNumFeatures in the scala side, with `1<<18` as the default value ### Why are the changes needed? HasNumFeatures is already added in the py side, it is reasonable to keep them in sync. I don't find other similar place. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing testsuites Closes #25671 from zhengruifeng/add_HasNumFeatures. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-09-06 11:50:45 +08:00
Takeshi Yamamuro	cb0cddffe9	[SPARK-21870][SQL] Split aggregation code into small functions ## What changes were proposed in this pull request? This pr proposed to split aggregation code into small functions in `HashAggregateExec`. In #18810, we got performance regression if JVMs didn't compile too long functions. I checked and I found the codegen of `HashAggregateExec` frequently goes over the limit when a query has too many aggregate functions (e.g., q66 in TPCDS). The current master places all the generated aggregation code in a single function. In this pr, I modified the code to assign an individual function for each aggregate function (e.g., `SUM` and `AVG`). For example, in a query `SELECT SUM(a), AVG(a) FROM VALUES(1) t(a)`, the proposed code defines two functions for `SUM(a)` and `AVG(a)` as follows; - generated code with this pr (https://gist.github.com/maropu/812990012bc967a78364be0fa793f559): ``` /* 173 */ private void agg_doConsume_0(InternalRow inputadapter_row_0, long agg_expr_0_0, boolean agg_exprIsNull_0_0, double agg_expr_1_0, boolean agg_exprIsNull_1_0, long agg_expr_2_0, boolean agg_exprIsNull_2_0) throws java.io.IOException { /* 174 */ // do aggregate /* 175 */ // common sub-expressions /* 176 */ /* 177 */ // evaluate aggregate functions and update aggregation buffers /* 178 */ agg_doAggregate_sum_0(agg_exprIsNull_0_0, agg_expr_0_0); /* 179 */ agg_doAggregate_avg_0(agg_expr_1_0, agg_exprIsNull_1_0, agg_exprIsNull_2_0, agg_expr_2_0); /* 180 */ /* 181 */ } ... /* 071 */ private void agg_doAggregate_avg_0(double agg_expr_1_0, boolean agg_exprIsNull_1_0, boolean agg_exprIsNull_2_0, long agg_expr_2_0) throws java.io.IOException { /* 072 */ // do aggregate for avg /* 073 */ // evaluate aggregate function /* 074 */ boolean agg_isNull_19 = true; /* 075 */ double agg_value_19 = -1.0; ... 
/* 114 */ private void agg_doAggregate_sum_0(boolean agg_exprIsNull_0_0, long agg_expr_0_0) throws java.io.IOException { /* 115 */ // do aggregate for sum /* 116 */ // evaluate aggregate function /* 117 */ agg_agg_isNull_11_0 = true; /* 118 */ long agg_value_11 = -1L; ``` - generated code in the current master (https://gist.github.com/maropu/e9d772af2c98d8991a6a5f0af7841760) ``` /* 059 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0) throws java.io.IOException { /* 060 */ // do aggregate /* 061 */ // common sub-expressions /* 062 */ boolean agg_isNull_4 = false; /* 063 */ long agg_value_4 = -1L; /* 064 */ if (!false) { /* 065 */ agg_value_4 = (long) agg_expr_0_0; /* 066 */ } /* 067 */ // evaluate aggregate function /* 068 */ agg_agg_isNull_7_0 = true; /* 069 */ long agg_value_7 = -1L; /* 070 */ do { /* 071 */ if (!agg_bufIsNull_0) { /* 072 */ agg_agg_isNull_7_0 = false; /* 073 */ agg_value_7 = agg_bufValue_0; /* 074 */ continue; /* 075 */ } /* 076 */ /* 077 */ boolean agg_isNull_9 = false; /* 078 */ long agg_value_9 = -1L; /* 079 */ if (!false) { /* 080 */ agg_value_9 = (long) 0; /* 081 */ } /* 082 */ if (!agg_isNull_9) { /* 083 */ agg_agg_isNull_7_0 = false; /* 084 */ agg_value_7 = agg_value_9; /* 085 */ continue; /* 086 */ } /* 087 */ /* 088 */ } while (false); /* 089 */ /* 090 */ long agg_value_6 = -1L; /* 091 */ /* 092 */ agg_value_6 = agg_value_7 + agg_value_4; /* 093 */ boolean agg_isNull_11 = true; /* 094 */ double agg_value_11 = -1.0; /* 095 */ /* 096 */ if (!agg_bufIsNull_1) { /* 097 */ agg_agg_isNull_13_0 = true; /* 098 */ double agg_value_13 = -1.0; /* 099 */ do { /* 100 */ boolean agg_isNull_14 = agg_isNull_4; /* 101 */ double agg_value_14 = -1.0; /* 102 */ if (!agg_isNull_4) { /* 103 */ agg_value_14 = (double) agg_value_4; /* 104 */ } /* 105 */ if (!agg_isNull_14) { /* 106 */ agg_agg_isNull_13_0 = false; /* 107 */ agg_value_13 = agg_value_14; /* 108 */ continue; /* 109 */ } /* 110 */ /* 111 */ boolean agg_isNull_15 = false; /* 112 */ double agg_value_15 = -1.0; /* 113 */ if (!false) { /* 114 */ agg_value_15 = (double) 0; /* 115 */ } 
/* 116 */ if (!agg_isNull_15) { /* 117 */ agg_agg_isNull_13_0 = false; /* 118 */ agg_value_13 = agg_value_15; /* 119 */ continue; /* 120 */ } /* 121 */ /* 122 */ } while (false); /* 123 */ /* 124 */ agg_isNull_11 = false; // resultCode could change nullability. /* 125 */ /* 126 */ agg_value_11 = agg_bufValue_1 + agg_value_13; /* 127 */ /* 128 */ } /* 129 */ boolean agg_isNull_17 = false; /* 130 */ long agg_value_17 = -1L; /* 131 */ if (!false && agg_isNull_4) { /* 132 */ agg_isNull_17 = agg_bufIsNull_2; /* 133 */ agg_value_17 = agg_bufValue_2; /* 134 */ } else { /* 135 */ boolean agg_isNull_20 = true; /* 136 */ long agg_value_20 = -1L; /* 137 */ /* 138 */ if (!agg_bufIsNull_2) { /* 139 */ agg_isNull_20 = false; // resultCode could change nullability. /* 140 */ /* 141 */ agg_value_20 = agg_bufValue_2 + 1L; /* 142 */ /* 143 */ } /* 144 */ agg_isNull_17 = agg_isNull_20; /* 145 */ agg_value_17 = agg_value_20; /* 146 */ } /* 147 */ // update aggregation buffer /* 148 */ agg_bufIsNull_0 = false; /* 149 */ agg_bufValue_0 = agg_value_6; /* 150 */ /* 151 */ agg_bufIsNull_1 = agg_isNull_11; /* 152 */ agg_bufValue_1 = agg_value_11; /* 153 */ /* 154 */ agg_bufIsNull_2 = agg_isNull_17; /* 155 */ agg_bufValue_2 = agg_value_17; /* 156 */ /* 157 */ } ``` You can check the previous discussion in https://github.com/apache/spark/pull/19082 ## How was this patch tested? Existing tests Closes #20965 from maropu/SPARK-21870-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-06 11:45:14 +08:00
Kevin Yu	36f8e53cfa	[SPARK-28802][DOC][SQL] Document DESCRIBE DATABASE statement in SQL Reference ### What changes were proposed in this pull request? Document DESCRIBE DATABASE statement in SQL Reference ### Why are the changes needed? To complete the SQL Reference ### Does this PR introduce any user-facing change? Yes #### Before There is no documentation for this command in sql reference #### After ![Screen Shot 2019-09-05 at 12 59 32 PM](https://user-images.githubusercontent.com/7550280/64379235-53aec800-cfe3-11e9-8a51-ea55f0455c47.png) ![Screen Shot 2019-09-05 at 12 59 45 PM](https://user-images.githubusercontent.com/7550280/64379247-58737c00-cfe3-11e9-9a51-f12c5c5bc26a.png) ### How was this patch tested? Used jekyll build and serve to verify Closes #25528 from kevinyu98/sql-ref-describe. Lead-authored-by: Kevin Yu <qyu@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-05 16:23:08 -07:00
Mukul Murthy	3929d16604	[SPARK-26046][SS] Add StreamingQueryManager.listListeners() ### What changes were proposed in this pull request? Add a listListeners() method to StreamingQueryManager that lists all StreamingQueryListeners that have been added to that manager. ### Why are the changes needed? While it's best practice to keep handles on all listeners added, it's still nice to have an API to be able to list what listeners have been added to a StreamingQueryManager. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modified existing unit tests to use the new API instead of using reflection. Closes #25518 from mukulmurthy/26046-listener. Authored-by: Mukul Murthy <mukul.murthy@gmail.com> Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>	2019-09-05 14:27:54 -07:00
Wing Yew Poon	151b954e52	[SPARK-28770][CORE][TEST] Fix ReplayListenerSuite tests that sometimes fail ### What changes were proposed in this pull request? `ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged. We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus. ### Why are the changes needed? As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #25673 from wypoon/SPARK-28770. Authored-by: Wing Yew Poon <wypoon@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-09-05 15:55:22 -05:00
Bogdan Ghit	0647906f12	[SPARK-28910][SQL] Prevent schema verification when connecting to in memory derby ## What changes were proposed in this pull request? This PR disables schema verification and allows schema auto-creation in the Derby database, in case the config for the Metastore is set otherwise. ## How was this patch tested? NA Closes #25663 from bogdanghit/hive-schema. Authored-by: Bogdan Ghit <bogdan.ghit@databricks.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-05 07:06:19 -07:00
Wenchen Fan	c81fd0cd61	[SPARK-28974][SQL] centralize the Data Source V2 table capability checks ### What changes were proposed in this pull request? merge the `V2WriteSupportCheck` and `V2StreamingScanSupportCheck` to one rule: `TableCapabilityCheck`. ### Why are the changes needed? It's a little confusing to have 2 rules to check DS v2 table capability, while one rule says it checks write and another rule says it checks streaming scan. We can clearly tell it from the rule names that the batch scan check is missing. It's better to have a centralized place for this check, with a name that clearly says it checks table capability. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25679 from cloud-fan/dsv2-check. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 20:22:29 +08:00
HyukjinKwon	103d50b3f6	[SPARK-28272][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base ### What changes were proposed in this pull request? This PR proposes to port `pgSQL/aggregates_part3.sql` into UDF test base. <details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary> <p> ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out index f102383cb4d..eff33f280cf 100644 --- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out @@ -3,7 +3,7 @@ -- !query 0 -select max(min(unique1)) from tenk1 +select udf(max(min(unique1))) from tenk1 -- !query 0 schema struct<> -- !query 0 output @@ -12,11 +12,11 @@ It is not allowed to use an aggregate function in the argument of another aggreg -- !query 1 -select (select count(*) - from (values (1)) t0(inner_c)) +select udf((select udf(count(*)) + from (values (1)) t0(inner_c))) as col from (values (2),(3)) t1(outer_c) -- !query 1 schema -struct<scalarsubquery():bigint> +struct<col:bigint> -- !query 1 output 1 1 ``` </p> </details> ### Why are the changes needed? To improve test coverage in UDFs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested via: ```bash build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part3.sql" ``` as guided in https://issues.apache.org/jira/browse/SPARK-27921 Closes #25676 from HyukjinKwon/SPARK-28272. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-05 18:35:21 +09:00
HyukjinKwon	be04c97262	[SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base ### What changes were proposed in this pull request? This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base. <details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary> <p> ```diff ``` </p> </details> ### Why are the changes needed? To improve test coverage in UDFs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested via: ```bash build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql" ``` as guided in https://issues.apache.org/jira/browse/SPARK-27921 Closes #25677 from HyukjinKwon/SPARK-28971. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-05 18:34:44 +09:00
Sean Owen	36559b6525	[SPARK-28977][DOCS][SQL] Fix DataFrameReader.jdbc docs to doc that partition column can be numeric, date or timestamp type ### What changes were proposed in this pull request? `DataFrameReader.jdbc()` accepts a partition column that is of numeric, date or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too. ### Why are the changes needed? scaladoc is incorrect. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #25687 from srowen/SPARK-28977. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-05 18:32:45 +09:00
WeichenXu	f8bc91f749	[SPARK-28782][SQL] Generator support in aggregate expressions ### What changes were proposed in this pull request? Support generator in aggregate expressions. In this PR, I check the aggregate logical plan, if its aggregateExpressions include generator, then convert this agg plan into "normal agg plan + generator plan + projection plan". I.e: ``` aggregate(with generator) |--child_plan ``` ===> ``` project |--generator(resolved) |--aggregate |--child_plan ``` ### Why are the changes needed? We should support sql like: ``` select explode(array(min(a), max(a))) from t ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test added. Closes #25512 from WeichenXu123/explode_bug. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 16:17:49 +08:00
Ryan Blue	dde393142f	[SPARK-28878][SQL] Remove extra project for DSv2 reads with columnar batches ### What changes were proposed in this pull request? Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution. ### Why are the changes needed? Removes an extra projection and copy. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 15:38:46 +08:00
Burak Yavuz	b9edd44bd6	[SPARK-28964] Add the provider information to the table properties in saveAsTable ### What changes were proposed in this pull request? Adds the provider information to the table properties in saveAsTable. ### Why are the changes needed? Otherwise, catalog implementations don't know what kind of Table definition to create. ### Does this PR introduce any user-facing change? nope ### How was this patch tested? Existing unit tests check the existence of the provider now. Closes #25669 from brkyvz/provider. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 14:33:35 +08:00

25538 commits