ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Wenchen Fan	dbcb4792f2	[SPARK-27161][SQL] improve the document of SQL keywords ## What changes were proposed in this pull request? Make it more clear about how Spark categories keywords regarding to the config `spark.sql.parser.ansi.enabled` ## How was this patch tested? existing tests Closes #24093 from cloud-fan/parser. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-18 15:19:52 +09:00
André Sá de Mello	f9180f8752	[SPARK-26979][PYTHON] Add missing string column name support for some SQL functions ## What changes were proposed in this pull request? Most SQL functions defined in `spark.sql.functions` have two calling patterns, one with a Column object as input, and another with a string representing a column name, which is then converted into a Column object internally. There are, however, a few notable exceptions: - lower() - upper() - abs() - bitwiseNOT() - ltrim() - rtrim() - trim() - ascii() - base64() - unbase64() While this doesn't break anything, as you can easily create a Column object yourself prior to passing it to one of these functions, it has two undesirable consequences: 1. It is surprising - it breaks coder's expectations when they are first starting with Spark. Every API should be as consistent as possible, so as to make the learning curve smoother and to reduce causes for human error; 2. It gets in the way of stylistic conventions. Most of the time it makes Python code more readable to use literal names, and the API provides ample support for that, but these few exceptions prevent this pattern from being universally applicable. This patch is meant to fix the aforementioned problem. ### Effect This patch enables support for passing column names as input to those functions mentioned above. ### Side effects This PR also fixes an issue with some functions being defined multiple times by using `_create_function()`. ### How it works `_create_function()` was redefined to always convert the argument to a Column object. The old implementation has been kept under `_create_name_function()`, and is still being used to generate the following special functions: - lit() - col() - column() - asc() - desc() - asc_nulls_first() - asc_nulls_last() - desc_nulls_first() - desc_nulls_last() This is because these functions can only take a column name as their argument. This is not a problem, as their semantics require so. ## How was this patch tested? Ran ./dev/run-tests and tested it manually. Closes #23882 from asmello/col-name-support-pyspark. Authored-by: André Sá de Mello <amello@palantir.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 12:58:16 -05:00
Ajith	fc88d3df5c	[SPARK-27164][CORE] RDD.countApprox on empty RDDs schedules jobs which never complete ## What changes were proposed in this pull request? When Result stage has zero tasks, the Job End event is never fired, hence the Job is always running in UI. Example: sc.emptyRDD[Int].countApprox(1000) never finishes even it has no tasks to launch ## How was this patch tested? Added UT Closes #24100 from ajithme/emptyRDD. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 12:56:41 -05:00
fitermay	1bc481b779	[SPARK-27070] Improve performance of DefaultPartitionCoalescer This time tested against Scala 2.11 as well Closes #24116 from fitermay/master. Authored-by: fitermay <fiterman@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 11:47:14 -05:00
Ajith	c324e1da9d	[SPARK-27122][CORE] Jetty classes must not be return via getters in org.apache.spark.ui.WebUI ## What changes were proposed in this pull request? When we run YarnSchedulerBackendSuite, the class path seems to be made from the classes folder(resource-managers/yarn/target/scala-2.12/classes) instead of jar (resource-managers/yarn/target/spark-yarn_2.12-3.0.0-SNAPSHOT.jar) . ui.getHandlers is in spark-core and its loaded from spark-core.jar which is shaded and hence refers to org.spark_project.jetty.servlet.ServletContextHandler Here in org.apache.spark.scheduler.cluster.YarnSchedulerBackend, as its not shaded, it expects org.eclipse.jetty.servlet.ServletContextHandler Refer discussion https://issues.apache.org/jira/browse/SPARK-27122?focusedCommentId=16792318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16792318 Hence as a fix, org.apache.spark.ui.WebUI must only return a wrapper class instance or references so that Jetty classes can be avoided in getters which are accessed outside spark-core ## How was this patch tested? Existing UT can pass Closes #24088 from ajithme/shadebug. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 06:44:02 -05:00
lichaoqun	4132c989db	[MINOR][CORE] spark.diskStore.subDirectories <= 0 should throw Exception ## What changes were proposed in this pull request? this pr add check this spark.diskStore.subDirectories > 0.This value need to be checked before it can be used. ## How was this patch tested? N/A Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24024 from lcqzte10192193/wid-lcq-190308. Authored-by: lichaoqun <li.chaoqun@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 06:43:14 -05:00
hehuiyuan	d6a3cbea5d	[MINOR][DOC] Add "completedStages" metircs for namespace=appStatus ## What changes were proposed in this pull request? Add completedStages metircs for namespace=appStatus for monitoring.md. Closes #24109 from hehuiyuan/hehuiyuan-patch-5. Authored-by: hehuiyuan <hehuiyuan@ZBMAC-C02WD3K5H.local> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 06:42:00 -05:00
Liupengcheng	cad475dcc9	[SPARK-26941][YARN] Fix incorrect computation of maxNumExecutorFailures in ApplicationMaster for streaming ## What changes were proposed in this pull request? Currently, when enabled streaming dynamic allocation for streaming applications, the maxNumExecutorFailures in ApplicationMaster is still computed with `spark.dynamicAllocation.maxExecutors`. Actually, we should consider `spark.streaming.dynamicAllocation.maxExecutors` instead. Related codes: `f87153a3ac/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (L101)` ## How was this patch tested? NA Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #23845 from liupc/Fix-incorrect-maxNumExecutorFailures-for-streaming. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-16 19:45:05 -05:00
Yuming Wang	9c0af746e5	[SPARK-27175][BUILD] Upgrade hadoop-3 to 3.2.0 ## What changes were proposed in this pull request? This PR upgrade `hadoop-3` to `3.2.0` to workaround [HADOOP-16086](https://issues.apache.org/jira/browse/HADOOP-16086). Otherwise some test case will throw IllegalArgumentException: ```java 02:44:34.707 ERROR org.apache.hadoop.hive.ql.exec.Task: Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)' java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:116) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:109) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:102) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475) at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454) at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:369) at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:730) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:719) at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:709) at org.apache.spark.sql.hive.StatisticsSuite.createNonPartitionedTable(StatisticsSuite.scala:719) at org.apache.spark.sql.hive.StatisticsSuite.$anonfun$testAlterTableProperties$2(StatisticsSuite.scala:822) ``` ## How was this patch tested? manual tests Closes #24106 from wangyum/SPARK-27175. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-16 19:42:05 -05:00
Jungtaek Lim (HeartSaVioR)	4adbcdc424	[SPARK-22000][SQL][FOLLOW-UP] Fix bad test to ensure it can test properly ## What changes were proposed in this pull request? There was some mistake on test code: it has wrong assertion. The patch proposes fixing it, as well as fixing other stuff to make test really pass. ## How was this patch tested? Fixed unit test. Closes #24112 from HeartSaVioR/SPARK-22000-hotfix. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-17 08:25:40 +09:00
Lantao Jin	6a6075ac96	[SPARK-27157][DOCS] Add Executor level metrics to monitoring docs ## What changes were proposed in this pull request? A sub-task of [SPARK-23206](https://issues.apache.org/jira/browse/SPARK-23206) Add Executor level metrics to monitoring docs ## How was this patch tested? jekyll Closes #24090 from LantaoJin/SPARK-27157. Authored-by: Lantao Jin <jinlantao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-16 14:52:19 -05:00
Dilip Biswal	aea9a574c4	[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array ## What changes were proposed in this pull request? Correct the logic to compute the distinct. Below is a small repro snippet. ``` scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col") df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>] scala> val distinctDF = df.select(array_distinct(col("array_col"))) distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>] scala> df.show(false) +----------------------------------------+ \|array_col \| +----------------------------------------+ \|[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]\| +----------------------------------------+ ``` Error ``` scala> distinctDF.show(false) +-------------------------+ \|array_distinct(array_col)\| +-------------------------+ \|[[1, 2], [1, 2], [1, 2]] \| +-------------------------+ ``` Expected result ``` scala> distinctDF.show(false) +-------------------------+ \|array_distinct(array_col)\| +-------------------------+ \|[[1, 2], [3, 4], [4, 5]] \| +-------------------------+ ``` ## How was this patch tested? Added an additional test. Closes #24073 from dilipbiswal/SPARK-27134. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-16 14:30:42 -05:00
Dilip Biswal	7a136f8670	[SPARK-27096][SQL][FOLLOWUP] Do the correct validation of join types in R side and fix join docs for scala, python and r ## What changes were proposed in this pull request? This is a minor follow-up PR for SPARK-27096. The original PR reconciled the join types supported between dataset and sql interface. In case of R, we do the join type validation in the R side. In this PR we do the correct validation and adds tests in R to test all the join types along with the error condition. Along with this, i made the necessary doc correction. ## How was this patch tested? Add R tests. Closes #24087 from dilipbiswal/joinfix_followup. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-16 13:04:54 +09:00
SongYadong	ec11790580	[CORE][MINOR] Correct the comment to show heartbeat interval is configurable ## What changes were proposed in this pull request? Executor heartbeat interval is configurable by `"spark.executor.heartbeatInterval"`. But in a comment, heartbeat interval is presented as a constant `10s`. This pr tries to correct the description. ## How was this patch tested? Existing unit tests. Closes #24101 from SongYadong/heartbeat_interval_comment. Authored-by: SongYadong <song.yadong1@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-15 20:30:36 -05:00
Zhu, Lipeng	8ee09f26d5	[SPARK-27159][SQL] update mssql server dialect to support binary type ## What changes were proposed in this pull request? Change the binary type mapping from default blob to varbinary(max) for mssql server. https://docs.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql?view=sql-server-2017 ![image](https://user-images.githubusercontent.com/698621/54351715-0e8c8780-468b-11e9-8931-7ecb85c5ad6b.png) ## How was this patch tested? Unit test. Closes #24091 from lipzhu/SPARK-27159. Authored-by: Zhu, Lipeng <lipzhu@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-15 20:21:59 -05:00
Dongjoon Hyun	4bab69b22a	Revert "[SPARK-27070] Fix performance bug in DefaultPartitionCoalescer" This reverts commit `21db4336b0`.	2019-03-15 14:56:08 -07:00
Gengliang Wang	2a37d6ed93	[SPARK-27132][SQL] Improve file source V2 framework ## What changes were proposed in this pull request? During the migration of CSV V2(https://github.com/apache/spark/pull/24005), I find that we can improve the file source v2 framework by: 1. check duplicated column names in both read and write 2. Not all the file sources support filter push down. So remove `SupportsPushDownFilters` from FileScanBuilder 3. The method `isSplitable` might require data source options. Add a new member `options` to FileScan. 4. Make `FileTable.schema` a lazy value instead of a method. ## How was this patch tested? Unit test Closes #24066 from gengliangwang/reviseFileSourceV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-15 11:58:03 +08:00
Dongjoon Hyun	74d2f04183	[SPARK-27166][SQL] Improve `printSchema` to print up to the given level ## What changes were proposed in this pull request? This PR aims to improve `printSchema` to be able to print up to the given level of the schema. ```scala scala> val df = Seq((1,(2,(3,4)))).toDF df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<_1: int, _2: struct<_1: int, _2: int>>] scala> df.printSchema root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) \| \|-- _1: integer (nullable = false) \| \|-- _2: struct (nullable = true) \| \| \|-- _1: integer (nullable = false) \| \| \|-- _2: integer (nullable = false) scala> df.printSchema(1) root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) scala> df.printSchema(2) root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) \| \|-- _1: integer (nullable = false) \| \|-- _2: struct (nullable = true) scala> df.printSchema(3) root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) \| \|-- _1: integer (nullable = false) \| \|-- _2: struct (nullable = true) \| \| \|-- _1: integer (nullable = false) \| \| \|-- _2: integer (nullable = false) ``` ## How was this patch tested? Pass the Jenkins with the newly added test case. Closes #24098 from dongjoon-hyun/SPARK-27166. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-14 20:27:55 -07:00
Dongjoon Hyun	f26a1f3d37	[SPARK-27165][SPARK-27107][BUILD][SQL] Upgrade Apache ORC to 1.5.5 ## What changes were proposed in this pull request? This PR aims to update Apache ORC dependency to fix [SPARK-27107](https://issues.apache.org/jira/browse/SPARK-27107) . ``` [ORC-452] Support converting MAP column from JSON to ORC Improvement [ORC-447] Change the docker scripts to keep a persistent m2 cache [ORC-463] Add `version` command [ORC-475] ORC reader should lazily get filesystem [ORC-476] Make SearchAgument kryo buffer size configurable ``` ## How was this patch tested? Pass the Jenkins with the existing tests. Closes #24096 from dongjoon-hyun/SPARK-27165. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-14 20:14:31 -07:00
Holden Karau	ce89d09bdf	[SPARK-26343][K8S] Try to speed up running local k8s integration tests Speed up running k8s integration tests locally by allowing folks to skip the tgz dist build and extraction Run tests locally without a distribution of Spark, just a local build Closes #23380 from holdenk/SPARK-26343-Speed-up-running-the-kubernetes-integration-tests-locally. Authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-14 19:39:48 -07:00
Gengliang Wang	6d22ee3969	[SPARK-27136][SQL] Remove data source option check_files_exist ## What changes were proposed in this pull request? The data source option check_files_exist is introduced in In #23383 when the file source V2 framework is implemented. In the PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time `FileIndex`es will always be created for file writes, so we needed the option to decide whether to check file existence. After https://github.com/apache/spark/pull/23774, the option is not needed anymore, since Dataframe writes won't create unnecessary FileIndex. This PR is to remove the option. ## How was this patch tested? Unit test. Closes #24069 from gengliangwang/removeOptionCheckFilesExist. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-15 10:19:26 +08:00
Dave DeCaprio	8819eaba4d	[SPARK-26917][SQL] Further reduce locks in CacheManager ## What changes were proposed in this pull request? Further load increases in our production environment have shown that even the read locks can cause some contention, since they contain a mechanism that turns a read lock into an exclusive lock if a writer has been starved out. This PR reduces the potential for lock contention even further than https://github.com/apache/spark/pull/23833. Additionally, it uses more idiomatic scala than the previous implementation. cloud-fan & gatorsmile This is a relatively minor improvement to the previous CacheManager changes. At this point, I think we finally are doing the minimum possible amount of locking. ## How was this patch tested? Has been tested on a live system where the blocking was causing major issues and it is working well. CacheManager has no explicit unit test but is used in many places internally as part of the SharedState. Closes #24028 from DaveDeCaprio/read-locks-master. Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu> Co-authored-by: David DeCaprio <daved@alum.mit.edu> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-15 10:13:34 +08:00
fitermay	21db4336b0	[SPARK-27070] Fix performance bug in DefaultPartitionCoalescer When trying to coalesce a UnionRDD of two large FileScanRDDs (each with a few million partitions) into around 8k partitions the driver can stall for over an hour. Profiler shows that over 90% of the time is spent in TimSort which is invoked by `pickBin`. This patch replaces sorting with a more efficient `min` for the purpose of finding the least occupied PartitionGroup Closes #23986 from fitermay/SPARK-27070. Authored-by: fitermay <fiterman@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-14 20:13:18 -05:00
Yuming Wang	f0b6245ea4	[SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles ## What changes were proposed in this pull request? `dev/mima` and `dev/scalastyle` support dynamic reading profiles from `modules.py`. ## How was this patch tested? manual tests Closes #24089 from wangyum/SPARK-27158. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-15 08:20:42 +09:00
Shahid	8b5224097b	[SPARK-27145][MINOR] Close store in the SQLAppStatusListenerSuite after test ## What changes were proposed in this pull request? We create many stores in the SQLAppStatusListenerSuite, but we need to the close store after test. ## How was this patch tested? Existing tests Closes #24079 from shahidki31/SPARK-27145. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-14 13:08:41 -07:00
Yuming Wang	da7db9abf6	[SPARK-23749][SQL] Replace built-in Hive API (isSub/toKryo) and remove OrcProto.Type usage ## What changes were proposed in this pull request? In order to make the upgrade built-in Hive changes smaller. This pr workaround the simplest 3 API changes first. ## How was this patch tested? manual tests Closes #24018 from wangyum/SPARK-23749. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Yuming Wang <wgyumg@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-14 11:41:40 -07:00
Takeshi Yamamuro	66c5cd2d9c	[SPARK-27151][SQL] ClearCacheCommand extends IgnoreCahedData to avoid plan node copys ## What changes were proposed in this pull request? In SPARK-27011, we introduced `IgnoreCahedData` to avoid plan node copys in `CacheManager`. Since `ClearCacheCommand` has no argument, it also can extend `IgnoreCahedData`. ## How was this patch tested? Pass Jenkins. Closes #24081 from maropu/SPARK-27011-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-14 11:36:16 -07:00
DylanGuedes	2fecc4a3fe	[SPARK-27138][TESTS][KAFKA] Remove AdminUtils calls (fixes deprecation) ## What changes were proposed in this pull request? To change calls to AdminUtils, currently used to create and delete topics in Kafka tests. With this change, it will rely on adminClient, the recommended way from now on. ## How was this patch tested? I ran all unit tests and they are fine. Since it is already good tested, I thought that changes in the API wouldn't require new tests, as long as the current tests are working fine. Closes #24071 from DylanGuedes/spark-27138. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-14 09:20:30 -05:00
Ajith	2a04de52dd	[SPARK-26152] Synchronize Worker Cleanup with Worker Shutdown ## What changes were proposed in this pull request? The race between org.apache.spark.deploy.DeployMessages.WorkDirCleanup event and org.apache.spark.deploy.worker.Worker#onStop. Here its possible that while the WorkDirCleanup event is being processed, org.apache.spark.deploy.worker.Worker#cleanupThreadExecutor was shutdown. hence any submission after ThreadPoolExecutor will result in java.util.concurrent.RejectedExecutionException ## How was this patch tested? Manually Closes #24056 from ajithme/workercleanup. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-14 09:16:29 -05:00
Takeshi Yamamuro	bacffb8810	[SPARK-23264][SQL] Make INTERVAL keyword optional in INTERVAL clauses when ANSI mode enabled ## What changes were proposed in this pull request? This pr updated parsing rules in `SqlBase.g4` to support a SQL query below when ANSI mode enabled; ``` SELECT CAST('2017-08-04' AS DATE) + 1 days; ``` The current master cannot parse it though, other dbms-like systems support the syntax (e.g., hive and mysql). Also, the syntax is frequently used in the official TPC-DS queries. This pr added new tokens as follows; ``` YEAR \| YEARS \| MONTH \| MONTHS \| WEEK \| WEEKS \| DAY \| DAYS \| HOUR \| HOURS \| MINUTE MINUTES \| SECOND \| SECONDS \| MILLISECOND \| MILLISECONDS \| MICROSECOND \| MICROSECONDS ``` Then, it registered the keywords below as the ANSI reserved (this follows SQL-2011); ``` DAY \| HOUR \| MINUTE \| MONTH \| SECOND \| YEAR ``` ## How was this patch tested? Added tests in `SQLQuerySuite`, `ExpressionParserSuite`, and `TableIdentifierParserSuite`. Closes #20433 from maropu/SPARK-23264. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-14 10:45:29 +09:00
Jiaxin Shan	2d0b7cfe44	[SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/23814 was reverted because of Jenkins integration tests failure. After minikube upgrade, Kubernetes client SDK v1.4.2 work with kubernetes v1.13. We can bring this change back. Reference: [Bump Kubernetes Client Version to 4.1.2](https://issues.apache.org/jira/browse/SPARK-26742) [Original PR against master](https://github.com/apache/spark/pull/23814) [Kubernetes client upgrade for Spark 2.4](https://github.com/apache/spark/pull/23993) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Unit Tests: ``` All tests passed. [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ........................... SUCCESS [ 2.343 s] [INFO] Spark Project Tags ................................. SUCCESS [ 2.039 s] [INFO] Spark Project Sketch ............................... SUCCESS [ 12.714 s] [INFO] Spark Project Local DB ............................. SUCCESS [ 2.185 s] [INFO] Spark Project Networking ........................... SUCCESS [ 38.154 s] [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 7.989 s] [INFO] Spark Project Unsafe ............................... SUCCESS [ 2.297 s] [INFO] Spark Project Launcher ............................. SUCCESS [ 2.813 s] [INFO] Spark Project Core ................................. SUCCESS [38:03 min] [INFO] Spark Project ML Local Library ..................... SUCCESS [ 3.848 s] [INFO] Spark Project GraphX ............................... SUCCESS [ 56.084 s] [INFO] Spark Project Streaming ............................ SUCCESS [04:58 min] [INFO] Spark Project Catalyst ............................. SUCCESS [06:39 min] [INFO] Spark Project SQL .................................. SUCCESS [37:12 min] [INFO] Spark Project ML Library ........................... SUCCESS [18:59 min] [INFO] Spark Project Tools ................................ SUCCESS [ 0.767 s] [INFO] Spark Project Hive ................................. SUCCESS [33:45 min] [INFO] Spark Project REPL ................................. SUCCESS [01:14 min] [INFO] Spark Project Assembly ............................. SUCCESS [ 1.444 s] [INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [01:12 min] [INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 6.719 s] [INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [07:00 min] [INFO] Spark Project Examples ............................. SUCCESS [ 21.805 s] [INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 0.906 s] [INFO] Spark Avro ......................................... SUCCESS [ 50.486 s] [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 02:32 h [INFO] Finished at: 2019-03-07T08:39:34Z [INFO] ------------------------------------------------------------------------ ``` Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24002 from Jeffwan/update_k8s_sdk_master. Authored-by: Jiaxin Shan <seedjeffwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-13 15:04:27 -07:00
Dongjoon Hyun	250946ff93	[SPARK-27123][SQL][FOLLOWUP] Use isRenaming check for limit too. ## What changes were proposed in this pull request? This is a followup for https://github.com/apache/spark/pull/24049 to reduce the scope of pattern based on the review comments. ## How was this patch tested? Pass the existing test. Closes #24082 from dongjoon-hyun/SPARK-27123-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-13 15:01:01 -07:00
Jungtaek Lim (HeartSaVioR)	733f2c0b98	[MINOR][SQL] Deduplicate huge if statements in get between specialized getters ## What changes were proposed in this pull request? This patch deduplicates the huge if statements regarding getting value between specialized getters. ## How was this patch tested? Existing UT. Closes #24016 from HeartSaVioR/MINOR-deduplicate-get-from-specialized-getters. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-13 15:52:21 -05:00
Dongjoon Hyun	3221bf4cd5	[SPARK-27034][SPARK-27123][SQL][FOLLOWUP] Update Nested Schema Pruning BM result with EC2 ## What changes were proposed in this pull request? This is a follow up PR for #23943 in order to update the benchmark result with EC2 `r3.xlarge` instance. ## How was this patch tested? N/A. (Manually compare the diff) Closes #24078 from dongjoon-hyun/SPARK-27034. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-03-13 20:27:10 +00:00
Wenchen Fan	2a80a4cd39	[SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions ## What changes were proposed in this pull request? It's a little awkward to have 2 different classes(`CaseInsensitiveStringMap` and `DataSourceOptions`) to present the options in data source and catalog API. This PR merges these 2 classes, while keeping the name `CaseInsensitiveStringMap`, which is more precise. ## How was this patch tested? existing tests Closes #24025 from cloud-fan/option. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-14 01:23:27 +08:00
Dave DeCaprio	812ad55461	[SPARK-26103][SQL] Limit the length of debug strings for query plans ## What changes were proposed in this pull request? The PR puts in a limit on the size of a debug string generated for a tree node. Helps to fix out of memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. AN alternative solution was proposed in #23076, but that solution doesn't address all the cases that can cause a large query. This limit is only on calls treeString that don't pass a Writer, which makes it play nicely with #22429, #23018 and #23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI. - A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans. - When plans are truncated, "..." is printed to indicate that it isn't a full plan - A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit. ## How was this patch tested? Unit tests were created for the new SizeLimitedWriter. Also a unit test for TreeNode was created that checks that a long plan is correctly truncated. Closes #23169 from DaveDeCaprio/text-plan-size. Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu> Co-authored-by: David DeCaprio <daved@alum.mit.edu> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-13 09:58:43 -07:00
Wenchen Fan	d3813d8b21	[SPARK-27064][SS] create StreamingWrite at the beginning of streaming execution ## What changes were proposed in this pull request? According to the [design](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing), the life cycle of `StreamingWrite` should be the same as the read side `MicroBatch/ContinuousStream`, i.e. each run of the stream query, instead of each epoch. This PR fixes it. ## How was this patch tested? existing tests Closes #23981 from cloud-fan/dsv2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-13 19:47:54 +08:00
Liang-Chi Hsieh	f55c760df6	[SPARK-27034][SQL][FOLLOWUP] Rename ParquetSchemaPruning to SchemaPruning ## What changes were proposed in this pull request? This is a followup to #23943. This proposes to rename ParquetSchemaPruning to SchemaPruning as ParquetSchemaPruning supports both Parquet and ORC v1 now. ## How was this patch tested? Existing tests. Closes #24077 from viirya/nested-schema-pruning-orc-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-13 20:12:01 +09:00
DB Tsai	2b9ad2516e	[MINOR][BUILD] Add Scala 2.12 profile back for branch-2.4 build Closes #24074 from dbtsai/scala-2.12. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-12 20:08:52 -07:00
Jungtaek Lim (HeartSaVioR)	1b06cda532	[MINOR][SQL] Refactor RowEncoder to use existing (De)serializerBuildHelper methods ## What changes were proposed in this pull request? This patch proposes to reuse existing methods in (De)serializerBuildHelper in RowEncoder to achieve deduplication as well as consistent creation of serialization/deserialization of same type. ## How was this patch tested? Existing UT. Closes #24014 from HeartSaVioR/SPARK-27092. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-13 10:54:47 +08:00
Takeshi Yamamuro	1e9469bb7a	[SPARK-26976][SQL] Forbid reserved keywords as identifiers when ANSI mode is on ## What changes were proposed in this pull request? This pr added code to forbid reserved keywords as identifiers when ANSI mode is on. This is a follow-up of SPARK-26215(#23259). ## How was this patch tested? Added tests in `TableIdentifierParserSuite`. Closes #23880 from maropu/SPARK-26976. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-13 11:20:27 +09:00
hehuiyuan	7beb464564	[MINOR][DOC] Fix the description of Pod Metadata's annotations ## What changes were proposed in this pull request? ![annotation](https://user-images.githubusercontent.com/18002496/54189638-2d551780-44ed-11e9-9efc-3691bec42130.jpg) Closes #24064 from hehuiyuan/hehuiyuan-patch-4. Authored-by: hehuiyuan <hehuiyuan@ZBMAC-C02WD3K5H.local> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-12 19:29:32 -05:00
Ajith	e60d8fce0b	[SPARK-27045][SQL] SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver ## What changes were proposed in this pull request? When we run sql in spark via SparkSQLDriver (thrift server, spark-sql), SQL string is siet via ``setJobDescription``. the SparkUI SQL tab must show SQL instead of stacktrace in case ``setJobDescription`` is set which is more useful to end user. Instead it currently shows in description column the callsite shortform which is less useful ![image](https://user-images.githubusercontent.com/22072336/53734682-aaa7d900-3eaa-11e9-957b-0e5006db417e.png) ## How was this patch tested? Manually: ![image](https://user-images.githubusercontent.com/22072336/53734657-9f54ad80-3eaa-11e9-8dc5-2b38f6970f4e.png) Closes #23958 from ajithme/sqlui. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-12 16:14:29 -07:00
Yuming Wang	dccf6615c3	[SPARK-27130][BUILD] Automatically select profile when executing sbt-checkstyle ## What changes were proposed in this pull request? This PR makes it automatically select profile when executing `sbt-checkstyle`. The reason for this is that `hadoop-2.7` and `hadoop-3.1` may have different `hive-thriftserver` module in the future. ## How was this patch tested? manual tests: ``` Update AbstractService.java file. export HADOOP_PROFILE=hadoop2.7 ./dev/run-tests ``` The result: ![image](https://user-images.githubusercontent.com/5399861/54197992-5337e780-4500-11e9-930c-722982cdcd45.png) Closes #24065 from wangyum/SPARK-27130. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-13 08:03:46 +09:00
Jungtaek Lim (HeartSaVioR)	f57af2286f	[MINOR][CORE] Use https for bintray spark-packages repository ## What changes were proposed in this pull request? This patch changes the schema of url from http to https for bintray spark-packages repository. Looks like we already changed the schema of repository url for pom.xml but missed inside the code. ## How was this patch tested? Manually ran the `--package` via `./bin/spark-shell --verbose --packages "RedisLabs:spark-redis:0.3.2"` ``` ... Ivy Default Cache set to: /Users/jlim/.ivy2/cache The jars for the packages stored in: /Users/jlim/.ivy2/jars :: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml RedisLabs#spark-redis added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0 confs: [default] found RedisLabs#spark-redis;0.3.2 in spark-packages found redis.clients#jedis;2.7.2 in central found org.apache.commons#commons-pool2;2.3 in central downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ... [SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms) downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ... [SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms) downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ... [SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms) :: resolution report :: resolve 4586ms :: artifacts dl 1555ms :: modules in use: RedisLabs#spark-redis;0.3.2 from spark-packages in [default] org.apache.commons#commons-pool2;2.3 from central in [default] redis.clients#jedis;2.7.2 from central in [default] --------------------------------------------------------------------- \| \| modules \|\| artifacts \| \| conf \| number\| search\|dwnlded\|evicted\|\| number\|dwnlded\| --------------------------------------------------------------------- \| default \| 3 \| 3 \| 3 \| 0 \|\| 3 \| 3 \| --------------------------------------------------------------------- ``` Closes #24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-12 18:01:16 -05:00
Liang-Chi Hsieh	b0c2b3bfd9	[SPARK-27034][SQL] Nested schema pruning for ORC ## What changes were proposed in this pull request? We only supported nested schema pruning for Parquet previously. This proposes to support nested schema pruning for ORC too. Note: This only covers ORC v1. For ORC v2, the necessary change is at the schema pruning rule. We should deal with ORC v2 as a TODO item, in order to reduce review burden. ## How was this patch tested? Added tests. Closes #23943 from viirya/nested-schema-pruning-orc. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-12 15:39:16 -07:00
Dongjoon Hyun	78314af580	[SPARK-27123][SQL] Improve CollapseProject to handle projects cross limit/repartition/sample ## What changes were proposed in this pull request? `CollapseProject` optimizer rule simplifies some plans by merging the adjacent projects and performing alias substitutions. ```scala scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain == Physical Plan == (1) Project [a#5 AS c#1] +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] ``` We can do that more complex cases like the following. This PR aims to handle adjacent projects across limit/repartition/sample. Here, repartition means `Repartition`, not `RepartitionByExpression`. BEFORE* ```scala scala> sql("SELECT b c FROM (SELECT /+ REPARTITION(1) / a b FROM t)").explain == Physical Plan == (2) Project [b#0 AS c#1] +- Exchange RoundRobinPartitioning(1) +- (1) Project [a#5 AS b#0] +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] ``` AFTER ```scala scala> sql("SELECT b c FROM (SELECT /+ REPARTITION(1) / a b FROM t)").explain == Physical Plan == Exchange RoundRobinPartitioning(1) +- *(1) Project [a#11 AS c#7] +- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11] ``` ## How was this patch tested? Pass the Jenkins with the newly added and updated test cases. Closes #24049 from dongjoon-hyun/SPARK-27123. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-03-12 21:45:40 +00:00
Liupengcheng	d5cfe08fdc	[SPARK-26927][CORE] Ensure executor is active when processing events in dynamic allocation manager. ## What changes were proposed in this pull request? There is a race condition in the `ExecutorAllocationManager` that the `SparkListenerExecutorRemoved` event is posted before the `SparkListenerTaskStart` event, which will cause the incorrect result of `executorIds`. Then, when some executor idles, the real executors will be removed even actual executor number is equal to `minNumExecutors` due to the incorrect computation of `newExecutorTotal`(may greater than the `minNumExecutors`), thus may finally causing zero available executors but a wrong positive number of executorIds was kept in memory. What's more, even the `SparkListenerTaskEnd` event can not make the fake `executorIds` released, because later idle event for the fake executors can not cause the real removal of these executors, as they are already removed and they are not exist in the `executorDataMap` of `CoaseGrainedSchedulerBackend`, so that the `onExecutorRemoved` method will never be called again. For details see https://issues.apache.org/jira/browse/SPARK-26927 This PR is to fix this problem. ## How was this patch tested? existUT and added UT Closes #23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-12 13:53:42 -07:00
ankurgupta	688b0c01fa	[SPARK-26089][CORE] Handle corruption in large shuffle blocks ## What changes were proposed in this pull request? SPARK-4105 added corruption detection in shuffle blocks but that was limited to blocks which are smaller than maxBytesInFlight/3. This commit adds upon that by adding corruption check for large blocks. There are two changes/improvements that are made in this commit: 1. Large blocks are checked upto maxBytesInFlight/3 size in a similar way as smaller blocks, so if a large block is corrupt in the starting, that block will be re-fetched and if that also fails, FetchFailureException will be thrown. 2. If large blocks are corrupt after size maxBytesInFlight/3, then any IOException thrown while reading the stream will be converted to FetchFailureException. This is slightly more aggressive than was originally intended but since the consumer of the stream may have already read some records and processed them, we can't just re-fetch the block, we need to fail the whole task. Additionally, we also thought about maybe adding a new type of TaskEndReason, which would re-try the task couple of times before failing the previous stage, but given the complexity involved in that solution we decided to not proceed in that direction. Thanks to squito for direction and support. ## How was this patch tested? Changed the junit test for big blocks to check for corruption. Closes #23453 from ankuriitg/ankurgupta/SPARK-26089. Authored-by: ankurgupta <ankur.gupta@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-03-12 14:27:44 -05:00
zuotingbing	3f9247de1e	[SPARK-27010][SQL] Find out the actual port number when hive.server2.thrift.port=0 ## What changes were proposed in this pull request? Currently, if we set hive.server2.thrift.port=0, it hard to find out the actual port number which one we should use when using beeline to connect. before: ![2019-02-28_170942](https://user-images.githubusercontent.com/24823338/53557240-779ad800-3b80-11e9-9567-175f28aa61da.png) after: ![2019-02-28_170904](https://user-images.githubusercontent.com/24823338/53557255-7f5a7c80-3b80-11e9-8ba6-9764c03e5407.png) use beeline to connect success: ![2019-02-28_170844](https://user-images.githubusercontent.com/24823338/53557267-85e8f400-3b80-11e9-90a5-f7f53a51cc32.png) ## How was this patch tested? manual tests Closes #23917 from zuotingbing/SPARK-27010. Authored-by: zuotingbing <zuo.tingbing9@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-12 13:38:41 -05:00

... 12 13 14 15 16 ...

24617 commits