ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
hyukjinkwon	bfe60fcdb4	[SPARK-24934][SQL] Explicitly whitelist supported types in upper/lower bounds for in-memory partition pruning ## What changes were proposed in this pull request? Looks we intentionally set `null` for upper/lower bounds for complex types and don't use it. However, these look used in in-memory partition pruning, which ends up with incorrect results. This PR proposes to explicitly whitelist the supported types. ```scala val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") df.cache().filter("arrayCol > array('a', 'b')").show() ``` ```scala val df = sql("select cast('a' as binary) as a") df.cache().filter("a == cast('a' as binary)").show() ``` Before: ``` +--------+ \|arrayCol\| +--------+ +--------+ ``` ``` +---+ \| a\| +---+ +---+ ``` After: ``` +--------+ \|arrayCol\| +--------+ \| [c, d]\| +--------+ ``` ``` +----+ \| a\| +----+ \|[61]\| +----+ ``` ## How was this patch tested? Unit tests were added and manually tested. Author: hyukjinkwon <gurwls223@apache.org> Closes #21882 from HyukjinKwon/stats-filter.	2018-07-30 13:20:03 +08:00
Dilip Biswal	65a4bc143a	[SPARK-21274][SQL] Implement INTERSECT ALL clause ## What changes were proposed in this pull request? Implements INTERSECT ALL clause through query rewrites using existing operators in Spark. Please refer to [Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for the design. Input Query ``` SQL SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2 ``` Rewritten Query ```SQL SELECT c1 FROM ( SELECT replicate_row(min_count, c1) FROM ( SELECT c1, IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count FROM ( SELECT c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt FROM ( SELECT c1, true as vcol1, null as vcol2 FROM ut1 UNION ALL SELECT c1, null as vcol1, true as vcol2 FROM ut2 ) AS union_all GROUP BY c1 HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1 ) ) ) ``` ## How was this patch tested? Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #21886 from dilipbiswal/dkb_intersect_all_final.	2018-07-29 22:11:01 -07:00
hyukjinkwon	6690924c49	[MINOR] Avoid the 'latest' link that might vary per release in functions.scala's comment ## What changes were proposed in this pull request? This PR propose to address https://github.com/apache/spark/pull/21318#discussion_r187843125 comment. This is rather a nit but looks we better avoid to update the link for each release since it always points the latest (it doesn't look like worth enough updating release guide on the other hand as well). ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@apache.org> Closes #21907 from HyukjinKwon/minor-fix.	2018-07-30 10:02:29 +08:00
hyukjinkwon	3210121fed	[MINOR][BUILD] Remove -Phive-thriftserver profile within appveyor.yml ## What changes were proposed in this pull request? This PR propose to remove `-Phive-thriftserver` profile which seems not affecting the SparkR tests in AppVeyor. Originally wanted to check if there's a meaningful build time decrease but seems not. It will have but seems not meaningfully decreased. ## How was this patch tested? AppVeyor tests: ``` [00:40:49] Attaching package: 'SparkR' [00:40:49] [00:40:49] The following objects are masked from 'package:testthat': [00:40:49] [00:40:49] describe, not [00:40:49] [00:40:49] The following objects are masked from 'package:stats': [00:40:49] [00:40:49] cov, filter, lag, na.omit, predict, sd, var, window [00:40:49] [00:40:49] The following objects are masked from 'package:base': [00:40:49] [00:40:49] as.data.frame, colnames, colnames<-, drop, endsWith, intersect, [00:40:49] rank, rbind, sample, startsWith, subset, summary, transform, union [00:40:49] [00:40:49] Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:41:43] basic tests for CRAN: ............. [00:41:43] [00:41:43] DONE =========================================================================== [00:41:43] binary functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:05] ........... [00:42:05] functions on binary files: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:10] .... [00:42:10] broadcast variables: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:12] .. [00:42:12] functions in client.R: ..... [00:42:30] test functions in sparkR.R: .............................................. [00:42:30] include R packages: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:31] [00:42:31] JVM API: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:42:31] .. [00:42:31] MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:48:48] ...................................................................... [00:48:48] MLlib clustering algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:12] ..................................................................... [00:50:12] MLlib frequent pattern mining: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:18] ..... [00:50:18] MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:50:27] ........ [00:50:27] MLlib regression algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:56:00] ................................................................................................................................ [00:56:00] MLlib statistics algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:56:04] ........ [00:56:04] MLlib tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:58:20] .............................................................................................. [00:58:20] parallelize() and collect(): Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [00:58:20] ............................. [00:58:20] basic RDD functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:03:35] ............................................................................................................................................................................................................................................................................................................................................................................................................................................ [01:03:35] SerDe functionality: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:03:39] ............................... [01:03:39] partitionBy, groupByKey, reduceByKey etc.: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:04:20] .................... [01:04:20] functions in sparkR.R: .... [01:04:20] SparkSQL functions: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:04:50] ........................................................................................................................................-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:04:50] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:50] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:04:51] -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:04:51] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:06:13] ............................................................................................................................................................................................................................................................................................................................................................-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:06:13] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:06:14] .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:06:14] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:06:14] ....-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group [01:06:14] Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... [01:12:30] ................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... [01:12:30] Structured Streaming: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:27] .......................................... [01:14:27] tests RDD function take(): Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:28] ................ [01:14:28] the textFile() function: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:44] ............. [01:14:44] functions in utils.R: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. [01:14:46] ............................................ [01:14:46] Windows-specific tests: . [01:14:46] [01:14:46] DONE =========================================================================== [01:15:29] Build success ``` Author: hyukjinkwon <gurwls223@apache.org> Closes #21894 from HyukjinKwon/wip-build.	2018-07-30 10:01:18 +08:00
Xingbo Jiang	3695ba5773	[MINOR][CORE][TEST] Fix afterEach() in TastSetManagerSuite and TaskSchedulerImplSuite ## What changes were proposed in this pull request? In the `afterEach()` method of both `TastSetManagerSuite` and `TaskSchedulerImplSuite`, `super.afterEach()` shall be called at the end, because it shall stop the SparkContext. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93706/testReport/org.apache.spark.scheduler/TaskSchedulerImplSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ The test failure is caused by the above reason, the newly added `barrierCoordinator` required `rpcEnv` which has been stopped before `TaskSchedulerImpl` doing cleanup. ## How was this patch tested? Existing tests. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #21908 from jiangxb1987/afterEach.	2018-07-30 09:58:28 +08:00
liulijia	2c54aae1bc	[SPARK-24809][SQL] Serializing LongToUnsafeRowMap in executor may result in data error When join key is long or int in broadcast join, Spark will use `LongToUnsafeRowMap` to store key-values of the table witch will be broadcasted. But, when `LongToUnsafeRowMap` is broadcasted to executors, and it is too big to hold in memory, it will be stored in disk. At that time, because `write` uses a variable `cursor` to determine how many bytes in `page` of `LongToUnsafeRowMap` will be write out and the `cursor` was not restore when deserializing, executor will write out nothing from page into disk. ## What changes were proposed in this pull request? Restore cursor value when deserializing. Author: liulijia <liutang123@yeah.net> Closes #21772 from liutang123/SPARK-24809.	2018-07-29 13:13:00 -07:00
Kazuaki Ishizaki	8fe5d2c393	[SPARK-24956][Build][test-maven] Upgrade maven version to 3.5.4 ## What changes were proposed in this pull request? This PR updates maven version from 3.3.9 to 3.5.4. The current build process uses mvn 3.3.9 that was release on 2015, which looks pretty old. We met [an issue](https://issues.apache.org/jira/browse/SPARK-24895) to need the maven 3.5.2 or later. The release note of the 3.5.4 is [here](https://maven.apache.org/docs/3.5.4/release-notes.html). Note version 3.4 was skipped. From [the release note of the 3.5.0](https://maven.apache.org/docs/3.5.0/release-notes.html), the followings are new features: 1. ANSI color logging for improved output visibility 1. add support for module name != artifactId in every calculated URLs (project, SCM, site): special project.directory property 1. create a slf4j-simple provider extension that supports level color rendering 1. ModelResolver interface enhancement: addition of resolveModel(Dependency) supporting version ranges ## How was this patch tested? Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21905 from kiszk/SPARK-24956.	2018-07-29 08:31:16 -05:00
Chris Martin	c5b8d54c61	[SPARK-24950][SQL] DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13 ## What changes were proposed in this pull request? - Update DateTimeUtilsSuite so that when testing roundtripping in daysToMillis and millisToDays multiple skipdates can be specified. - Updated test so that both new years eve 2014 and new years day 2015 are skipped for kiribati time zones. This is necessary as java versions pre 181-b13 considered new years day 2015 to be skipped while susequent versions corrected this to new years eve. ## How was this patch tested? Unit tests Author: Chris Martin <chris@cmartinit.co.uk> Closes #21901 from d80tb7/SPARK-24950_datetimeUtilsSuite_failures.	2018-07-28 10:40:10 -05:00
Xiao Li	c6a3db2fb6	[SPARK-24924][SQL][FOLLOW-UP] Add mapping for built-in Avro data source ## What changes were proposed in this pull request? Add one more test case for `com.databricks.spark.avro`. ## How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #21906 from gatorsmile/avro.	2018-07-28 13:43:32 +08:00
Li Jin	e8752095a0	[SPARK-24624][SQL][PYTHON] Support mixture of Python UDF and Scalar Pandas UDF ## What changes were proposed in this pull request? This PR add supports for using mixed Python UDF and Scalar Pandas UDF, in the following two cases: (1) ``` from pyspark.sql.functions import udf, pandas_udf udf('int') def f1(x): return x + 1 pandas_udf('int') def f2(x): return x + 1 df = spark.range(0, 1).toDF('v') \ .withColumn('foo', f1(col('v'))) \ .withColumn('bar', f2(col('v'))) ``` QueryPlan: ``` >>> df.explain(True) == Parsed Logical Plan == 'Project [v#2L, foo#5, f2('v) AS bar#9] +- AnalysisBarrier +- Project [v#2L, f1(v#2L) AS foo#5] +- Project [id#0L AS v#2L] +- Range (0, 1, step=1, splits=Some(4)) == Analyzed Logical Plan == v: bigint, foo: int, bar: int Project [v#2L, foo#5, f2(v#2L) AS bar#9] +- Project [v#2L, f1(v#2L) AS foo#5] +- Project [id#0L AS v#2L] +- Range (0, 1, step=1, splits=Some(4)) == Optimized Logical Plan == Project [id#0L AS v#2L, f1(id#0L) AS foo#5, f2(id#0L) AS bar#9] +- Range (0, 1, step=1, splits=Some(4)) == Physical Plan == (2) Project [id#0L AS v#2L, pythonUDF0#13 AS foo#5, pythonUDF0#14 AS bar#9] +- ArrowEvalPython [f2(id#0L)], [id#0L, pythonUDF0#13, pythonUDF0#14] +- BatchEvalPython [f1(id#0L)], [id#0L, pythonUDF0#13] +- (1) Range (0, 1, step=1, splits=4) ``` (2) ``` from pyspark.sql.functions import udf, pandas_udf udf('int') def f1(x): return x + 1 pandas_udf('int') def f2(x): return x + 1 df = spark.range(0, 1).toDF('v') df = df.withColumn('foo', f2(f1(df['v']))) ``` QueryPlan: ``` >>> df.explain(True) == Parsed Logical Plan == Project [v#21L, f2(f1(v#21L)) AS foo#46] +- AnalysisBarrier +- Project [v#21L, f1(f2(v#21L)) AS foo#39] +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#32] +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#25] +- Project [id#19L AS v#21L] +- Range (0, 1, step=1, splits=Some(4)) == Analyzed Logical Plan == v: bigint, foo: int Project [v#21L, f2(f1(v#21L)) AS foo#46] +- Project [v#21L, f1(f2(v#21L)) AS foo#39] +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#32] +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#25] +- Project [id#19L AS v#21L] +- Range (0, 1, step=1, splits=Some(4)) == Optimized Logical Plan == Project [id#19L AS v#21L, f2(f1(id#19L)) AS foo#46] +- Range (0, 1, step=1, splits=Some(4)) == Physical Plan == (2) Project [id#19L AS v#21L, pythonUDF0#50 AS foo#46] +- ArrowEvalPython [f2(pythonUDF0#49)], [id#19L, pythonUDF0#49, pythonUDF0#50] +- BatchEvalPython [f1(id#19L)], [id#19L, pythonUDF0#49] +- (1) Range (0, 1, step=1, splits=4) ``` ## How was this patch tested? New tests are added to BatchEvalPythonExecSuite and ScalarPandasUDFTests Author: Li Jin <ice.xelloss@gmail.com> Closes #21650 from icexelloss/SPARK-24624-mix-udf.	2018-07-28 13:41:07 +08:00
Reynold Xin	6424b146c9	[MINOR] Update docs for functions.scala to make it clear not all the built-in functions are defined there The title summarizes the change. Author: Reynold Xin <rxin@databricks.com> Closes #21318 from rxin/functions.	2018-07-27 17:24:55 -07:00
Reynold Xin	34ebcc6b52	[MINOR] Improve documentation for HiveStringType's The diff should be self-explanatory. Author: Reynold Xin <rxin@databricks.com> Closes #21897 from rxin/hivestringtypedoc.	2018-07-27 15:34:06 -07:00
Dilip Biswal	10f1f19659	[SPARK-21274][SQL] Implement EXCEPT ALL clause. ## What changes were proposed in this pull request? Implements EXCEPT ALL clause through query rewrites using existing operators in Spark. In this PR, an internal UDTF (replicate_rows) is added to aid in preserving duplicate rows. Please refer to [Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for the design. Note This proposed UDTF is kept as a internal function that is purely used to aid with this particular rewrite to give us flexibility to change to a more generalized UDTF in future. Input Query ``` SQL SELECT c1 FROM ut1 EXCEPT ALL SELECT c1 FROM ut2 ``` Rewritten Query ```SQL SELECT c1 FROM ( SELECT replicate_rows(sum_val, c1) FROM ( SELECT c1, sum_val FROM ( SELECT c1, sum(vcol) AS sum_val FROM ( SELECT 1L as vcol, c1 FROM ut1 UNION ALL SELECT -1L as vcol, c1 FROM ut2 ) AS union_all GROUP BY union_all.c1 ) WHERE sum_val > 0 ) ) ``` ## How was this patch tested? Added test cases in SQLQueryTestSuite, DataFrameSuite and SetOperationSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #21857 from dilipbiswal/dkb_except_all_final.	2018-07-27 13:47:33 -07:00
Hieu Huynh	5828f41a52	[SPARK-13343] speculative tasks that didn't commit shouldn't be marked as success Description Currently Speculative tasks that didn't commit can show up as success (depending on timing of commit). This is a bit confusing because that task didn't really succeed in the sense it didn't write anything. I think these tasks should be marked as KILLED or something that is more obvious to the user exactly what happened. it is happened to hit the timing where it got a commit denied exception then it shows up as failed and counts against your task failures. It shouldn't count against task failures since that failure really doesn't matter. MapReduce handles these situation so perhaps we can look there for a model. <img width="1420" alt="unknown" src="https://user-images.githubusercontent.com/15680678/42013170-99db48c2-7a61-11e8-8c7b-ef94c84e36ea.png"> How can this issue happen? When both attempts of a task finish before the driver sends command to kill one of them, both of them send the status update FINISHED to the driver. The driver calls TaskSchedulerImpl to handle one successful task at a time. When it handles the first successful task, it sends the command to kill the other copy of the task, however, because that task is already finished, the executor will ignore the command. After finishing handling the first attempt, it processes the second one, although all actions on the result of this task are skipped, this copy of the task is still marked as SUCCESS. As a result, even though this issue does not affect the result of the job, it might cause confusing to user because both of them appear to be successful. How does this PR fix the issue? The simple way to fix this issue is that when taskSetManager handles successful task, it checks if any other attempt succeeded. If this is the case, it will call handleFailedTask with state==KILLED and reason==TaskKilled(“another attempt succeeded”) to handle this task as begin killed. How was this patch tested? I tested this manually by running applications, that caused the issue before, a few times, and observed that the issue does not happen again. Also, I added a unit test in TaskSetManagerSuite to test that if we call handleSuccessfulTask to handle status update for 2 copies of a task, only the one that is handled first will be mark as SUCCESS Author: Hieu Huynh <“Hieu.huynh@oath.com”> Author: hthuynh2 <hthieu96@gmail.com> Closes #21653 from hthuynh2/SPARK_13343.	2018-07-27 12:34:14 -05:00
Karthik Palaniappan	ee5a5a0925	[SPARK-21960][STREAMING] Spark Streaming Dynamic Allocation should respect spark.executor.instances ## What changes were proposed in this pull request? Removes check that `spark.executor.instances` is set to 0 when using Streaming DRA. ## How was this patch tested? Manual tests My only concern with this PR is that `spark.executor.instances` (or the actual initial number of executors that the cluster manager gives Spark) can be outside of `spark.streaming.dynamicAllocation.minExecutors` to `spark.streaming.dynamicAllocation.maxExecutors`. I don't see a good way around that, because this code only runs after the SparkContext has been created. Author: Karthik Palaniappan <karthikpal@google.com> Closes #19183 from karth295/master.	2018-07-27 12:18:56 -05:00
Maxim Gekk	0a0f68bae6	[SPARK-24881][SQL] New Avro option - compression ## What changes were proposed in this pull request? In the PR, I added new option for Avro datasource - `compression`. The option allows to specify compression codec for saved Avro files. This option is similar to `compression` option in another datasources like `JSON` and `CSV`. Also I added the SQL configs `spark.sql.avro.compression.codec` and `spark.sql.avro.deflate.level`. I put the configs into `SQLConf`. If the `compression` option is not specified by an user, the first SQL config is taken into account. ## How was this patch tested? I added new test which read meta info from written avro files and checks `avro.codec` property. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21837 from MaxGekk/avro-compression.	2018-07-28 00:11:32 +08:00
Cheng Lian	c9bec1d371	[SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-java cannot be "provided" ## What changes were proposed in this pull request? Please see [SPARK-24927][1] for more details. [1]: https://issues.apache.org/jira/browse/SPARK-24927 ## How was this patch tested? Manually tested. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #21879 from liancheng/spark-24927. (cherry picked from commit `d5f340f277`) Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-07-27 08:58:42 -07:00
pkuwm	ef6c8395c4	[SPARK-23928][SQL] Add shuffle collection function. ## What changes were proposed in this pull request? This PR adds a new collection function: shuffle. It generates a random permutation of the given array. This implementation uses the "inside-out" version of Fisher-Yates algorithm. ## How was this patch tested? New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala. Author: Takuya UESHIN <ueshin@databricks.com> Author: pkuwm <ihuizhi.lu@gmail.com> Closes #21802 from ueshin/issues/SPARK-23928/shuffle.	2018-07-27 23:02:48 +09:00
maryannxue	21fcac1645	[SPARK-24288][SQL] Add a JDBC Option to enable preventing predicate pushdown ## What changes were proposed in this pull request? Add a JDBC Option "pushDownPredicate" (default `true`) to allow/disallow predicate push-down in JDBC data source. ## How was this patch tested? Add a test in `JDBCSuite` Author: maryannxue <maryannxue@apache.org> Closes #21875 from maryannxue/spark-24288.	2018-07-26 23:47:32 -07:00
Reynold Xin	e6e9031d7b	[SPARK-24865] Remove AnalysisBarrier ## What changes were proposed in this pull request? AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed (don't re-analyze nodes that have already been analyzed). Before AnalysisBarrier, we already had some infrastructure in place, with analysis specific functions (resolveOperators and resolveExpressions). These functions do not recursively traverse down subplans that are already analyzed (with a mutable boolean flag _analyzed). The issue with the old system was that developers started using transformDown, which does a top-down traversal of the plan tree, because there was not top-down resolution function, and as a result analyzer performance became pretty bad. In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a special node and for this special node, transform/transformUp/transformDown don't traverse down. However, the introduction of this special node caused a lot more troubles than it solves. This implicit node breaks assumptions and code in a few places, and it's hard to know when analysis barrier would exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR discussions demonstrates it is a source of bugs and additional complexity. Instead, this pull request removes AnalysisBarrier and reverts back to the old approach. We added infrastructure in tests that fail explicitly if transform methods are used in the analyzer. ## How was this patch tested? Added a test suite AnalysisHelperSuite for testing the resolve* methods and transform* methods. Author: Reynold Xin <rxin@databricks.com> Author: Xiao Li <gatorsmile@gmail.com> Closes #21822 from rxin/SPARK-24865.	2018-07-27 14:29:05 +08:00
hyukjinkwon	f9c9d80e46	[SPARK-24929][INFRA] Make merge script don't swallow KeyboardInterrupt ## What changes were proposed in this pull request? If you want to get out of the loop to assign JIRA's user by command+c (KeyboardInterrupt), I am unable to get out. I faced this problem when the user doesn't have a contributor role and I just wanted to cancel and manually take an action to the JIRA. Before: ``` JIRA is unassigned, choose assignee [0] todd.chen (Reporter) Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last): File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee "Enter number of user, or userid, to assign to (blank to leave unassigned):") KeyboardInterrupt Error assigning JIRA, try again (or leave blank and fix manually) JIRA is unassigned, choose assignee [0] todd.chen (Reporter) Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last): File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee "Enter number of user, or userid, to assign to (blank to leave unassigned):") KeyboardInterrupt ``` After: ``` JIRA is unassigned, choose assignee [0] Dongjoon Hyun (Reporter) Enter number of user, or userid to assign to (blank to leave unassigned):Traceback (most recent call last): File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee "Enter number of user, or userid to assign to (blank to leave unassigned):") KeyboardInterrupt Restoring head pointer to master git checkout master Already on 'master' git branch ``` ## How was this patch tested? I tested this manually (I use my own merging script with few fixes). Author: hyukjinkwon <gurwls223@apache.org> Closes #21880 from HyukjinKwon/key-error.	2018-07-27 13:29:54 +08:00
zuotingbing	dc3713cca2	[SPARK-24829][STS] In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql ## What changes were proposed in this pull request? SELECT CAST('4.56' AS FLOAT) the result is 4.559999942779541 ![2018-07-18_110944](https://user-images.githubusercontent.com/24823338/42857199-7c6783da-8a7b-11e8-8c69-1e9302102525.png) it should be 4.56 as same as in spark-shell or spark-sql. ![2018-07-18_111111](https://user-images.githubusercontent.com/24823338/42857210-80c89e96-8a7b-11e8-9f8c-de1a79a73752.png) ## How was this patch tested? add unit tests Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes #21789 from zuotingbing/SPARK-24829.	2018-07-27 13:27:17 +08:00
Misha Dmitriev	094aa59715	[SPARK-24801][CORE] Avoid memory waste by empty byte[] arrays in SaslEncryption$EncryptedMessage ## What changes were proposed in this pull request? Initialize SaslEncryption$EncryptedMessage.byteChannel lazily, so that empty, not yet used instances of ByteArrayWritableChannel referenced by this field don't use up memory. I analyzed a heap dump from Yarn Node Manager where this code is used, and found that there are over 40,000 of the above objects in memory, each with a big empty byte[] array. The reason they are all there is because of Netty queued up a large number of messages in memory before transferTo() is called. There is a small number of netty ChannelOutboundBuffer objects, and then collectively , via linked lists starting from their flushedEntry data fields, they end up referencing over 40K ChannelOutboundBuffer$Entry objects, which ultimately reference EncryptedMessage objects. ## How was this patch tested? Ran all the tests locally. Author: Misha Dmitriev <misha@cloudera.com> Closes #21811 from countmdm/misha/spark-24801.	2018-07-26 22:15:12 -05:00
Gengliang Wang	fa09d91925	[SPARK-24919][BUILD] New linter rule for sparkContext.hadoopConfiguration ## What changes were proposed in this pull request? In most cases, we should use `spark.sessionState.newHadoopConf()` instead of `sparkContext.hadoopConfiguration`, so that the hadoop configurations specified in Spark session configuration will come into effect. Add a rule matching `spark.sparkContext.hadoopConfiguration` or `spark.sqlContext.sparkContext.hadoopConfiguration` to prevent the usage. ## How was this patch tested? Unit test Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21873 from gengliangwang/linterRule.	2018-07-26 16:50:59 -07:00
Imran Rashid	2c82745686	[SPARK-24307][CORE] Add conf to revert to old code. In case there are any issues in converting FileSegmentManagedBuffer to ChunkedByteBuffer, add a conf to go back to old code path. Followup to `7e847646d1` Author: Imran Rashid <irashid@cloudera.com> Closes #21867 from squito/SPARK-24307-p2.	2018-07-26 12:13:27 -07:00
Xingbo Jiang	e3486e1b95	[SPARK-24795][CORE] Implement barrier execution mode ## What changes were proposed in this pull request? Propose new APIs and modify job/task scheduling to support barrier execution mode, which requires all tasks in a same barrier stage start at the same time, and retry all tasks in case some tasks fail in the middle. The barrier execution mode is useful for some ML/DL workloads. The proposed API changes include: - `RDDBarrier` that marks an RDD as barrier (Spark must launch all the tasks together for the current stage). - `BarrierTaskContext` that support global sync of all tasks in a barrier stage, and provide extra `BarrierTaskInfo`s. In DAGScheduler, we retry all tasks of a barrier stage in case some tasks fail in the middle, this is achieved by unregistering map outputs for a shuffleId (for ShuffleMapStage) or clear the finished partitions in an active job (for ResultStage). ## How was this patch tested? Add `RDDBarrierSuite` to ensure we convert RDDs correctly; Add new test cases in `DAGSchedulerSuite` to ensure we do task scheduling correctly; Add new test cases in `SparkContextSuite` to ensure the barrier execution mode actually works (both under local mode and local cluster mode). Add new test cases in `TaskSchedulerImplSuite` to ensure we schedule tasks for barrier taskSet together. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #21758 from jiangxb1987/barrier-execution-mode.	2018-07-26 12:09:01 -07:00
maryannxue	5ed7660d14	[SPARK-24802][SQL][FOLLOW-UP] Add a new config for Optimization Rule Exclusion ## What changes were proposed in this pull request? This is an extension to the original PR, in which rule exclusion did not work for classes derived from Optimizer, e.g., SparkOptimizer. To solve this issue, Optimizer and its derived classes will define/override `defaultBatches` and `nonExcludableRules` in order to define its default rule set as well as rules that cannot be excluded by the SQL config. In the meantime, Optimizer's `batches` method is dedicated to the rule exclusion logic and is defined "final". ## How was this patch tested? Added UT. Author: maryannxue <maryannxue@apache.org> Closes #21876 from maryannxue/rule-exclusion.	2018-07-26 11:06:23 -07:00
Dongjoon Hyun	58353d7f4b	[SPARK-24924][SQL] Add mapping for built-in Avro data source ## What changes were proposed in this pull request? This PR aims to the followings. 1. Like `com.databricks.spark.csv` mapping, we had better map `com.databricks.spark.avro` to built-in Avro data source. 2. Remove incorrect error message, `Please find an Avro package at ...`. ## How was this patch tested? Pass the newly added tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21878 from dongjoon-hyun/SPARK-24924.	2018-07-26 16:11:03 +08:00
Takuya UESHIN	c9b233d414	[SPARK-24878][SQL] Fix reverse function for array type of primitive type containing null. ## What changes were proposed in this pull request? If we use `reverse` function for array type of primitive type containing `null` and the child array is `UnsafeArrayData`, the function returns a wrong result because `UnsafeArrayData` doesn't define the behavior of re-assignment, especially we can't set a valid value after we set `null`. ## How was this patch tested? Added some tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21830 from ueshin/issues/SPARK-24878/fix_reverse.	2018-07-26 15:06:13 +08:00
Xiao Li	d2e7deb59f	[SPARK-24867][SQL] Add AnalysisBarrier to DataFrameWriter ## What changes were proposed in this pull request? ```Scala val udf1 = udf({(x: Int, y: Int) => x + y}) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1($"a", udf1($"a", lit(10)))) df.cache() df.write.saveAsTable("t") ``` Cache is not being used because the plans do not match with the cached plan. This is a regression caused by the changes we made in AnalysisBarrier, since not all the Analyzer rules are idempotent. ## How was this patch tested? Added a test. Also found a bug in the DSV1 write path. This is not a regression. Thus, opened a separate JIRA https://issues.apache.org/jira/browse/SPARK-24869 Author: Xiao Li <gatorsmile@gmail.com> Closes #21821 from gatorsmile/testMaster22.	2018-07-25 17:22:37 -07:00
Koert Kuipers	17f469bc80	[SPARK-24860][SQL] Support setting of partitionOverWriteMode in output options for writing DataFrame ## What changes were proposed in this pull request? Besides spark setting spark.sql.sources.partitionOverwriteMode also allow setting partitionOverWriteMode per write ## How was this patch tested? Added unit test in InsertSuite Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Koert Kuipers <koert@tresata.com> Closes #21818 from koertkuipers/feat-partition-overwrite-mode-per-write.	2018-07-25 13:06:03 -07:00
mcheah	0c83f718ee	[SPARK-23146][K8S][TESTS] Enable client mode integration test. ## What changes were proposed in this pull request? Enable client mode integration test after merging from master. ## How was this patch tested? Check the integration test runs in the build. Author: mcheah <mcheah@palantir.com> Closes #21874 from mccheah/enable-client-mode-test.	2018-07-25 12:10:23 -07:00
Maxim Gekk	2f77616e1d	[SPARK-24849][SPARK-24911][SQL] Converting a value of StructType to a DDL string ## What changes were proposed in this pull request? In the PR, I propose to extend the `StructType`/`StructField` classes by new method `toDDL` which converts a value of the `StructType`/`StructField` type to a string formatted in DDL style. The resulted string can be used in a table creation. The `toDDL` method of `StructField` is reused in `SHOW CREATE TABLE`. In this way the PR fixes the bug of unquoted names of nested fields. ## How was this patch tested? I add a test for checking the new method and 2 round trip tests: `fromDDL` -> `toDDL` and `toDDL` -> `fromDDL` Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21803 from MaxGekk/to-ddl.	2018-07-25 11:09:12 -07:00
mcheah	571a6f0574	[SPARK-23146][K8S] Support client mode. ## What changes were proposed in this pull request? Support client mode for the Kubernetes scheduler. Client mode works more or less identically to cluster mode. However, in client mode, the Spark Context needs to be manually bootstrapped with certain properties which would have otherwise been set up by spark-submit in cluster mode. Specifically: - If the user doesn't provide a driver pod name, we don't add an owner reference. This is for usage when the driver is not running in a pod in the cluster. In such a case, the driver can only provide a best effort to clean up the executors when the driver exits, but cleaning up the resources is not guaranteed. The executor JVMs should exit if the driver JVM exits, but the pods will still remain in the cluster in a COMPLETED or FAILED state. - The user must provide a host (spark.driver.host) and port (spark.driver.port) that the executors can connect to. When using spark-submit in cluster mode, spark-submit generates the headless service automatically; in client mode, the user is responsible for setting up their own connectivity. We also change the authentication configuration prefixes for client mode. ## How was this patch tested? Adding an integration test to exercise client mode support. Author: mcheah <mcheah@palantir.com> Closes #21748 from mccheah/k8s-client-mode.	2018-07-25 11:08:41 -07:00
Gengliang Wang	c44eb561ec	[SPARK-24768][FOLLOWUP][SQL] Avro migration followup: change artifactId to spark-avro ## What changes were proposed in this pull request? After rethinking on the artifactId, I think it should be `spark-avro` instead of `spark-sql-avro`, which is simpler, and consistent with the previous artifactId. I think we need to change it before Spark 2.4 release. Also a tiny change: use `spark.sessionState.newHadoopConf()` to get the hadoop configuration, thus the related hadoop configurations in SQLConf will come into effect. ## How was this patch tested? Unit test Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21866 from gengliangwang/avro_followup.	2018-07-25 08:42:45 -07:00
Yuming Wang	7a5fd4a91e	[SPARK-18874][SQL][FOLLOW-UP] Improvement type mismatched message ## What changes were proposed in this pull request? Improvement `IN` predicate type mismatched message: ```sql Mismatched columns: [(, t, 4, ., `, t, 4, a, `, :, d, o, u, b, l, e, ,, , t, 5, ., `, t, 5, a, `, :, d, e, c, i, m, a, l, (, 1, 8, ,, 0, ), ), (, t, 4, ., `, t, 4, c, `, :, s, t, r, i, n, g, ,, , t, 5, ., `, t, 5, c, `, :, b, i, g, i, n, t, )] ``` After this patch: ```sql Mismatched columns: [(t4.`t4a`:double, t5.`t5a`:decimal(18,0)), (t4.`t4c`:string, t5.`t5c`:bigint)] ``` ## How was this patch tested? unit tests Author: Yuming Wang <yumwang@ebay.com> Closes #21863 from wangyum/SPARK-18874.	2018-07-24 23:59:13 -07:00
crafty-coder	78e0a725e0	[SPARK-19018][SQL] Add support for custom encoding on csv writer ## What changes were proposed in this pull request? Add support for custom encoding on csv writer, see https://issues.apache.org/jira/browse/SPARK-19018 ## How was this patch tested? Added two unit tests in CSVSuite Author: crafty-coder <carlospb86@gmail.com> Author: Carlos <crafty-coder@users.noreply.github.com> Closes #20949 from crafty-coder/master.	2018-07-25 14:17:20 +08:00
Dilip Biswal	afb0627536	[SPARK-23957][SQL] Sorts in subqueries are redundant and can be removed ## What changes were proposed in this pull request? Thanks to henryr for the original idea at https://github.com/apache/spark/pull/21049 Description from the original PR : Subqueries (at least in SQL) have 'bag of tuples' semantics. Ordering them is therefore redundant (unless combined with a limit). This patch removes the top sort operators from the subquery plans. This closes https://github.com/apache/spark/pull/21049. ## How was this patch tested? Added test cases in SubquerySuite to cover in, exists and scalar subqueries. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #21853 from dilipbiswal/SPARK-23957.	2018-07-24 20:46:27 -07:00
DB Tsai	d4c3415894	[SPARK-24890][SQL] Short circuiting the `if` condition when `trueValue` and `falseValue` are the same ## What changes were proposed in this pull request? When `trueValue` and `falseValue` are semantic equivalence, the condition expression in `if` can be removed to avoid extra computation in runtime. ## How was this patch tested? Test added. Author: DB Tsai <d_tsai@apple.com> Closes #21848 from dbtsai/short-circuit-if.	2018-07-24 20:21:11 -07:00
maryannxue	c26b092169	[SPARK-24891][SQL] Fix HandleNullInputsForUDF rule ## What changes were proposed in this pull request? The HandleNullInputsForUDF would always add a new `If` node every time it is applied. That would cause a difference between the same plan being analyzed once and being analyzed twice (or more), thus raising issues like plan not matched in the cache manager. The solution is to mark the arguments as null-checked, which is to add a "KnownNotNull" node above those arguments, when adding the UDF under an `If` node, because clearly the UDF will not be called when any of those arguments is null. ## How was this patch tested? Add new tests under sql/UDFSuite and AnalysisSuite. Author: maryannxue <maryannxue@apache.org> Closes #21851 from maryannxue/spark-24891.	2018-07-24 19:35:34 -07:00
Imran Rashid	15fff79032	[SPARK-24297][CORE] Fetch-to-disk by default for > 2gb Fetch-to-mem is guaranteed to fail if the message is bigger than 2 GB, so we might as well use fetch-to-disk in that case. The message includes some metadata in addition to the block data itself (in particular UploadBlock has a lot of metadata), so we leave a little room. Author: Imran Rashid <irashid@cloudera.com> Closes #21474 from squito/SPARK-24297.	2018-07-25 09:08:42 +08:00
shane knapp	3efdf35327	[SPARK-24908][R][STYLE] removing spaces to make lintr happy ## What changes were proposed in this pull request? during my travails in porting spark builds to run on our centos worker, i managed to recreate (as best i could) the centos environment on our new ubuntu-testing machine. while running my initial builds, lintr was crashing on some extraneous spaces in test_basic.R (see: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console) after removing those spaces, the ubuntu build happily passed the lintr tests. ## How was this patch tested? i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build (see https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/), which scp'ed a copy of test_basic.R in to the repo after the git clone. everything seems to be working happily. Author: shane knapp <incomplete@gmail.com> Closes #21864 from shaneknapp/fixing-R-lint-spacing.	2018-07-24 16:13:57 -07:00
Eric Chang	fc21f192a3	[SPARK-24895] Remove spotbugs plugin ## What changes were proposed in this pull request? Spotbugs maven plugin was a recently added plugin before 2.4.0 snapshot artifacts were broken. To ensure it does not affect the maven deploy plugin, this change removes it. ## How was this patch tested? Local build was ran, but this patch will be actually tested by monitoring the apache repo artifacts and making sure metadata is correctly uploaded after this job is ran: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ Author: Eric Chang <eric.chang@databricks.com> Closes #21865 from ericfchang/SPARK-24895.	2018-07-24 15:53:50 -07:00
s71955	d4a277f0ce	[SPARK-24812][SQL] Last Access Time in the table description is not valid ## What changes were proposed in this pull request? Last Access Time will always displayed wrong date Thu Jan 01 05:30:00 IST 1970 when user run DESC FORMATTED table command In hive its displayed as "UNKNOWN" which makes more sense than displaying wrong date. seems to be a limitation as of now even from hive, better we can follow the hive behavior unless the limitation has been resolved from hive. spark client output ![spark_desc table](https://user-images.githubusercontent.com/12999161/42753448-ddeea66a-88a5-11e8-94aa-ef8d017f94c5.png) Hive client output ![hive_behaviour](https://user-images.githubusercontent.com/12999161/42753489-f4fd366e-88a5-11e8-83b0-0f3a53ce83dd.png) ## How was this patch tested? UT has been added which makes sure that the wrong date "Thu Jan 01 05:30:00 IST 1970 " shall not be added as value for the Last Access property Author: s71955 <sujithchacko.2010@gmail.com> Closes #21775 from sujith71955/master_hive.	2018-07-24 11:31:27 -07:00
Ryan Blue	9d27541a85	[SPARK-23325] Use InternalRow when reading with DataSourceV2. ## What changes were proposed in this pull request? This updates the DataSourceV2 API to use InternalRow instead of Row for the default case with no scan mix-ins. Support for readers that produce Row is added through SupportsDeprecatedScanRow, which matches the previous API. Readers that used Row now implement this class and should be migrated to InternalRow. Readers that previously implemented SupportsScanUnsafeRow have been migrated to use no SupportsScan mix-ins and produce InternalRow. ## How was this patch tested? This uses existing tests. Author: Ryan Blue <blue@apache.org> Closes #21118 from rdblue/SPARK-23325-datasource-v2-internal-row.	2018-07-24 10:46:36 -07:00
hyukjinkwon	3d5c61e5fd	[SPARK-22499][FOLLOWUP][SQL] Reduce input string expressions for Least and Greatest to reduce time in its test ## What changes were proposed in this pull request? It's minor and trivial but looks 2000 input is good enough to reproduce and test in SPARK-22499. ## How was this patch tested? Manually brought the change and tested. Locally tested: Before: 3m 21s 288ms After: 1m 29s 134ms Given the latest successful build took: ``` ArithmeticExpressionSuite: - SPARK-22499: Least and greatest should not generate codes beyond 64KB (7 minutes, 49 seconds) ``` I expect it's going to save 4ish mins. Author: hyukjinkwon <gurwls223@apache.org> Closes #21855 from HyukjinKwon/minor-fix-suite.	2018-07-24 19:51:09 +08:00
10129659	13a67b070d	[SPARK-24870][SQL] Cache can't work normally if there are case letters in SQL ## What changes were proposed in this pull request? Modified the canonicalized to not case-insensitive. Before the PR, cache can't work normally if there are case letters in SQL, for example: sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive") sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " + "from src group by key").cache().createOrReplaceTempView("src_cache") sql( s"""select a.key from (select key from src_cache where positiveNum = 1)a left join (select key from src_cache )b on a.key=b.key """).explain The physical plan of the sql is: ![image](https://user-images.githubusercontent.com/26834091/42979518-3decf0fa-8c05-11e8-9837-d5e4c334cb1f.png) The subquery "select key from src_cache where positiveNum = 1" on the left of join can use the cache data, but the subquery "select key from src_cache" on the right of join cannot use the cache data. ## How was this patch tested? new added test Author: 10129659 <chen.yanshan@zte.com.cn> Closes #21823 from eatoncys/canonicalized.	2018-07-23 23:05:08 -07:00
“attilapiros”	d2436a8529	[SPARK-24594][YARN] Introducing metrics for YARN ## What changes were proposed in this pull request? In this PR metrics are introduced for YARN. As up to now there was no metrics in the YARN module a new metric system is created with the name "applicationMaster". To support both client and cluster mode the metric system lifecycle is bound to the AM. ## How was this patch tested? Both client and cluster mode was tested manually. Before the test on one of the YARN node spark-core was removed to cause the allocation failure. Spark was started as (in case of client mode): ``` spark2-submit \ --class org.apache.spark.examples.SparkPi \ --conf "spark.yarn.blacklist.executor.launch.blacklisting.enabled=true" --conf "spark.blacklist.application.maxFailedExecutorsPerNode=2" --conf "spark.dynamicAllocation.enabled=true" --conf "spark.metrics.conf.*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink" \ --master yarn \ --deploy-mode client \ original-spark-examples_2.11-2.4.0-SNAPSHOT.jar \ 1000 ``` In both cases the YARN logs contained the new metrics as: ``` $ yarn logs --applicationId application_1529926424933_0015 ... -- Gauges ---------------------------------------------------------------------- application_1531751594108_0046.applicationMaster.numContainersPendingAllocate value = 0 application_1531751594108_0046.applicationMaster.numExecutorsFailed value = 3 application_1531751594108_0046.applicationMaster.numExecutorsRunning value = 9 application_1531751594108_0046.applicationMaster.numLocalityAwareTasks value = 0 application_1531751594108_0046.applicationMaster.numReleasedContainers value = 0 ... ``` Author: “attilapiros” <piros.attila.zsolt@gmail.com> Author: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com> Closes #21635 from attilapiros/SPARK-24594.	2018-07-24 09:33:10 +08:00
Yuanjian Li	cfc3e1aaa4	[SPARK-24339][SQL] Prunes the unused columns from child of ScriptTransformation ## What changes were proposed in this pull request? Modify the strategy in ColumnPruning to add a Project between ScriptTransformation and its child, this strategy can reduce the scan time especially in the scenario of the table has many columns. ## How was this patch tested? Add UT in ColumnPruningSuite and ScriptTransformationSuite. Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21839 from xuanyuanking/SPARK-24339.	2018-07-23 13:04:39 -07:00
Tathagata Das	61f0ca4f1c	[SPARK-24699][SS] Make watermarks work with Trigger.Once by saving updated watermark to commit log ## What changes were proposed in this pull request? Streaming queries with watermarks do not work with Trigger.Once because of the following. - Watermark is updated in the driver memory after a batch completes, but it is persisted to checkpoint (in the offset log) only when the next batch is planned - In trigger.once, the query terminated as soon as one batch has completed. Hence, the updated watermark is never persisted anywhere. The simple solution is to persist the updated watermark value in the commit log when a batch is marked as completed. Then the next batch, in the next trigger.once run can pick it up from the commit log. ## How was this patch tested? new unit tests Co-authored-by: Tathagata Das <tathagata.das1565gmail.com> Co-authored-by: c-horn <chorn4033gmail.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21746 from tdas/SPARK-24699.	2018-07-23 13:03:32 -07:00

1 2 3 4 5 ...

22359 commits