Commit graph

Dongjoon Hyun e946104c42 [SPARK-29400][CORE] Improve PrometheusResource to use labels
### What changes were proposed in this pull request?

[SPARK-29064](https://github.com/apache/spark/pull/25770) introduced `PrometheusResource` to expose `ExecutorSummary`. This PR aims to make it more `Prometheus`-friendly by using [Prometheus labels](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).

### Why are the changes needed?

**BEFORE**
```
metrics_app_20191008151432_0000_driver_executor_rddBlocks_Count 0
metrics_app_20191008151432_0000_driver_executor_memoryUsed_Count 0
metrics_app_20191008151432_0000_driver_executor_diskUsed_Count 0
```

**AFTER**
```
$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
```

### Does this PR introduce any user-facing change?

No, but `Prometheus` understands the new format and displays it more intelligently.

<img width="735" alt="ui" src="https://user-images.githubusercontent.com/9700541/66438279-1756f900-e9e1-11e9-91c7-c04c6ce9172f.png">

### How was this patch tested?

Manually.

**SETUP**
```
$ sbin/start-master.sh
$ sbin/start-slave.sh spark://`hostname`:7077
$ bin/spark-shell --master spark://`hostname`:7077 --conf spark.ui.prometheus.enabled=true
```

**RESULT**
```
$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
```

Closes #26060 from dongjoon-hyun/SPARK-29400.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-10 08:47:12 -07:00
Gengliang Wang 6edabeb0ee [SPARK-28989][SQL][FOLLOWUP] Update ANSI mode related config names in comments
### What changes were proposed in this pull request?

Update ANSI mode related config names in code comments to `spark.sql.ansi.enabled`.

### Why are the changes needed?

The removed configurations `spark.sql.parser.ansi.enabled` and `spark.sql.failOnIntegralTypeOverflow` still appear in code comments.

### Does this PR introduce any user-facing change?

No
### How was this patch tested?

Grep the whole codebase to ensure the removed config names no longer exist.
```
git grep "parser.ansi.enabled"
git grep failOnIntegralTypeOverflow
git grep decimalOperationsNullOnOverflow
git grep ANSI_SQL_PARSER
git grep FAIL_ON_INTEGRAL_TYPE_OVERFLOW
git grep DECIMAL_OPERATIONS_NULL_ON_OVERFLOW
```

Closes #26067 from gengliangwang/spark-28989-followup.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 19:17:54 -07:00
HyukjinKwon 7ba16ffbb9 [SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now
### What changes were proposed in this pull request?

This PR proposes to use Arrow R 0.14.1 in AppVeyor for now, to make the tests pass.

### Why are the changes needed?

To make the build pass with Arrow. Setting `ARROW_PRE_0_15_IPC_FORMAT` to `1` to allow Arrow R 0.15 compatibility does not work.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

AppVeyor

Closes #26041 from HyukjinKwon/investigate.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-10 09:01:36 +09:00
Sean Owen cc7493fa21 [SPARK-29416][CORE][ML][SQL][MESOS][TESTS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13)
### What changes were proposed in this pull request?

Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
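
A minimal sketch of the substitution (the arrays here are illustrative, not taken from Spark's tests):

```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// Before: a.deep == b.deep   (ArrayOps.deep no longer exists in Scala 2.13)
// After: element-wise comparison works in both 2.12 and 2.13 for non-nested arrays.
assert(a.sameElements(b))
```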

### Why are the changes needed?

To compile with 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests.

Closes #26073 from srowen/SPARK-29416.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 17:00:48 -07:00
Sean Owen 4d93fb7dfa [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
### What changes were proposed in this pull request?

Rewrite declaration of internal `ThreadUtils.parmap` method to avoid `TraversableLike`, which is removed in Scala 2.13.
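
As a rough illustration only (this is not Spark's actual `parmap` signature), a parallel map can be declared over `Seq` so no 2.13-removed collection abstraction appears in its type:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical sketch: the input is a plain Seq[I] instead of TraversableLike.
def parmapSketch[I, O](in: Seq[I])(f: I => O)(implicit ec: ExecutionContext): Seq[O] = {
  val futures = in.map(i => Future(f(i)))
  Await.result(Future.sequence(futures), Duration.Inf)
}
```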

### Why are the changes needed?

To compile in Scala 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests.

Closes #26072 from srowen/SPARK-29413.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 10:53:51 -07:00
Sean Owen 3b0bca42ac [SPARK-29401][FOLLOWUP] Additional cases where a .parallelize call with Array is ambiguous in 2.13
This is just a followup on https://github.com/apache/spark/pull/26062 -- see it for more detail.

I think we will eventually find more cases of this. It's hard to get them all at once as there are many different types of compile errors in earlier modules. I'm trying to address them in as big a chunk as possible.

Closes #26074 from srowen/SPARK-29401.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 10:27:05 -07:00
Sean Owen fa95a5c395 [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13
### What changes were proposed in this pull request?

Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.
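
A minimal sketch of the substitution (the `noop` function is hypothetical, not a method from Spark):

```scala
// Before: referring to the Unit companion object as a value, e.g.
//   def noop(): Unit = Unit
// which, per this PR, no longer compiles in Scala 2.13.
// After: use the unit value () instead.
def noop(): Unit = ()
```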

### Why are the changes needed?

It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30

### Does this PR introduce any user-facing change?

Should be no behavior change at all.

### How was this patch tested?

Existing tests.

Closes #26070 from srowen/SPARK-29411.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 10:24:13 -07:00
herman ba4d413fc9 [SPARK-29346][SQL] Add Aggregating Accumulator
### What changes were proposed in this pull request?
This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions which can be computed at the same time.

The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark. It uses projections and manual updates for each of the aggregation steps (initialize buffer, update buffer with new input row, merge two buffers and compute the final result on the buffer). Note that two of the steps (update and merge) use the aggregation buffer both as input and output.
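
As a rough sketch of the buffer life cycle described above (a deliberately simplified `AccumulatorV2`, not the actual `AggregatingAccumulator`):

```scala
import org.apache.spark.util.AccumulatorV2

// Simplified illustration: a sum/count buffer following the same
// initialize / update / merge / final-result steps.
class SumCountAccumulator extends AccumulatorV2[Long, (Long, Long)] {
  private var sum = 0L     // "initialize buffer"
  private var count = 0L
  override def isZero: Boolean = sum == 0L && count == 0L
  override def copy(): SumCountAccumulator = {
    val acc = new SumCountAccumulator
    acc.sum = sum
    acc.count = count
    acc
  }
  override def reset(): Unit = { sum = 0L; count = 0L }
  override def add(v: Long): Unit = { sum += v; count += 1 }   // "update buffer with new input row"
  override def merge(other: AccumulatorV2[Long, (Long, Long)]): Unit = other match {
    case o: SumCountAccumulator => sum += o.sum; count += o.count   // "merge two buffers"
  }
  override def value: (Long, Long) = (sum, count)              // "compute the final result"
}
```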

Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of being serialized first. The merging logic for `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers; this assumption is violated by the accumulator's implicit serialization. To get around this, I have added a `mergeBuffersObjects` method to `TypedImperativeAggregate` that merges two unserialized buffers.

### Why are the changes needed?
This is the mechanism we are going to use to implement observable metrics.

### Does this PR introduce any user-facing change?
No, not yet.

### How was this patch tested?
Added `AggregatingAccumulator` test suite.

Closes #26012 from hvanhovell/SPARK-29346.

Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2019-10-09 16:05:14 +02:00
Maxim Gekk c97b3ed279 [SPARK-24640][SQL][FOLLOWUP] Update the SQL migration guide about size(NULL)
### What changes were proposed in this pull request?
The commit 4e6d31f570 changed default behavior of `size()` for the `NULL` input. In this PR, I propose to update the SQL migration guide.

### Why are the changes needed?
To inform users about new behavior of the `size()` function for the `NULL` input.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26066 from MaxGekk/size-null-migration-guide.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-09 16:37:35 +08:00
Terry Kim a927f1aefc [SPARK-29373][SQL] DataSourceV2: Commands should not submit a spark job
### What changes were proposed in this pull request?
DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all extend LeafExecNode. This results in running a job when executeCollect() is called. This breaks the previous behavior [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650).

A new command physical operator will be introduced, from which all V2 Exec classes derive, to avoid running a job.
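
Conceptually (the names below are illustrative only, not the actual operator added by this PR), such a command node computes its result on the driver and overrides `executeCollect`, so collecting it does not launch a job:

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sketch of the idea: rows are produced locally on the driver,
// so executeCollect never materializes an RDD and therefore never submits a job.
trait LocalCommandLikeSketch {
  protected def run(): Seq[InternalRow]
  def executeCollect(): Array[InternalRow] = run().toArray
}
```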

### Why are the changes needed?

It is a bug since the current behavior runs a spark job, which breaks the existing behavior: [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650).

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #26048 from imback82/dsv2_command.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-09 11:44:25 +08:00
Sean Owen ee83d09b53 [SPARK-29401][CORE][ML][SQL][GRAPHX][TESTS] Replace calls to .parallelize Arrays of tuples, ambiguous in Scala 2.13, with Seqs of tuples
### What changes were proposed in this pull request?

Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like:
```
[ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives:
  (x: Unit,xs: Unit*)Array[Unit] <and>
  (x: Double,xs: Double*)Array[Double] <and>
  (x: Float,xs: Float*)Array[Float] <and>
  (x: Long,xs: Long*)Array[Long] <and>
  (x: Int,xs: Int*)Array[Int] <and>
  (x: Char,xs: Char*)Array[Char] <and>
  (x: Short,xs: Short*)Array[Short] <and>
  (x: Byte,xs: Byte*)Array[Byte] <and>
  (x: Boolean,xs: Boolean*)Array[Boolean]
 cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
```
Using a `Seq` instead appears to resolve it, and is effectively equivalent.
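
A minimal sketch of the substitution (assuming a `SparkContext` named `sc`, as in `spark-shell`; the values are illustrative):

```scala
// Before (ambiguous Array.apply overload in 2.13):
// val rdd = sc.parallelize(Array((1, 2), (3, 4)))

// After: Seq.apply is not overloaded per primitive type, so inference is
// unambiguous and the resulting RDD is the same.
val rdd = sc.parallelize(Seq((1, 2), (3, 4)))
```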

### Why are the changes needed?

To better cross-build for 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests.

Closes #26062 from srowen/SPARK-29401.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 20:22:02 -07:00
Sean Owen 2d871ad0e7 [SPARK-29392][CORE][SQL][STREAMING] Remove symbol literal syntax 'foo, deprecated in Scala 2.13, in favor of Symbol("foo")
### What changes were proposed in this pull request?

Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")`
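
For example (a plain Scala illustration, not a line taken from Spark):

```scala
// Before (deprecated in Scala 2.13):
// val col = 'foo
// After: the explicit factory produces the same Symbol.
val col = Symbol("foo")
```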

### Why are the changes needed?

Avoids ~50 deprecation warnings when attempting to build with 2.13.

### Does this PR introduce any user-facing change?

None, should be no functional change at all.

### How was this patch tested?

Existing tests.

Closes #26061 from srowen/SPARK-29392.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 20:15:37 -07:00
sandeep katta 69b0cc1962 [SPARK-28797][DOC] Document DROP FUNCTION statement in SQL Reference
### What changes were proposed in this pull request?
Add DROP FUNCTION sql description in SQL reference

### Why are the changes needed?
Currently Spark does not have a complete SQL guide, so it is better to document all the SQL commands; this JIRA is a sub-task of that effort.

### Does this PR introduce any user-facing change?
Yes. Before this change, users could not find any reference for the DROP FUNCTION command in the Spark docs.

After Fix:
![image](https://user-images.githubusercontent.com/35216143/66134570-240cd300-e616-11e9-9c78-259c0d355378.png)

![image](https://user-images.githubusercontent.com/35216143/65397825-d059e880-ddd0-11e9-8bd3-a65ccae56063.png)

![image](https://user-images.githubusercontent.com/35216143/66404731-9f032e80-ea06-11e9-8fef-1e266efa4c66.png)

### How was this patch tested?
tested with jekyll build

Closes #25553 from sandeep-katta/28797.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-08 19:47:39 -05:00
gwang3 b3eba29493 [SPARK-29189][FOLLOW-UP][SQL] Beautify config name
### What changes were proposed in this pull request?
Beautify comment

### Why are the changes needed?
The config name now is pretty weird.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
No test.

Closes #26054 from wangshisan/SPARK-29189.

Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 15:44:42 -07:00
Imran Rashid 0da667d314 [SPARK-28917][CORE] Synchronize access to RDD mutable state
RDD dependencies and partitions can be simultaneously
accessed and mutated by user threads and Spark's scheduler threads, so
access must be thread-safe.  In particular, as partitions and
dependencies are lazily initialized, before this change they could get
initialized multiple times, which would lead to the scheduler having an
inconsistent view of the pending stages and getting stuck.
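
A hedged sketch of the hazard being fixed (illustrative class, not Spark's RDD code): a lazily initialized field read from multiple threads must be initialized under synchronization so it is computed exactly once.

```scala
class LazyPartitions(compute: () => Array[Int]) {
  @volatile private var cached: Array[Int] = _

  def partitions: Array[Int] = {
    if (cached == null) {
      synchronized {
        // Re-check under the lock so two racing threads don't both initialize.
        if (cached == null) cached = compute()
      }
    }
    cached
  }
}
```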

Tested with existing unit tests.

Closes #25951 from squito/SPARK-28917.

Authored-by: Imran Rashid <irashid@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-08 11:35:54 -07:00
Guilherme de360e96d7 [SPARK-29336][SQL] Fix the implementation of QuantileSummaries.merge (guarantee that the relativeError will be respected)
### What changes were proposed in this pull request?

Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test-case showing the previous bug.

### Why are the changes needed?

The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the results of multiple parallel QuantileSummaries. The current implementation violates some invariants and therefore the effective error can be larger than specified.

### Does this PR introduce any user-facing change?

Yes, for some cases the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example:

```scala
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray
for (n <- 0 until 5) {
  var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}
```

In the current build it returns:

```
16
12
10
11
17
```

I couldn't run the code with this patch applied to double-check the implementation. Can someone please confirm it now outputs at most `10`?

### How was this patch tested?

A new unit test was added to uncover the previous bug.

Closes #26029 from sitegui/SPARK-29336.

Authored-by: Guilherme <sitegui@sitegui.com.br>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-08 08:11:10 -05:00
Maxim Gekk 4e6d31f570 [SPARK-24640][SQL] Return NULL from size(NULL) by default
### What changes were proposed in this pull request?
Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes behavior of the `size()` function for `NULL`. The function will return `NULL` for `NULL` instead of `-1`.

### Why are the changes needed?
There is the agreement in the PR https://github.com/apache/spark/pull/21598#issuecomment-399695523 to change behavior in Spark 3.0.

### Does this PR introduce any user-facing change?
Yes.
Before:
```sql
spark-sql> select size(NULL);
-1
```
After:
```sql
spark-sql> select size(NULL);
NULL
```

### How was this patch tested?
By the `check outputs of expression examples` test in `SQLQuerySuite` which runs expression examples.

Closes #26051 from MaxGekk/sizeof-null-returns-null.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-08 20:57:10 +09:00
Dilip Biswal ef1e8495ba [SPARK-29366][SQL] Subqueries created for DPP are not printed in EXPLAIN FORMATTED
### What changes were proposed in this pull request?
The subquery expressions introduced by DPP are not printed in the newer explain command.
This PR fixes the code that computes the list of subqueries in the plan.

**SQL**
df1 and df2 are partitioned on k.
```
SELECT df1.id, df2.k
FROM df1 JOIN df2 ON df1.k = df2.k AND df2.id < 2
```

**Before**
```
|== Physical Plan ==
* Project (9)
+- * BroadcastHashJoin Inner BuildRight (8)
   :- * ColumnarToRow (2)
   :  +- Scan parquet default.df1 (1)
   +- BroadcastExchange (7)
      +- * Project (6)
         +- * Filter (5)
            +- * ColumnarToRow (4)
               +- Scan parquet default.df2 (3)

(1) Scan parquet default.df1
Output: [id#19L, k#20L]

(2) ColumnarToRow [codegen id : 2]
Input: [id#19L, k#20L]

(3) Scan parquet default.df2
Output: [id#21L, k#22L]

(4) ColumnarToRow [codegen id : 1]
Input: [id#21L, k#22L]

(5) Filter [codegen id : 1]
Input     : [id#21L, k#22L]
Condition : (isnotnull(id#21L) AND (id#21L < 2))

(6) Project [codegen id : 1]
Output    : [k#22L]
Input     : [id#21L, k#22L]

(7) BroadcastExchange
Input: [k#22L]

(8) BroadcastHashJoin [codegen id : 2]
Left keys: List(k#20L)
Right keys: List(k#22L)
Join condition: None

(9) Project [codegen id : 2]
Output    : [id#19L, k#22L]
Input     : [id#19L, k#20L, k#22L]
```
**After**
```
|== Physical Plan ==
* Project (9)
+- * BroadcastHashJoin Inner BuildRight (8)
   :- * ColumnarToRow (2)
   :  +- Scan parquet default.df1 (1)
   +- BroadcastExchange (7)
      +- * Project (6)
         +- * Filter (5)
            +- * ColumnarToRow (4)
               +- Scan parquet default.df2 (3)

(1) Scan parquet default.df1
Output: [id#19L, k#20L]

(2) ColumnarToRow [codegen id : 2]
Input: [id#19L, k#20L]

(3) Scan parquet default.df2
Output: [id#21L, k#22L]

(4) ColumnarToRow [codegen id : 1]
Input: [id#21L, k#22L]

(5) Filter [codegen id : 1]
Input     : [id#21L, k#22L]
Condition : (isnotnull(id#21L) AND (id#21L < 2))

(6) Project [codegen id : 1]
Output    : [k#22L]
Input     : [id#21L, k#22L]

(7) BroadcastExchange
Input: [k#22L]

(8) BroadcastHashJoin [codegen id : 2]
Left keys: List(k#20L)
Right keys: List(k#22L)
Join condition: None

(9) Project [codegen id : 2]
Output    : [id#19L, k#22L]
Input     : [id#19L, k#20L, k#22L]

===== Subqueries =====

Subquery:1 Hosting operator id = 1 Hosting Expression = k#20L IN subquery25
* HashAggregate (16)
+- Exchange (15)
   +- * HashAggregate (14)
      +- * Project (13)
         +- * Filter (12)
            +- * ColumnarToRow (11)
               +- Scan parquet default.df2 (10)

(10) Scan parquet default.df2
Output: [id#21L, k#22L]

(11) ColumnarToRow [codegen id : 1]
Input: [id#21L, k#22L]

(12) Filter [codegen id : 1]
Input     : [id#21L, k#22L]
Condition : (isnotnull(id#21L) AND (id#21L < 2))

(13) Project [codegen id : 1]
Output    : [k#22L]
Input     : [id#21L, k#22L]

(14) HashAggregate [codegen id : 1]
Input: [k#22L]

(15) Exchange
Input: [k#22L]

(16) HashAggregate [codegen id : 2]
Input: [k#22L]
```
### Why are the changes needed?
Without the fix, the subqueries are not printed in the explain plan.

### Does this PR introduce any user-facing change?
Yes. the explain output will be different.

### How was this patch tested?
Added a test case in ExplainSuite.

Closes #26039 from dilipbiswal/explain_subquery_issue.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-10-07 23:39:05 -07:00
Wenchen Fan 948a6e80fe [SPARK-28892][SQL][FOLLOWUP] add resolved logical plan for UPDATE TABLE
### What changes were proposed in this pull request?

Add back the resolved logical plan for UPDATE TABLE. It was in https://github.com/apache/spark/pull/25626 before but was removed later.

### Why are the changes needed?

In https://github.com/apache/spark/pull/25626, we decided not to add the update API in DS v2, but we still want to implement UPDATE for built-in sources like JDBC. We should at least add the resolved logical plan.

### Does this PR introduce any user-facing change?

no, UPDATE is still not supported yet.

### How was this patch tested?

new tests.

Closes #26025 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-10-07 23:36:26 -07:00
Huaxin Gao ffddfc8584 [SPARK-29269][PYTHON][ML] Pyspark ALSModel support getters/setters
### What changes were proposed in this pull request?
Add getters/setters in Pyspark ALSModel.

### Why are the changes needed?
To keep parity between python and scala.

### Does this PR introduce any user-facing change?
Yes.
add the following getters/setters to ALSModel
```
get/setUserCol
get/setItemCol
get/setColdStartStrategy
get/setPredictionCol
```

### How was this patch tested?
add doctest

Closes #25947 from huaxingao/spark-29269.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-08 14:05:09 +08:00
gengjiaan 7d80aa553a [MINOR][BUILD] Fix an incorrect path in license file
### What changes were proposed in this pull request?
The `LICENSE` file has a minor issue: an incorrect path.
This PR fixes it.

### Why are the changes needed?
This is a minor bug.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #26050 from beliefer/resolve-minor-license-issue.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-08 14:33:03 +09:00
Dongjoon Hyun cb501771fa [SPARK-25668][SQL][TESTS] Refactor TPCDSQueryBenchmark to use main method
### What changes were proposed in this pull request?

This PR aims at the following:
- Refactor `TPCDSQueryBenchmark` to use main method to improve the usability.
- Reduce the number of iterations from 5 to 2 because it takes too long. (2 is okay because we have the `Stdev` field now; if there is an irregular run, we can notice it easily.)
- Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for the other scale factors, too.)
  - AWS EC2 `r3.xlarge` with `ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1)` is used.

This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following.
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone git@github.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  // This need `Spark 2.4`
```

### Why are the changes needed?

Although the generated TPCDS data is random, we can keep the record.

### Does this PR introduce any user-facing change?

No. (This is dev-only test benchmark).

### How was this patch tested?

Manually run the benchmark. Please note that you need to have TPCDS data.
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1"
```

Closes #26049 from dongjoon-hyun/SPARK-25668.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-08 13:33:42 +09:00
Xingbo Jiang 56a3bebb1b [SPARK-27492][DOC][FOLLOWUP] Update resource scheduling user docs
### What changes were proposed in this pull request?

Fix a config name typo in the resource scheduling user docs. Since users might get confused by the wrong config name, we'd better fix this typo.

### How was this patch tested?

Document change, no need to run test.

Closes #26047 from jiangxb1987/doc.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-07 16:21:39 -07:00
Marcelo Vanzin d2f21b0199 [SPARK-27468][CORE] Track correct storage level of RDDs and partitions
Previously, the RDD level would change depending on the status reported
by executors for the block they were storing, and individual blocks would
reflect that. That is wrong because different blocks may be stored differently
in different executors.

So now the RDD tracks the user-provided storage level, while the individual
partitions reflect the current storage level of that particular block,
including the current number of replicas.

Closes #25779 from vanzin/SPARK-27468.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-10-07 16:07:00 -05:00
gwang3 64fe82b519 [SPARK-29189][SQL] Add an option to ignore block locations when listing file
### What changes were proposed in this pull request?
In our production environment we have a pure Spark cluster, where computation is separated from the storage layer; I think this is also pretty common. In such a deploy mode, data locality is never reachable.
There are some configurations in the Spark scheduler to reduce the waiting time for data locality (e.g. "spark.locality.wait"). The problem is that, in the file listing phase, the location information of all the files, with all the blocks inside each file, is fetched from the distributed file system. In a production environment, a table can be so huge that even fetching all this location information can take tens of seconds.
To improve this scenario, Spark needs to provide an option to ignore data locality entirely: all we need in the file listing phase are the file locations, without any block location information.

### Why are the changes needed?
We ran a benchmark in our production environment; after ignoring the block locations, we got a pretty huge improvement.

Table Size | Total File Number | Total Block Number | List File Duration (With Block Location) | List File Duration (Without Block Location)
-- | -- | -- | -- | --
22.6 T | 30000 | 120000 | 16.841s | 1.730s
28.8 T | 42001 | 148964 | 10.099s | 2.858s
3.4 T | 20000 | 20000 | 5.833s | 4.881s

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Via unit tests.

Closes #25869 from wangshisan/SPARK-29189.

Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-10-07 14:52:55 -05:00
Huaxin Gao f0534fb9e5 [SPARK-28816][DOC][SQL] Document ADD JAR statement in SQL Reference
### What changes were proposed in this pull request?
document ADD JAR statement in SQL Reference

### Why are the changes needed?
To complete SQL reference

### Does this PR introduce any user-facing change?
yes

after change:
![image](https://user-images.githubusercontent.com/13592258/66337691-80147780-e8f4-11e9-9d7c-7c1e7ff5379a.png)

![image](https://user-images.githubusercontent.com/13592258/66337704-860a5880-e8f4-11e9-93fa-789695de29d7.png)

![image](https://user-images.githubusercontent.com/13592258/66337721-8b67a300-e8f4-11e9-9056-998187a16c7b.png)

![image](https://user-images.githubusercontent.com/13592258/66337736-928eb100-e8f4-11e9-91c5-d8935a7b93b5.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25895 from huaxingao/spark_28816.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 13:39:03 -05:00
Maxim Gekk b10344956d [SPARK-29342][SQL] Make casting of string values to intervals case insensitive
### What changes were proposed in this pull request?

In the PR, I propose to pass the `Pattern.CASE_INSENSITIVE` flag while compiling interval patterns in `CalendarInterval`. This makes casting string values to intervals case insensitive and tolerant to case of the `interval`, `year(s)`, `month(s)`, `week(s)`, `day(s)`, `hour(s)`, `minute(s)`, `second(s)`, `millisecond(s)` and `microsecond(s)`.
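
A minimal illustration of the flag in question (the pattern below is illustrative, not the actual one in `CalendarInterval`):

```scala
import java.util.regex.Pattern

val p = Pattern.compile("interval\\s+(\\d+)\\s+day(s)?", Pattern.CASE_INSENSITIVE)
// Without CASE_INSENSITIVE this lower-case pattern would not match; with it,
// upper-case input is accepted.
assert(p.matcher("INTERVAL 10 DAYS").matches())
```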

### Why are the changes needed?
There are at least 2 reasons:
- To maintain feature parity with PostgreSQL which is not sensitive to case:
```sql
 # select cast('10 Days' as INTERVAL);
 interval
----------
 10 days
(1 row)
```
- Spark is tolerant to case of interval literals. Case insensitivity in casting should be convenient for Spark users.
```sql
spark-sql> SELECT INTERVAL 1 YEAR 1 WEEK;
interval 1 years 1 weeks
```

### Does this PR introduce any user-facing change?
Yes, the current implementation produces `NULL` for `interval`, `year`, ..., `microsecond` units that are not in lower case.
Before:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
NULL
```
After:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
interval 1 weeks 3 days
```

### How was this patch tested?
- by new tests in `CalendarIntervalSuite.java`
- new test in `CastSuite`

Closes #26010 from MaxGekk/interval-case-insensitive.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-07 09:33:01 -07:00
Huaxin Gao 2399134456 [SPARK-29143][PYTHON][ML] Pyspark feature models support column setters/getters
### What changes were proposed in this pull request?
add column setters/getters support in Pyspark feature models

### Why are the changes needed?
keep parity between Pyspark and Scala

### Does this PR introduce any user-facing change?
Yes.
After the change, Pyspark feature models have column setters/getters support.

### How was this patch tested?
Add some doctests

Closes #25908 from huaxingao/spark-29143.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 10:55:48 -05:00
Huaxin Gao bd213a0850 [SPARK-29360][PYTHON][ML] PySpark FPGrowthModel supports getter/setter
### What changes were proposed in this pull request?

### Why are the changes needed?
Keep parity between Scala and Python

### Does this PR introduce any user-facing change?
Yes. It adds the following getters/setters to FPGrowthModel:
```
getMinSupport
getNumPartitions
getMinConfidence
getItemsCol
getPredictionCol
setItemsCol
setMinConfidence
setPredictionCol
```
add following getters/setters to PrefixSpan
```
set/getMinSupport
set/getMaxPatternLength
set/getMaxLocalProjDBSize
set/getSequenceCol
```

### How was this patch tested?
add doctest

Closes #26035 from huaxingao/spark-29360.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 10:53:59 -05:00
Liang-Chi Hsieh ea8b5df474 [SPARK-28938][K8S] Move to supported OpenJDK docker image for Kubernetes
### What changes were proposed in this pull request?

The current docker image used by Kubernetes is `openjdk:8-alpine`. It was not supported and  was removed with the commit 3eb0351b20 (diff-f95ffa3d1377774732c33f7b8368e099).

This PR proposes to move to a supported docker image.

### Why are the changes needed?

I think there are at least two reasons:

1. According to the commit, Alpine/musl is not officially supported by the OpenJDK project.
2. As there are no more OpenJDK 8 Alpine images, new JDK updates, including security fixes, are not applied to it. See below:

```
docker run -it --rm openjdk:8-alpine java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (Alpine 8.212.04-r0)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)
```
```
docker run -it --rm openjdk:8-jdk-slim java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
```

### Does this PR introduce any user-facing change?

Yes. This changes the base docker image of Spark.

### How was this patch tested?

Existing tests.

Closes #26037 from viirya/SPARK-28938.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-07 08:52:35 -07:00
Maxim Gekk 18b7ad2fc5 [SPARK-29328][SQL] Fix calculation of mean seconds per month
### What changes were proposed in this pull request?
I introduced the new constants `SECONDS_PER_MONTH` and `MILLIS_PER_MONTH`, and reused them in calculations of seconds/milliseconds per month. `SECONDS_PER_MONTH` is 2629746 because the average year of the Gregorian calendar is 365.2425 days long, i.e. 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year.
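
The arithmetic, spelled out with the values quoted above:

```scala
val daysPerYear = 365.2425                            // average Gregorian year
val secondsPerYear = 60 * 60 * 24 * daysPerYear       // 31556952.0
val SECONDS_PER_MONTH = (secondsPerYear / 12).toLong  // 2629746
val MILLIS_PER_MONTH = SECONDS_PER_MONTH * 1000L      // 2629746000
```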

### Why are the changes needed?
Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651), in which the average year is 365.2425 days long (see https://en.wikipedia.org/wiki/Gregorian_calendar), but the existing implementation assumes 31 days per month, or 12 * 31 = 372 days per year. That's far from the truth.

### Does this PR introduce any user-facing change?
Yes, the changes affect at least 3 methods in `GroupStateImpl`, `EventTimeWatermark` and `MonthsBetween`. For example, the `months_between()` function will return a different result in some cases.

Before:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.4516129
```
After:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.45996838
```

### How was this patch tested?
By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #25998 from MaxGekk/days-in-year.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 08:47:46 -05:00
Maxim Gekk 932e2619ce [SPARK-29365][SQL] Support dates and timestamps subtraction
### What changes were proposed in this pull request?
Added new rules to `TypeCoercion.DateTimeOperations` for the `Subtract` expression, which is replaced by the existing `TimestampDiff` expression if one of its parameters has the `DATE` type and the other one has the `TIMESTAMP` type. The date argument is cast to timestamp.

### Why are the changes needed?
- To maintain feature parity with PostgreSQL which supports subtraction of a date from a timestamp and a timestamp from a date:
```sql
maxim=# select timestamp'now' - date'epoch';
          ?column?
----------------------------
 18175 days 21:07:33.412875
(1 row)

maxim=# select date'2020-01-01' - timestamp'now';
        ?column?
-------------------------
 86 days 02:52:00.945296
(1 row)
```
- To conform to the SQL standard which defines datetime subtraction as an interval.

### Does this PR introduce any user-facing change?
Yes, currently the query below fails with the error:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
Error in query: cannot resolve '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' (timestamp and date).; line 1 pos 7;
'Project [unresolvedalias((1570385107234000 - 18170), None)]
+- OneRowRelation
```
after the changes:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
interval 5 days 21 hours 4 minutes 55 seconds 878 milliseconds
```

### How was this patch tested?
- Add new cases to the `rule for date/timestamp operations` test in `TypeCoercionSuite`
- by 2 new test in `datetime.sql`

Closes #26036 from MaxGekk/date-timestamp-subtract.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-07 16:47:00 +09:00
Huaxin Gao 5a512e86e9 [SPARK-28800][DOC][SQL] Document REPAIR TABLE statement in SQL Reference
### What changes were proposed in this pull request?
Document REPAIR TABLE statement in SQL Reference.

### Why are the changes needed?
To complete SQL reference.

### Does this PR introduce any user-facing change?
Yes.

After the change, we will have the following
![image](https://user-images.githubusercontent.com/13592258/66271480-461f7480-e813-11e9-9b40-cbffec1221ae.png)

![image](https://user-images.githubusercontent.com/13592258/66261968-4fb1c980-e78c-11e9-9db0-fcd6f458fd39.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25884 from huaxingao/spark-28800.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-06 11:19:13 -05:00
maruilei 77510c602a [SPARK-29233][K8S] Add regex expression checks for executorEnv…
### What changes were proposed in this pull request?

In kubernetes, there are some naming regular expression requirements and restrictions on environment variable names, such as:

- In kubernetes version release-1.7 and earlier, the naming rules of pod environment variable names should meet the requirements of the regular expression [`[A-Za-z_][A-Za-z0-9_]*`](https://github.com/kubernetes/kubernetes/blob/release-1.7/staging/src/k8s.io/apimachinery/pkg/util/validation/validation.go#L169)
- In kubernetes version release-1.8 and later, the naming rules of pod environment variable names should meet the requirements of the regular expression [`[-._a-zA-Z][-._a-zA-Z0-9]*`](https://github.com/kubernetes/kubernetes/blob/release-1.8/staging/src/k8s.io/apimachinery/pkg/util/validation/validation.go#L305)

However, in Spark on k8s mode, Spark should add restrictions on environment variable names when creating executorEnv.

In addition, we need to use regular expressions adapted to the newer versions of k8s to tighten the restrictions on the names of environment variables.

Otherwise, the pod will not be created properly and the spark application will be suspended.

To solve the problem above, a regex validation for executorEnv names is added, as sketched below.
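
A hedged sketch of such a check (the helper name is illustrative; the pattern mirrors the kubernetes release-1.8+ rule quoted above):

```scala
// Validate an executor environment variable name before building the pod spec.
def isValidEnvVarName(name: String): Boolean =
  name.matches("[-._a-zA-Z][-._a-zA-Z0-9]*")

isValidEnvVarName("SPARK_EXECUTOR_MEMORY")  // true
isValidEnvVarName("1_INVALID_NAME")         // false: cannot start with a digit
```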

### Why are the changes needed?

If no validation rules are added, the environment variable names that don't meet the requirements will cause the pod to not be created properly and the application will be suspended.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Add unit tests and manually run.

Closes #25920 from merrily01/SPARK-29233.

Authored-by: maruilei <maruilei@jd.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-06 09:41:11 -05:00
zero323 7c5db4515e [SPARK-29363][MLLIB] Make o.a.s.regression.Regressor public
### What changes were proposed in this pull request?

- Removal of `private[ml]` modifier from `Regressor`.
- Marking `Regressor` as  `DeveloperApi`.

### Why are the changes needed?
Consistency with the rest of ML API as described in [the corresponding JIRA ticket](https://issues.apache.org/jira/browse/SPARK-29363).

### Does this PR introduce any user-facing change?
Yes, as described above.

### How was this patch tested?
Existing tests.

Closes #26033 from zero323/SPARK-29363.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-05 18:16:28 -07:00
Xingbo Jiang 80afc79d89 [SPARK-29263][SCHEDULER][FOLLOWUP][TEST] Update FakeTask.createTaskSet() method
### What changes were proposed in this pull request?

Update `FakeTask.createTaskSet()` method to make the generated TaskSet makes use of the `priority` param.

### How was this patch tested?

All the places referring to this method pass `priority = 0`, but we still need to fix this to avoid future test failure.

Closes #26030 from jiangxb1987/SPARK-292630-FOLLOWUP.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-05 14:44:58 -07:00
zero323 8556710409 [SPARK-28985][PYTHON][ML][FOLLOW-UP] Add _IsotonicRegressionBase
### What changes were proposed in this pull request?

Adds

```python
class _IsotonicRegressionBase(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol): ...
```

with related `Params` and uses it to replace `JavaPredictor` and `HasWeightCol` in `IsotonicRegression` base classes and `JavaPredictionModel,` in `IsotonicRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `IsotonicRegression` and `JavaModel` in `IsotonicRegressionModel` with newly added `JavaPredictor`:

e97b55d322/python/pyspark/ml/wrapper.py (L377)

and `JavaPredictionModel`

e97b55d322/python/pyspark/ml/wrapper.py (L405)

respectively.

This however is inconsistent with Scala counterpart where both  classes extend private `IsotonicRegressionBase`

3cb1b57809/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala (L42-L43)

This preserves some of the existing inconsistencies (`model` as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/isotonic_regression_example.py)), i.e.

```python
from pyspark.ml.regression import IsotonicRegressionModel
from pyspark.ml.param.shared import HasWeightCol

issubclass(IsotonicRegressionModel, HasWeightCol)
# False

hasattr(model, "weightCol")
# True
```

as well as introduces a bug, by adding unsupported `predict` method:

```python
import inspect

hasattr(model, "predict")
# True

inspect.getfullargspec(IsotonicRegressionModel.predict)
# FullArgSpec(args=['self', 'value'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={})

IsotonicRegressionModel.predict.__doc__
# Predict label for the given features.\n\n        .. versionadded:: 3.0.0'

model.predict(dataset.first().features)

# Py4JError: An error occurred while calling o49.predict. Trace:
# py4j.Py4JException: Method predict([class org.apache.spark.ml.linalg.SparseVector]) does not exist
# ...

```

Furthermore existing implementation can cause further problems in the future, if `Predictor` / `PredictionModel` API changes.

### Does this PR introduce any user-facing change?

Yes. It:

- Removes invalid `IsotonicRegressionModel.predict` method.
- Adds `HasWeightCol` to `IsotonicRegressionModel`.

however the faulty implementation hasn't been released yet, and proposed additions have negligible potential for breaking existing code (and none, compared to changes already made in #25776).

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26023 from zero323/SPARK-28985-FOLLOW-UP-isotonic-regression.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-04 18:06:10 -05:00
zero323 df22535bbd [SPARK-28985][PYTHON][ML][FOLLOW-UP] Add _AFTSurvivalRegressionParams
### What changes were proposed in this pull request?

Adds

```python
class _AFTSurvivalRegressionParams(HasFeaturesCol, HasLabelCol, HasPredictionCol,
                                   HasMaxIter, HasTol, HasFitIntercept,
                                   HasAggregationDepth): ...
```

with related Params and uses it to replace `HasFitIntercept`, `HasMaxIter`, `HasTol` and  `HasAggregationDepth` in `AFTSurvivalRegression` base classes and `JavaPredictionModel,` in `AFTSurvivalRegressionModel` base classes.

### Why are the changes needed?

Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `AFTSurvivalRegression` and  `JavaModel` in `AFTSurvivalRegressionModel` with newly added `JavaPredictor`:

e97b55d322/python/pyspark/ml/wrapper.py (L377)

and `JavaPredictionModel`

e97b55d322/python/pyspark/ml/wrapper.py (L405)

respectively.

This however is inconsistent with Scala counterpart where both classes extend private `AFTSurvivalRegressionBase`

eb037a8180/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L48-L50)

This preserves some of the existing inconsistencies (variables as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/aft_survival_regression.p))

```python
from pyspark.ml.regression import AFTSurvivalRegression, AFTSurvivalRegressionModel
from pyspark.ml.param.shared import HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth
from pyspark.ml.param import Param

issubclass(AFTSurvivalRegressionModel, HasMaxIter)
# False
hasattr(model, "maxIter")  and isinstance(model.maxIter, Param)
# True

issubclass(AFTSurvivalRegressionModel, HasTol)
# False
hasattr(model, "tol")  and isinstance(model.tol, Param)
# True
```

and can cause problems in the future, if Predictor / PredictionModel API changes (unlike [`IsotonicRegression`](https://github.com/apache/spark/pull/26023), current implementation is technically speaking correct, though incomplete).

### Does this PR introduce any user-facing change?

Yes, it adds a number of base classes to `AFTSurvivalRegressionModel`. These changes are purely additive and have negligible potential for breaking existing code (and none, compared to changes already made in #25776). Additionally, the affected API hasn't been released in its current form yet.

### How was this patch tested?

- Existing unit tests.
- Manual testing.

CC huaxingao, zhengruifeng

Closes #26024 from zero323/SPARK-28985-FOLLOW-UP-aftsurival-regression.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-04 18:04:21 -05:00
Huaxin Gao 228b1ea96c [SPARK-28813][DOC][SQL] Document SHOW CREATE TABLE in SQL Reference
### What changes were proposed in this pull request?
Document SHOW CREATE TABLE statement in SQL Reference

### Why are the changes needed?
To complete the SQL reference.

### Does this PR introduce any user-facing change?
Yes.

after the change:

![image](https://user-images.githubusercontent.com/13592258/66239427-b2349800-e6ae-11e9-8f78-f9e8ed85ab3b.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25885 from huaxingao/spark-28813.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-04 16:16:00 -05:00
Yuanjian Li 130e9ae5dc [SPARK-29357][SQL][TESTS] Fix flaky test by changing to use AtomicLong
### What changes were proposed in this pull request?
Change to use AtomicLong instead of a var in the test.
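
For illustration only (not the actual test code), the pattern looks like this:

```scala
import java.util.concurrent.atomic.AtomicLong

// A plain `var count: Long` incremented from multiple threads can race and be
// read inconsistently; AtomicLong makes the increment and the read thread-safe.
val count = new AtomicLong(0L)
count.incrementAndGet()
assert(count.get() == 1L)
```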

### Why are the changes needed?
Fix flaky test.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26020 from xuanyuanking/SPARK-25159.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-04 10:11:31 -07:00
HyukjinKwon 20ee2f5dcb [SPARK-29286][PYTHON][TESTS] Uses UTF-8 with 'replace' on errors at Python testing script
### What changes were proposed in this pull request?

This PR proposes to let Python 2 use UTF-8, instead of ASCII, permissively replacing non-UTF-8 bytes with replacement characters, in the Python testing script.

### Why are the changes needed?

When Python 2 is used to run the Python testing script, with `decode(encoding='ascii')`, it fails whenever non-ASCII characters are printed out.

### Does this PR introduce any user-facing change?

For dev, it enables printing out non-ASCII characters.

### How was this patch tested?

Jenkins will test it for our existing test codes. Also, manually tested with UTF-8 output.

Closes #26021 from HyukjinKwon/SPARK-29286.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-04 10:04:28 -07:00
Maxim Gekk eecef75350 [SPARK-29355][SQL] Support timestamps subtraction
### What changes were proposed in this pull request?

Added new expression `TimestampDiff` for timestamp subtraction. It accepts 2 timestamp expressions and returns another one of the `CalendarIntervalType`. While creating an instance of `CalendarInterval`, it initializes only the microseconds field with the difference of the given timestamps in microseconds, and sets the `months` field to zero. Also, I added a rule for converting `Subtract` to `TimestampDiff`, and enabled already ported test queries in `postgreSQL/timestamp.sql`.

### Why are the changes needed?
To maintain feature parity with PostgreSQL which allows to get timestamp difference:
```sql
# select timestamp'today' - timestamp'yesterday';
 ?column?
----------
 1 day
(1 row)
```

### Does this PR introduce any user-facing change?
Yes, previously users got the following error from timestamp subtraction:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
Error in query: cannot resolve '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' due to data type mismatch: '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' requires (numeric or interval) type, not timestamp; line 1 pos 7;
'Project [unresolvedalias((1570136400000000 - 1570050000000000), None)]
+- OneRowRelation
```
after the changes they should get an interval:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
interval 1 days
```

### How was this patch tested?
- Added tests for `TimestampDiff` to `DateExpressionsSuite`
- By new test in `TypeCoercionSuite`.
- Enabled tests in `postgreSQL/timestamp.sql`.

Closes #26022 from MaxGekk/timestamp-diff.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-04 09:39:19 -07:00
Wenchen Fan 275e044ba8 [SPARK-29039][SQL] centralize the catalog and table lookup logic
### What changes were proposed in this pull request?

Currently we deal with different `ParsedStatement` in many places and write duplicated catalog/table lookup logic. In general the lookup logic is
1. try look up the catalog by name. If no such catalog, and default catalog is not set, convert `ParsedStatement` to v1 command like `ShowDatabasesCommand`. Otherwise, convert `ParsedStatement` to v2 command like `ShowNamespaces`.
2. try look up the table by name. If no such table, fail. If the table is a `V1Table`, convert `ParsedStatement` to v1 command like `CreateTable`. Otherwise, convert `ParsedStatement` to v2 command like `CreateV2Table`.

However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places.

This PR centralizes the catalog/table lookup logic by 3 rules.
1. `ResolveCatalogs` (in catalyst). This rule resolves v2 catalog from the multipart identifier in SQL statements, and convert the statement to v2 command if the resolved catalog is not session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), put an `UnresolvedV2Table` in the command.
2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`.
3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is session catalog. For commands that don't need to resolve the table, this rule converts the statement to v1 command directly. Otherwise, it converts the statement to v1 command if the resolved table is v1 table, and convert to v2 command if the resolved table is v2 table. Hopefully we can remove this rule eventually when v1 fallback is not needed anymore.

### Why are the changes needed?

Reduce duplicated code and make the catalog/table lookup logic consistent.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #25747 from cloud-fan/lookup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-04 16:21:13 +08:00
Yuanjian Li 93289b54f5 [SPARK-29203][TESTS][MINOR][FOLLOW UP] Add access modifier for sparkConf in SQLQueryTestSuite
### What changes were proposed in this pull request?
Add access modifier `protected` for `sparkConf` in SQLQueryTestSuite, because in the parent trait SharedSparkSession, it is protected.

### Why are the changes needed?
Code consistency.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #26019 from xuanyuanking/SPARK-29203.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-04 16:54:47 +09:00
Gengliang Wang 91747bd91b [SPARK-29326][SQL] ANSI store assignment policy: throw exception on casting failure
### What changes were proposed in this pull request?

1. With ANSI store assignment policy,  an exception is thrown on casting failure
2. Introduce a new expression `AnsiCast` for the ANSI store assignment policy, so that the store assignment policy configuration won't affect the general `Cast`.

### Why are the changes needed?

As per the ANSI SQL standard, the ANSI store assignment policy should throw an exception on insertion failure, such as inserting an out-of-range value into a numeric field.
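
For illustration (assuming the policy is selected via `spark.sql.storeAssignmentPolicy=ANSI` and a table `t` with an `INT` column; both are assumptions of this sketch, not statements from this PR):

```scala
// With the ANSI policy, an out-of-range insert is expected to throw
// instead of silently storing an overflowed or null value.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("INSERT INTO t VALUES (12345678901234)")  // expected to fail under ANSI
```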

### Does this PR introduce any user-facing change?

With ANSI store assignment policy,  an exception is thrown on casting failure

### How was this patch tested?

Unit test

Closes #25997 from gengliangwang/newCast.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-04 15:53:38 +08:00
DB Tsai 8b71e545d5 [SPARK-29351][CORE] Avoid Full Synchronization in ShuffleMapStage
### What changes were proposed in this pull request?

This PR uses read/write locks instead of `synchronized`.
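
A hedged sketch of the technique (illustrative class, not the actual `MapOutputTracker`/`ShuffleStatus` code):

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

class SerializedStatusHolder {
  private val lock = new ReentrantReadWriteLock()
  private var serialized: Array[Byte] = Array.emptyByteArray

  // Many message-loop threads can read concurrently without blocking each other.
  def read(): Array[Byte] = {
    lock.readLock().lock()
    try serialized finally lock.readLock().unlock()
  }

  // Updates still take the exclusive write lock.
  def update(bytes: Array[Byte]): Unit = {
    lock.writeLock().lock()
    try serialized = bytes finally lock.writeLock().unlock()
  }
}
```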

### Why are the changes needed?

In one of our production streaming jobs that has more than 1k executors, each with 20 cores, Spark spends a significant portion of time (30s) sending out the `ShuffleStatus`. We found there are two issues.

1. In the driver's message loop, it calls `serializedMapStatus`, which is in a synchronized block. When the job scales really big, this can cause contention.
2. When the job is big, the `MapStatus` is huge as well, and the serialization and compression time is slow.

This PR aims to address the first problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Test with existing test cases.

Closes #26017 from dbtsai/readWriteLock.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-10-04 03:14:10 +00:00
HyukjinKwon 0f48aafab8 [SPARK-29339][R] Support Arrow 0.14 in vectoried dapply and gapply (test it in AppVeyor build)
### What changes were proposed in this pull request?

This PR proposes:

1. Use `is.data.frame` to check if it is a DataFrame.
2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI.

### Why are the changes needed?

1. To support SparkR with Arrow 0.14
2. To check if there's any regression and if it works correctly.

### Does this PR introduce any user-facing change?

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

**Before:**

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

**After:**

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

### How was this patch tested?

AppVeyor

Closes #25993 from HyukjinKwon/arrow-r-appveyor.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-04 08:56:45 +09:00
maryannxue 8fabbab299 [SPARK-29350] Fix BroadcastExchange reuse in Dynamic Partition Pruning
### What changes were proposed in this pull request?
Dynamic partition pruning filters are added as an in-subquery containing a `BroadcastExchangeExec` in case of a broadcast hash join. This PR makes the `ReuseExchange` rule visit in-subquery nodes, to ensure the new `BroadcastExchangeExec` added by dynamic partition pruning can be reused.

### Why are the changes needed?
The initial dynamic partition pruning PR did not enable this reuse, which means a broadcast exchange would be executed twice: once in the main query and once in the DPP filter.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added broadcast exchange reuse check in `DynamicPartitionPruningSuite`

Closes #26015 from maryannxue/exchange-reuse.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-10-03 16:11:32 -07:00
zhouyongjin aedf090ab7 [SPARK-25468][WEBUI][FOLLOWUP] Current page index keep style with dataTable in the spark UI
### What changes were proposed in this pull request?
Make the current page index consistent in style with the dataTable in the Spark UI.
https://issues.apache.org/jira/browse/SPARK-25468

move
.paginate_button.active > a {
    color: #999999;
    text-decoration: underline;
}

add
.paginate_button.active {
  border: 1px solid #979797 !important;
  background: white linear-gradient(to bottom, #fff 0%, #dcdcdc 100%);
}

### How was this patch tested?
![image](https://user-images.githubusercontent.com/8166352/65872683-b5303f80-e3b3-11e9-9785-9e1d15b0b3cf.png)
![image](https://user-images.githubusercontent.com/8166352/65872692-ba8d8a00-e3b3-11e9-8cf6-db6f905d3387.png)

Closes #25976 from YongjinZhou/master.

Authored-by: zhouyongjin <zhouyongjin@inspur.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-03 14:03:21 -05:00
Nik Vanderhoof 6f687691ef [SPARK-28962][SPARK-27297][SQL] Add overload for filter with index to functions object
### What changes were proposed in this pull request?
Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`.
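
An illustrative use of such an overload (the DataFrame `df` and its array column `xs` are hypothetical):

```scala
import org.apache.spark.sql.functions._

// Keep only the elements of the array column "xs" that sit at an even index.
val result = df.select(filter(col("xs"), (_, i) => i % 2 === 0).as("even_indexed"))
```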

### Why are the changes needed?
See: SPARK-28962 and SPARK-27297. Specifically, ueshin pointed out the discrepancy here: https://github.com/apache/spark/pull/24232#issuecomment-533288653

### Does this PR introduce any user-facing change?

### How was this patch tested?
Updated these test suites:

`test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite`
and
`org.apache.spark.sql.DataFrameFunctionsSuite`

Closes #26007 from nvander1/add_index_overload_for_filter.

Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2019-10-03 11:12:14 -07:00