ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kent Yao	311fe6a880	[SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in DateExpressionsSuite ### What changes were proposed in this pull request? This PR modifies some codegen related tests to test escape characters for datetime functions which are time zone aware. If the timezone is absent, the formatter could result in `null` caused by `java.util.NoSuchElementException: None.get` and bypassing the real intention of those test cases. ### Why are the changes needed? fix tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing the modified test cases. Closes #28653 from yaooqinn/SPARK-31835. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-27 17:26:07 +00:00
Ali Afroozeh	f6f1e51072	[SPARK-31719][SQL] Refactor JoinSelection ### What changes were proposed in this pull request? This PR extracts the logic for selecting the planned join type out of the `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst. ### Why are the changes needed? This change both cleans up the code in `JoinSelection` and allows the logic to be in one place and be used from other rules that need to make decision based on the join type before the planning time. ### Does this PR introduce _any_ user-facing change? `BuildSide`, `BuildLeft`, and `BuildRight` are moved from `org.apache.spark.sql.execution` to Catalyst in `org.apache.spark.sql.catalyst.optimizer`. ### How was this patch tested? This is a refactoring, passes existing tests. Closes #28540 from dbaliafroozeh/RefactorJoinSelection. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-27 15:49:08 +00:00
Huaxin Gao	50492c0bd3	[SPARK-31803][ML] Make sure instance weight is not negative ### What changes were proposed in this pull request? In the algorithms that support instance weight, add checks to make sure instance weight is not negative. ### Why are the changes needed? instance weight has to be >= 0.0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #28621 from huaxingao/weight_check. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-27 09:01:56 -05:00
iRakson	765105b6f1	[SPARK-31638][WEBUI] Clean Pagination code for all webUI pages ### What changes were proposed in this pull request? Pagination code across pages needs to be cleaned. I have tried to clear out these things : * Unused methods * Unused method arguments * remove redundant `if` expressions * fix indentation ### Why are the changes needed? This fix will make code more readable and remove unnecessary methods and variables. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually Closes #28448 from iRakson/refactorPagination. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-27 08:59:08 -05:00
beliefer	8f2b6f3a0b	[SPARK-31393][SQL][FOLLOW-UP] Show the correct alias in schema for expression ### What changes were proposed in this pull request? Some alias of expression can not display correctly in schema. This PR will fix them. - `ln` - `rint` - `lcase` - `position` ### Why are the changes needed? Improve the implement of some expression. ### Does this PR introduce _any_ user-facing change? 'Yes'. This PR will let user see the correct alias in schema. ### How was this patch tested? Jenkins test. Closes #28551 from beliefer/show-correct-alias-in-schema. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-27 15:05:06 +09:00
Gabor Somogyi	8f1d77488c	[SPARK-31821][BUILD] Remove mssql-jdbc dependencies from Hadoop 3.2 profile ### What changes were proposed in this pull request? There is an unnecessary dependency for `mssql-jdbc`. In this PR I've removed it. ### Why are the changes needed? Unnecessary dependency. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins with the following configuration. - [x] Pass the dependency test. - [x] SBT with Hadoop-3.2 (https://github.com/apache/spark/pull/28640#issuecomment-634192512) - [ ] Maven with Hadoop-3.2 Closes #28640 from gaborgsomogyi/SPARK-31821. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-26 18:21:22 -07:00
HyukjinKwon	7fb2275f00	Revert "[SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs" This reverts commit `a61911c50c`.	2020-05-27 10:15:33 +09:00
Max Gekk	87d34e6b96	[SPARK-31820][SQL][TESTS] Fix flaky JavaBeanDeserializationSuite ### What changes were proposed in this pull request? Modified formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with zero seconds fraction. Current implementation outputs `.0` but must be empty string. From SPARK-31820 failure: - should be `2020-05-25 12:39:17` - but incorrect expected string is `2020-05-25 12:39:17.0` ### Why are the changes needed? To make `JavaBeanDeserializationSuite` stable, and avoid test failures like https://github.com/apache/spark/pull/28630#issuecomment-633695723 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I changed `7dff3b125d/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java (L207)` to ```java new java.sql.Timestamp((System.currentTimeMillis() / 1000) * 1000), ``` to force zero seconds fraction. Closes #28639 from MaxGekk/fix-JavaBeanDeserializationSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-26 12:13:28 +00:00
Dilip Biswal	b44acee953	[SPARK-31673][SQL] QueryExection.debug.toFile() to take an addtional explain mode param ### What changes were proposed in this pull request? Currently QueryExecution.debug.toFile dumps the query plan information in a fixed format. This PR adds an additional explain mode parameter that writes the debug information as per the user supplied format. ``` df.queryExecution.debug.toFile("/tmp/plan.txt", explainMode = ExplainMode.fromString("formatted")) ``` ``` == Physical Plan == * Filter (2) +- Scan hive default.s1 (1) (1) Scan hive default.s1 Output [2]: [c1#15, c2#16] Arguments: [c1#15, c2#16], HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#15, c2#16] (2) Filter [codegen id : 1] Input [2]: [c1#15, c2#16] Condition : (isnotnull(c1#15) AND (c1#15 > 0)) == Whole Stage Codegen == Found 1 WholeStageCodegen subtrees. == Subtree 1 / 1 (maxMethodCodeSize:220; maxConstantPoolSize:105(0.16% used); numInnerClasses:0) == (1) Filter (isnotnull(c1#15) AND (c1#15 > 0)) +- Scan hive default.s1 [c1#15, c2#16], HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#15, c2#16] Generated code: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage1(references); / 003 / } / 004 / / 005 / // codegenStageId=1 / 006 / final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private scala.collection.Iterator inputadapter_input_0; / 010 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1]; / 011 / / 012 / public GeneratedIteratorForCodegenStage1(Object[] references) { / 013 / this.references = references; / 014 / } / 015 / / 016 / public void init(int index, scala.collection.Iterator[] inputs) { / 017 / partitionIndex = index; / 018 / this.inputs = inputs; / 019 / inputadapter_input_0 = inputs[0]; / 020 / filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0); / 021 / / 022 / } / 023 / / 024 / protected void processNext() throws java.io.IOException { / 025 / while ( inputadapter_input_0.hasNext()) { / 026 / InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next(); / 027 / / 028 / do { / 029 / boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0); / 030 / int inputadapter_value_0 = inputadapter_isNull_0 ? / 031 / -1 : (inputadapter_row_0.getInt(0)); / 032 / / 033 / boolean filter_value_2 = !inputadapter_isNull_0; / 034 / if (!filter_value_2) continue; / 035 / / 036 / boolean filter_value_3 = false; / 037 / filter_value_3 = inputadapter_value_0 > 0; / 038 / if (!filter_value_3) continue; / 039 / / 040 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] / numOutputRows /).add(1); / 041 / / 042 / boolean inputadapter_isNull_1 = inputadapter_row_0.isNullAt(1); / 043 / int inputadapter_value_1 = inputadapter_isNull_1 ? / 044 / -1 : (inputadapter_row_0.getInt(1)); / 045 / filter_mutableStateArray_0[0].reset(); / 046 / / 047 / filter_mutableStateArray_0[0].zeroOutNullBytes(); / 048 / / 049 / filter_mutableStateArray_0[0].write(0, inputadapter_value_0); / 050 / / 051 / if (inputadapter_isNull_1) { / 052 / filter_mutableStateArray_0[0].setNullAt(1); / 053 / } else { / 054 / filter_mutableStateArray_0[0].write(1, inputadapter_value_1); / 055 / } / 056 / append((filter_mutableStateArray_0[0].getRow())); / 057 / / 058 / } while(false); / 059 / if (shouldStop()) return; / 060 / } / 061 / } / 062 / / 063 */ } ``` ### Why are the changes needed? Hopefully enhances the usability of debug.toFile(..) ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test in QueryExecutionSuite Closes #28493 from dilipbiswal/write_to_file. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-26 14:40:58 +09:00
Xingcan Cui	8ba2b47737	[SPARK-31792][SS][DOCS] Introduce the structured streaming UI in the Web UI doc ### What changes were proposed in this pull request? This PR adds the structured streaming UI introduction to the Web UI doc. ![image](https://user-images.githubusercontent.com/1452518/82642209-92b99380-9bdb-11ea-9a0d-cbb26040b0ef.png) ### Why are the changes needed? The structured streaming web UI introduced before was missing from the Web UI documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N.A. Closes #28609 from xccui/ss-ui-doc. Authored-by: Xingcan Cui <xccui@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-26 14:27:42 +09:00
Max Gekk	7e4f5bbd8a	[SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version ### What changes were proposed in this pull request? 1. Add the following parquet files to the resource folder `sql/core/src/test/resources/test-data`: - Files saved by Spark 2.4.5 (`cee4ecbb16`) without meta info `org.apache.spark.version` - `before_1582_date_v2_4_5.snappy.parquet` with 2 date columns of the type INT32 L:DATE - `PLAIN` (8 date values of `1001-01-01`) and `PLAIN_DICTIONARY` (`1001-01-01`..`1001-01-08`). - `before_1582_timestamp_micros_v2_4_5.snappy.parquet` with 2 timestamp columns of the type INT64 L:TIMESTAMP(MICROS,true) - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - `before_1582_timestamp_millis_v2_4_5.snappy.parquet` with 2 timestamp columns of the type INT64 L:TIMESTAMP(MILLIS,true) - `PLAIN` (8 date values of `1001-01-01 01:02:03.123`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123`..`1001-01-08 01:02:03.123`). - `before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet` with 2 timestamp columns of the type INT96 - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` with 2 timestamp columns of the type INT96 - `PLAIN_DICTIONARY` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - Files saved by Spark 2.4.6-rc3 (`570848da7c`) with the meta info `org.apache.spark.version = 2.4.6`: - `before_1582_date_v2_4_6.snappy.parquet` replaces `before_1582_date_v2_4.snappy.parquet`. And it is similar to `before_1582_date_v2_4_5.snappy.parquet` except Spark version in parquet meta info. - `before_1582_timestamp_micros_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_micros_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_micros_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_millis_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_millis_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_millis_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet` is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_int96_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info. 2. Add new test "generate test files for checking compatibility with Spark 2.4" to `ParquetIOSuite` (marked as ignored). The parquet files above were generated by this test. 3. Modified "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in `ParquetIOSuite` to use new parquet files. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `ParquetIOSuite`. Closes #28630 from MaxGekk/parquet-files-update. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-26 05:15:51 +00:00
Prakhar Jain	452594f5a4	[SPARK-31810][TEST] Fix AlterTableRecoverPartitions test using incorrect api to modify RDD_PARALLEL_LISTING_THRESHOLD ### What changes were proposed in this pull request? Use the correct API in AlterTableRecoverPartition tests to modify the `RDD_PARALLEL_LISTING_THRESHOLD` conf. ### Why are the changes needed? The existing AlterTableRecoverPartitions test modify the RDD_PARALLEL_LISTING_THRESHOLD as a SQLConf using the withSQLConf API. But since, this is not a SQLConf, it is not overridden and so the test doesn't end up testing the required behaviour. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is UT Fix. UTs are still passing after the fix. Closes #28634 from prakharjain09/SPARK-31810-fix-recover-partitions. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-26 14:13:02 +09:00
HyukjinKwon	df2a1fe131	[SPARK-31808][SQL] Makes struct function's output name and class name pretty ### What changes were proposed in this pull request? This PR proposes to set the alias, and class name in its `ExpressionInfo` for `struct`. - Class name in `ExpressionInfo` - from: `org.apache.spark.sql.catalyst.expressions.NamedStruct` - to:`org.apache.spark.sql.catalyst.expressions.CreateNamedStruct` - Alias name: `named_struct(col1, v, ...)` -> `struct(v, ...)` This PR takes over https://github.com/apache/spark/pull/28631 ### Why are the changes needed? To show the correct output name and class names to users. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```scala scala> sql("DESC FUNCTION struct").show(false) +------------------------------------------------------------------------------------+ \|function_desc \| +------------------------------------------------------------------------------------+ \|Function: struct \| \|Class: org.apache.spark.sql.catalyst.expressions.NamedStruct \| \|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.\| +------------------------------------------------------------------------------------+ ``` ```scala scala> sql("SELECT struct(1, 2)").show(false) +------------------------------+ \|named_struct(col1, 1, col2, 2)\| +------------------------------+ \|[1, 2] \| +------------------------------+ ``` After: ```scala scala> sql("DESC FUNCTION struct").show(false) +------------------------------------------------------------------------------------+ \|function_desc \| +------------------------------------------------------------------------------------+ \|Function: struct \| \|Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct \| \|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.\| +------------------------------------------------------------------------------------+ ``` ```scala scala> sql("SELECT struct(1, 2)").show(false) +------------+ \|struct(1, 2)\| +------------+ \|[1, 2] \| +------------+ ``` ### How was this patch tested? Manually tested, and Jenkins tests. Closes #28633 from HyukjinKwon/SPARK-31808. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-25 20:36:00 -07:00
Max Gekk	6c80ebbccb	[SPARK-31818][SQL] Fix pushing down filters with `java.time.Instant` values in ORC ### What changes were proposed in this pull request? Convert `java.time.Instant` to `java.sql.Timestamp` in pushed down filters to ORC datasource when Java 8 time API enabled. ### Why are the changes needed? The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`: ``` java.lang.IllegalArgumentException: Wrong value class java.time.Instant for TIMESTAMP.EQUALS leaf at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75) ``` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Added tests to `OrcFilterSuite`. Closes #28636 from MaxGekk/orc-timestamp-filter-pushdown. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-25 18:36:02 -07:00
Kent Yao	695cb617d4	[SPARK-31771][SQL] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' ### What changes were proposed in this pull request? Five continuous pattern characters with 'G/M/L/E/u/Q/q' means Narrow-Text Style while we turn to use `java.time.DateTimeFormatterBuilder` since 3.0.0, which output the leading single letter of the value, e.g. `December` would be `D`. In Spark 2.4 they mean Full-Text Style. In this PR, we explicitly disable Narrow-Text Style for these pattern characters. ### Why are the changes needed? Without this change, there will be a silent data change. ### Does this PR introduce _any_ user-facing change? Yes, queries with datetime operations using datetime patterns, e.g. `G/M/L/E/u` will fail if the pattern length is 5 and other patterns, e,g. 'k', 'm' also can accept a certain number of letters. 1. datetime patterns that are not supported by the new parser but the legacy will get SparkUpgradeException, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa". 2 options are given to end-users, one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns 2, datetime patterns that are not supported by both the new parser and the legacy, e.g. "QQQQQ", "qqqqq", will get IllegalArgumentException which is captured by Spark internally and results NULL to end-users. ### How was this patch tested? add unit tests Closes #28592 from yaooqinn/SPARK-31771. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-25 15:07:41 +00:00
Max Gekk	92685c0148	[SPARK-31755][SQL][FOLLOWUP] Update date-time, CSV and JSON benchmark results ### What changes were proposed in this pull request? Re-generate results of: - DateTimeBenchmark - CSVBenchmark - JsonBenchmark in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| ### Why are the changes needed? 1. The PR https://github.com/apache/spark/pull/28576 changed date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing. 2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a base line for future optimizations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running benchmarks via the script: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #28613 from MaxGekk/missing-hour-year-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-25 15:00:11 +00:00
Kent Yao	0df8dd6073	[SPARK-30352][SQL] DataSourceV2: Add CURRENT_CATALOG function ### What changes were proposed in this pull request? As we support multiple catalogs with DataSourceV2, we may need the `CURRENT_CATALOG` value expression from the SQL standard. `CURRENT_CATALOG` is a general value specification in the SQL Standard, described as: > The value specified by CURRENT_CATALOG is the character string that represents the current default catalog name. ### Why are the changes needed? improve catalog v2 with ANSI SQL standard. ### Does this PR introduce any user-facing change? yes, add a new function `current_catalog()` to point the current active catalog ### How was this patch tested? add ut Closes #27006 from yaooqinn/SPARK-30352. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-25 14:27:47 +00:00
Huaxin Gao	d4007776f2	[SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator ### What changes were proposed in this pull request? Add weight support in ClusteringEvaluator ### Why are the changes needed? Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator. ### Does this PR introduce _any_ user-facing change? Yes. ClusteringEvaluator.setWeightCol ### How was this patch tested? add new unit test Closes #28553 from huaxingao/weight_evaluator. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-25 09:18:08 -05:00
Max Gekk	7f36310500	[SPARK-31802][SQL] Format Java date-time types in `Row.jsonValue` directly ### What changes were proposed in this pull request? Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR https://github.com/apache/spark/pull/28582 added the methods to avoid conversions to days and microseconds. ### Why are the changes needed? To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests in `RowJsonSuite`. Closes #28620 from MaxGekk/toJson-format-Java-datetime-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-25 12:50:38 +09:00
rishi	b90e10c546	[SPARK-31377][SQL][TEST] Added unit tests to 'number of output rows metric' for some joins in SQLMetricSuite ### What changes were proposed in this pull request? Add unit tests to the 'number of output rows metric' for some join types in the SQLMetricSuite. A list of unit tests added are as follows. - ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi - BroadcastNestedLoopJoin: RightOuter - BroadcastHashJoin: LeftAnti ### Why are the changes needed? For some combinations of JoinType and Join algorithm there is no test coverage for the 'number of output rows' metric. ### Does this PR introduce any user-facing change? No ### How was this patch tested? I added debug statements in the code to ensure the correct combination if JoinType and Join algorithms are triggered. I further used Intellij debugger to test the same. Closes #28330 from sririshindra/SPARK-31377. Authored-by: rishi <spothireddi@cloudera.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-25 12:44:14 +09:00
schintap	a61911c50c	[SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs ### What changes were proposed in this pull request? UnionRDD of PairRDDs causing a bug. The fix is to check for instance type before proceeding ### Why are the changes needed? Changes are needed to avoid users running into issues with union rdd operation with any other type other than JavaRDD. ### Does this PR introduce _any_ user-facing change? Yes Before: SparkSession available as 'spark'. >>> rdd1 = sc.parallelize([1,2,3,4,5]) >>> rdd2 = sc.parallelize([6,7,8,9,10]) >>> pairRDD1 = rdd1.zip(rdd2) >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in setitem File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) After: >>> rdd2 = sc.parallelize([6,7,8,9,10]) >>> pairRDD1 = rdd1.zip(rdd2) >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) >>> unionRDD1.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)] ### How was this patch tested? Tested with the reproduced piece of code above manually Closes #28603 from redsanket/SPARK-31788. Authored-by: schintap <schintap@verizonmedia.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-25 10:29:08 +09:00
William Hyun	753636e86b	[SPARK-31807][INFRA] Use python 3 style in release-build.sh ### What changes were proposed in this pull request? This PR aims to use the style that is compatible with both python 2 and 3. ### Why are the changes needed? This will help python 3 migration. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual. Closes #28632 from williamhyun/use_python3_style. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-25 10:25:43 +09:00
Huaxin Gao	d0fe433b5e	[SPARK-31768][ML] add getMetrics in Evaluators ### What changes were proposed in this pull request? add getMetrics in Evaluators to get the corresponding Metrics instance, so users can use it to get any of the metrics scores. For example: ``` val trainer = new LinearRegression val model = trainer.fit(dataset) val predictions = model.transform(dataset) val evaluator = new RegressionEvaluator() val metrics = evaluator.getMetrics(predictions) val rmse = metrics.rootMeanSquaredError val r2 = metrics.r2 val mae = metrics.meanAbsoluteError val variance = metrics.explainedVariance ``` ### Why are the changes needed? Currently, Evaluator.evaluate only access to one metrics, but most users may need to get multiple metrics. This PR adds getMetrics in all the Evaluators, so users can use it to get an instance of the corresponding Metrics to get any of the metrics they want. ### Does this PR introduce _any_ user-facing change? Yes. Add getMetrics in Evaluators. For example: ``` /** * Get a RegressionMetrics, which can be used to get any of the regression * metrics such as rootMeanSquaredError, meanSquaredError, etc. * * param dataset a dataset that contains labels/observations and predictions. * return RegressionMetrics */ Since("3.1.0") def getMetrics(dataset: Dataset[_]): RegressionMetrics ``` ### How was this patch tested? Add new unit tests Closes #28590 from huaxingao/getMetrics. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-24 08:42:42 -05:00
sandeep katta	cf7463f309	[SPARK-31761][SQL] cast integer to Long to avoid IntegerOverflow for IntegralDivide operator ### What changes were proposed in this pull request? `IntegralDivide` operator returns Long DataType, so integer overflow case should be handled. If the operands are of type Int it will be casted to Long ### Why are the changes needed? As `IntegralDivide` returns Long datatype, integer overflow should not happen ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT and also tested in the local cluster After fix ![image](https://user-images.githubusercontent.com/35216143/82603361-25eccc00-9bd0-11ea-9ca7-001c539e628b.png) SQL Test After fix ![image](https://user-images.githubusercontent.com/35216143/82637689-f0250300-9c22-11ea-85c3-886ab2c23471.png) Before Fix ![image](https://user-images.githubusercontent.com/35216143/82637984-878a5600-9c23-11ea-9e47-5ce2fb923c01.png) Closes #28600 from sandeep-katta/integerOverFlow. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-24 14:50:11 +09:00
Kousuke Saruta	8441e936fc	Revert "[SPARK-31756][WEBUI] Add real headless browser support for UI test This reverts commit `d95570864a`. Closes #28624 from sarutak/revert-real-headless-browser-support. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2020-05-24 09:13:38 +09:00
Gengliang Wang	9fdc2a0801	[SPARK-31793][SQL] Reduce the memory usage in file scan location metadata ### What changes were proposed in this pull request? Currently, the data source scan node stores all the paths in its metadata. The metadata is kept when a SparkPlan is converted into SparkPlanInfo. SparkPlanInfo can be used to construct the Spark plan graph in UI. However, the paths can be very large (e.g. it can be many partitions after partition pruning), while UI pages only require up to 100 bytes for the location metadata. We can reduce the paths stored in metadata to reduce memory usage. ### Why are the changes needed? Reduce unnecessary memory cost. In the heap dump of a driver, the SparkPlanInfo instances are quite large and it should be avoided: ![image](https://user-images.githubusercontent.com/1097932/82642318-8f65de00-9bc2-11ea-9c9c-f05c2b0e1c49.png) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #28610 from gengliangwang/improveLocationMetadata. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-05-23 15:00:28 -07:00
Dongjoon Hyun	64ffc66496	[SPARK-31786][K8S][BUILD] Upgrade kubernetes-client to 4.9.2 ### What changes were proposed in this pull request? This PR aims to upgrade `kubernetes-client` library to bring the JDK8 related fixes. Please note that JDK11 works fine without any problem. - https://github.com/fabric8io/kubernetes-client/releases/tag/v4.9.2 - JDK8 always uses http/1.1 protocol (Prevent OkHttp from wrongly enabling http/2) ### Why are the changes needed? OkHttp "wrongly" detects the Platform as Jdk9Platform on JDK 8u251. - https://github.com/fabric8io/kubernetes-client/issues/2212 - https://stackoverflow.com/questions/61565751/why-am-i-not-able-to-run-sparkpi-example-on-a-kubernetes-k8s-cluster Although there is a workaround `export HTTP2_DISABLE=true` and `Downgrade JDK or K8s`, we had better avoid this problematic situation. ### Does this PR introduce _any_ user-facing change? No. This will recover the failures on JDK 8u252. ### How was this patch tested? - [x] Pass the Jenkins UT (https://github.com/apache/spark/pull/28601#issuecomment-632474270) - [x] Pass the Jenkins K8S IT with the K8s 1.13 (https://github.com/apache/spark/pull/28601#issuecomment-632438452) - [x] Manual testing with K8s 1.17.3. (Below) v1.17.6 result (on Minikube) ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - Test basic decommissioning Run completed in 8 minutes, 27 seconds. Total number of tests run: 19 Suites: completed 2, aborted 0 Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28601 from dongjoon-hyun/SPARK-K8S-CLIENT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-23 11:07:45 -07:00
iRakson	fbb3144a9c	[SPARK-31642] Add Pagination Support for Structured Streaming Page ### What changes were proposed in this pull request? Add Pagination Support for structured streaming page. Now both tables `Active Queries` and `Completed Queries` will have pagination. To implement pagination, pagination framework from #7399 is used. * Also tables will only be shown if there is at least one entry in the table. ### Why are the changes needed? * This will help users in analysing their structured streaming queries in much better way. * Other Web UI pages support pagination in their table. So this will make web UI more consistent across pages. * This can prevent potential OOM errors. ### Does this PR introduce _any_ user-facing change? Yes. Both tables will support pagination. ### How was this patch tested? Manually. I will add snapshots soon. Closes #28485 from iRakson/SPARK-31642. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2020-05-23 17:17:53 +09:00
Holden Karau	721cba5402	[SPARK-31791][CORE][TEST] Improve cache block migration test reliability ### What changes were proposed in this pull request? Increase the timeout and register the listener earlier to avoid any race condition of the job starting before the listener is registered. ### Why are the changes needed? The test is currently semi-flaky. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I'm currently running the following bash script on my dev machine to verify the flakiness decreases. It has gotten to 356 iterations without any test failures so I believe issue is fixed. ``` set -ex ./build/sbt clean compile package ((failures=0)) for (( i=0;i<1000;++i )); do echo "Run $i" ((failed=0)) ./build/sbt "core/testOnly org.apache.spark.scheduler.WorkerDecommissionSuite" \|\| ((failed=1)) echo "Resulted in $failed" ((failures=failures+failed)) echo "Current status is failures: $failures out of $i runs" done ``` Closes #28614 from holdenk/SPARK-31791-improve-cache-block-migration-test-reliability. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2020-05-22 18:19:41 -07:00
Takeshi Yamamuro	7ca73f03fb	[SPARK-29854][SQL][TESTS] Add tests to check lpad/rpad throw an exception for invalid length input ### What changes were proposed in this pull request? This PR intends to add trivial tests to check https://github.com/apache/spark/pull/27024 has already been fixed in the master. Closes #27024 ### Why are the changes needed? For test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #28604 from maropu/SPARK-29854. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-23 08:48:29 +09:00
Jungtaek Lim (HeartSaVioR)	5a258b0b67	[SPARK-30915][SS] CompactibleFileStreamLog: Avoid reading the metadata log file when finding the latest batch ID ### What changes were proposed in this pull request? This patch adds the new method `getLatestBatchId()` in CompactibleFileStreamLog in complement of getLatest() which doesn't read the content of the latest batch metadata log file, and apply to both FileStreamSource and FileStreamSink to avoid unnecessary latency on reading log file. ### Why are the changes needed? Once compacted metadata log file becomes huge, writing outputs for the compact + 1 batch is also affected due to unnecessarily reading the compacted metadata log file. This unnecessary latency can be simply avoided. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT. Also manually tested under query which has huge metadata log on file stream sink: > before applying the patch ![Screen Shot 2020-02-21 at 4 20 19 PM](https://user-images.githubusercontent.com/1317309/75016223-d3ffb180-54cd-11ea-9063-49405943049d.png) > after applying the patch ![Screen Shot 2020-02-21 at 4 06 18 PM](https://user-images.githubusercontent.com/1317309/75016220-d235ee00-54cd-11ea-81a7-7c03a43c4db4.png) Peaks are compact batches - please compare the next batch after compact batches, especially the area of "light brown". Closes #27664 from HeartSaVioR/SPARK-30915. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-05-22 16:46:17 -07:00
Huaxin Gao	ad9532a09c	[SPARK-31612][SQL][DOCS][FOLLOW-UP] Fix a few issues in SQL ref ### What changes were proposed in this pull request? Fix a few issues in SQL Reference ### Why are the changes needed? To make SQL Reference look better ### Does this PR introduce _any_ user-facing change? Yes. before: <img width="189" alt="Screen Shot 2020-05-21 at 11 41 34 PM" src="https://user-images.githubusercontent.com/13592258/82639052-d0f38a80-9bbc-11ea-81a4-22def4ca5cc0.png"> after: <img width="195" alt="Screen Shot 2020-05-21 at 11 41 17 PM" src="https://user-images.githubusercontent.com/13592258/82639063-d5b83e80-9bbc-11ea-84d1-8361e6bee949.png"> before: <img width="763" alt="Screen Shot 2020-05-21 at 11 45 22 PM" src="https://user-images.githubusercontent.com/13592258/82639252-3e9fb680-9bbd-11ea-863c-e6a6c2f83a06.png"> after: <img width="724" alt="Screen Shot 2020-05-21 at 11 45 02 PM" src="https://user-images.githubusercontent.com/13592258/82639265-42cbd400-9bbd-11ea-8df2-fc5c255b84d3.png"> before: <img width="437" alt="Screen Shot 2020-05-21 at 11 41 57 PM" src="https://user-images.githubusercontent.com/13592258/82639072-db158900-9bbc-11ea-9963-731881cda4fd.png"> after <img width="347" alt="Screen Shot 2020-05-21 at 11 42 26 PM" src="https://user-images.githubusercontent.com/13592258/82639082-dfda3d00-9bbc-11ea-9bd2-f922cc91f175.png"> ### How was this patch tested? Manually build and check Closes #28608 from huaxingao/doc_fix. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-23 08:43:16 +09:00
TJX2014	2115c55efe	[SPARK-31710][SQL] Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions ### What changes were proposed in this pull request? Add and register three new functions: `TIMESTAMP_SECONDS`, `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS` A test is added. Reference: [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds) ### Why are the changes needed? People will have convenient way to get timestamps from seconds,milliseconds and microseconds. ### Does this PR introduce _any_ user-facing change? Yes, people will have the following ways to get timestamp: ```scala sql("select TIMESTAMP_SECONDS(t.a) as timestamp from values(1230219000),(-1230219000) as t(a)").show(false) ``` ``` +-------------------------+ \|timestamp \| +-------------------------+ \|2008-12-25 23:30:00\| \|1931-01-07 16:30:00\| +-------------------------+ ``` ```scala sql("select TIMESTAMP_MILLIS(t.a) as timestamp from values(1230219000123),(-1230219000123) as t(a)").show(false) ``` ``` +-------------------------------+ \|timestamp \| +-------------------------------+ \|2008-12-25 23:30:00.123\| \|1931-01-07 16:29:59.877\| +-------------------------------+ ``` ```scala sql("select TIMESTAMP_MICROS(t.a) as timestamp from values(1230219000123123),(-1230219000123123) as t(a)").show(false) ``` ``` +------------------------------------+ \|timestamp \| +------------------------------------+ \|2008-12-25 23:30:00.123123\| \|1931-01-07 16:29:59.876877\| +------------------------------------+ ``` ### How was this patch tested? Unit test. Closes #28534 from TJX2014/master-SPARK-31710. Authored-by: TJX2014 <xiaoxingstack@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-22 14:16:30 +00:00
Kousuke Saruta	d95570864a	[SPARK-31756][WEBUI] Add real headless browser support for UI test ### What changes were proposed in this pull request? This PR mainly adds two things. 1. Real headless browser support for UI test 2. A test suite using headless Chrome as one instance of those browsers. Also, for environment where Chrome and Chrome driver is not installed, `ChromeUITest` tag is added to filter out the test suite. ### Why are the changes needed? In the current master, there are two problems for UI test. 1. Lots of tests especially JavaScript related ones are done manually. Appearance is better to be confirmed by our eyes but logic should be tested by test cases ideally. 2. Compared to the real web browsers, HtmlUnit doesn't seem to support JavaScript enough. I added a JavaScript related test before for SPARK-31534 using HtmlUnit which is simple library based headless browser for test. The test I added works somehow but some JavaScript related error is shown in unit-tests.log. ``` ======= EXCEPTION START ======== Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException] com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null\|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807) at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426) at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157) at java.lang.Thread.run(Thread.java:748) Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null\|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889) ... 10 more JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null\|function)". == CALLING JAVASCRIPT == function () { throw e; } ======= EXCEPTION END ======== ``` I tried to upgrade HtmlUnit to 2.40.0 but what is worse, the test become not working even though it works on real browsers like Chrome, Safari and Firefox without error. ``` [info] UISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped * FAILED * (17 seconds, 745 milliseconds) [info] The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1) ``` To resolve those problems, it's better to support headless browser for UI test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I tested with following patterns. Both Chrome and Chrome driver should be installed to test. 1. sbt / with chromedriver / include tag (expect to succeed) `build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"` 2. sbt / with chromedriver / exclude tag (expect to be ignored) `build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -l org.apache.spark.tags.ChromeUITest"` 3. sbt / without chromedriver / include tag (expect to be failed) `build/sbt "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"` 4. sbt / without chromedriver / exclude tag (expect to be skipped) `build/sbt -Dtest.exclude.tags=org.apache.spark.tags.ChromeUITest "testOnly org.apache.spark.ui.ChromeUISeleniumSuite"` 5. Maven / wth chromedriver / include tag (expect to succeed) `build/mvn -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test` 6. Maven / with chromedriver / exclude tag (expect to be skipped) `build/mvn -Dtest.exclude.tags="org.apache.spark.tags.ChromeUITest" -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test` 7. Maven / without chromedriver / include tag (expect to be failed) `build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test` 8. Maven / without chromedriver / exclude tag (expect to be skipped) `build/mvn -Dtest.exclude.tags=org.apache.spark.tags.ChromeUITest -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite test` Closes #28578 from sarutak/real-headless-browser-support. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-22 08:24:31 -05:00
GuoPhilipse	892b600ce3	[SPARK-31790][DOCS] cast(long as timestamp) show different result between Hive and Spark ### What changes were proposed in this pull request? add docs for sql migration-guide ### Why are the changes needed? let user know more about the cast scenarios in which Hive and Spark generate different results ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? no need to test Closes #28605 from GuoPhilipse/spark-docs. Lead-authored-by: GuoPhilipse <guofei_ok@126.com> Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-22 22:01:38 +09:00
Wenchen Fan	ce4da29ec3	[SPARK-31755][SQL] allow missing year/hour when parsing date/timestamp string ### What changes were proposed in this pull request? This PR allows missing hour fields when parsing date/timestamp string, with 0 as the default value. If the year field is missing, this PR still fail the query by default, but provides a new legacy config to allow it and use 1970 as the default value. It's not a good default value, as it is not a leap year, which means that it would never parse Feb 29. We just pick it for backward compatibility. ### Why are the changes needed? To keep backward compatibility with Spark 2.4. ### Does this PR introduce _any_ user-facing change? Yes. Spark 2.4: ``` scala> sql("select to_timestamp('16', 'dd')").show +------------------------+ \|to_timestamp('16', 'dd')\| +------------------------+ \| 1970-01-16 00:00:00\| +------------------------+ scala> sql("select to_date('16', 'dd')").show +-------------------+ \|to_date('16', 'dd')\| +-------------------+ \| 1970-01-16\| +-------------------+ scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show +----------------------------------+ \|to_timestamp('2019 40', 'yyyy mm')\| +----------------------------------+ \| 2019-01-01 00:40:00\| +----------------------------------+ scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show +----------------------------------------------+ \|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')\| +----------------------------------------------+ \| 2019-01-01 10:10:10\| +----------------------------------------------+ ``` in branch 3.0 ``` scala> sql("select to_timestamp('16', 'dd')").show +--------------------+ \|to_timestamp(16, dd)\| +--------------------+ \| null\| +--------------------+ scala> sql("select to_date('16', 'dd')").show +---------------+ \|to_date(16, dd)\| +---------------+ \| null\| +---------------+ scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show +------------------------------+ \|to_timestamp(2019 40, yyyy mm)\| +------------------------------+ \| 2019-01-01 00:00:00\| +------------------------------+ scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show +------------------------------------------+ \|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)\| +------------------------------------------+ \| 2019-01-01 00:00:00\| +------------------------------------------+ ``` After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled. ### How was this patch tested? new tests Closes #28576 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-22 16:10:08 +09:00
Xingbo Jiang	245aee9fc8	[SPARK-31757][CORE] Improve HistoryServerDiskManager.updateAccessTime() ### What changes were proposed in this pull request? The function `HistoryServerDiskManager`.`updateAccessTime()` would recompute the application store directory size every time it's triggered, this effort could be avoided because we already computed the new size outside the function call. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test cases. Closes #28579 from jiangxb1987/updateInfo. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2020-05-21 23:36:55 -07:00
yi.wu	83d0967dcc	[SPARK-31784][CORE][TEST] Fix test BarrierTaskContextSuite."share messages with allGather() call" ### What changes were proposed in this pull request? Change from `messages.toList.iterator` to `Iterator.single(messages.toList)`. ### Why are the changes needed? In this test, the expected result of `rdd2.collect().head` should actually be `List("0", "1", "2", "3")` but is `"0"` now. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated test. Thanks WeichenXu123 reported this problem. Closes #28596 from Ngone51/fix_allgather_test. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2020-05-21 23:34:11 -07:00
Max Gekk	60118a2426	[SPARK-31785][SQL][TESTS] Add a helper function to test all parquet readers ### What changes were proposed in this pull request? Add `withAllParquetReaders` to `ParquetTest`. The function allow to run a block of code for all available Parquet readers. ### Why are the changes needed? 1. It simplifies tests 2. Allow to test all parquet readers that could be available in projects based on Apache Spark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running affected test suites. Closes #28598 from MaxGekk/add-withAllParquetReaders. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-22 09:53:35 +09:00
Gengliang Wang	db5e5fce68	Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0" This reverts commit `92877c4ef2`. Closes #28602 from gengliangwang/revertSPARK-31765. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-05-21 16:00:58 -07:00
Kousuke Saruta	92877c4ef2	[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0 ### What changes were proposed in this pull request? This PR upgrades HtmlUnit. Selenium and Jetty also upgraded because of dependency. ### Why are the changes needed? Recently, a security issue which affects HtmlUnit is reported. https://nvd.nist.gov/vuln/detail/CVE-2020-5529 According to the report, arbitrary code can be run by malicious users. HtmlUnit is used for test so the impact might not be large but it's better to upgrade it just in case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing testcases. Closes #28585 from sarutak/upgrade-htmlunit. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-05-21 11:43:25 -07:00
Thomas Graves	b64688ebba	[SPARK-29303][WEB UI] Add UI support for stage level scheduling ### What changes were proposed in this pull request? This adds UI updates to support stage level scheduling and ResourceProfiles. 3 main things have been added. ResourceProfile id added to the Stage page, the Executors page now has an optional selectable column to show the ResourceProfile Id of each executor, and the Environment page now has a section with the ResourceProfile ids. Along with this the rest api for environment page was updated to include the Resource profile information. I debating on splitting the resource profile information into its own page but I wasn't sure it called for a completely separate page. Open to peoples thoughts on this. Screen shots: ![Screen Shot 2020-04-01 at 3 07 46 PM](https://user-images.githubusercontent.com/4563792/78185169-469a7000-7430-11ea-8b0c-d9ede2d41df8.png) ![Screen Shot 2020-04-01 at 3 08 14 PM](https://user-images.githubusercontent.com/4563792/78185175-48fcca00-7430-11ea-8d1d-6b9333700f32.png) ![Screen Shot 2020-04-01 at 3 09 03 PM](https://user-images.githubusercontent.com/4563792/78185176-4a2df700-7430-11ea-92d9-73c382bb0f32.png) ![Screen Shot 2020-04-01 at 11 05 48 AM](https://user-images.githubusercontent.com/4563792/78185186-4dc17e00-7430-11ea-8962-f749dd47ea60.png) ### Why are the changes needed? For user to be able to know what resource profile was used with which stage and executors. The resource profile information is also available so user debugging can see exactly what resources were requested with that profile. ### Does this PR introduce any user-facing change? Yes, UI updates. ### How was this patch tested? Unit tests and tested on yarn both active applications and with the history server. Closes #28094 from tgravescs/SPARK-29303-pr. Lead-authored-by: Thomas Graves <tgraves@nvidia.com> Co-authored-by: Thomas Graves <tgraves@apache.org> Signed-off-by: Thomas Graves <tgraves@apache.org>	2020-05-21 13:11:35 -05:00
iRakson	f1495c5bc0	[SPARK-31688][WEBUI] Refactor Pagination framework ### What changes were proposed in this pull request? Currently while implementing pagination using the existing pagination framework, a lot of code is being copied as pointed out [here](https://github.com/apache/spark/pull/28485#pullrequestreview-408881656). I introduced some changes in `PagedTable` which is the main trait for implementing the pagination. * Added function for getting table parameters. * Added a function for table header row. This will help in maintaining consistency across the tables. All the header rows across tables will be consistent now. ### Why are the changes needed? * A lot of code is copied every time pagination is implemented for any table. * Code readability is not great as lot of HTML is embedded. * Paginating other tables will be a lot easier now. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. This is mainly refactoring work, no new functionality introduced. Existing test cases should pass. Closes #28512 from iRakson/refactorPaginationFramework. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-21 13:00:00 -05:00
Vinoo Ganesh	dae79888dc	[SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener ## What changes were proposed in this pull request? This change was made as a result of the conversation on https://issues.apache.org/jira/browse/SPARK-31354 and is intended to continue work from that ticket here. This change fixes a memory leak where SparkSession listeners are never cleared off of the SparkContext listener bus. Before running this PR, the following code: ``` SparkSession.builder().master("local").getOrCreate() SparkSession.clearActiveSession() SparkSession.clearDefaultSession() SparkSession.builder().master("local").getOrCreate() SparkSession.clearActiveSession() SparkSession.clearDefaultSession() ``` would result in a SparkContext with the following listeners on the listener bus: ``` [org.apache.spark.status.AppStatusListener5f610071, org.apache.spark.HeartbeatReceiverd400c17, org.apache.spark.sql.SparkSession$$anon$125849aeb, <-First instance org.apache.spark.sql.SparkSession$$anon$1fadb9a0] <- Second instance ``` After this PR, the execution of the same code above results in SparkContext with the following listeners on the listener bus: ``` [org.apache.spark.status.AppStatusListener5f610071, org.apache.spark.HeartbeatReceiverd400c17, org.apache.spark.sql.SparkSession$$anon$125849aeb] <-One instance ``` ## How was this patch tested? * Unit test included as a part of the PR Closes #28128 from vinooganesh/vinooganesh/SPARK-27958. Lead-authored-by: Vinoo Ganesh <vinoo.ganesh@gmail.com> Co-authored-by: Vinoo Ganesh <vganesh@palantir.com> Co-authored-by: Vinoo Ganesh <vinoo@safegraph.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-21 16:06:28 +00:00
Max Gekk	5d673319af	[SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString ### What changes were proposed in this pull request? 1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings: - TimestampFormatter: - `def format(ts: Timestamp): String` - `def format(instant: Instant): String` - DateFormatter: - `def format(date: Date): String` - `def format(localDate: LocalDate): String` 2. Re-use the added methods from `HiveResult.toHiveString` 3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions. ### Why are the changes needed? To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests for toHiveString and new tests in `TimestampFormatterSuite`. Closes #28582 from MaxGekk/opt-format-old-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-21 04:01:19 +00:00
Dongjoon Hyun	a06768ec4d	[SPARK-31780][K8S][TESTS] Add R test tag to exclude R K8s image building and test ### What changes were proposed in this pull request? This PR aims to skip R image building and one R test during integration tests by using `--exclude-tags r`. ### Why are the changes needed? We have only one R integration test case, `Run SparkR on simple dataframe.R example`, for submission test coverage. Since this is rarely changed, we can skip this and save the efforts required for building the whole R image and running the single test. ``` KubernetesSuite: ... - Run SparkR on simple dataframe.R example Run completed in 10 minutes, 20 seconds. Total number of tests run: 20 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the K8S integration test and do the following manually. (Note that R test is skipped) ``` $ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --deploy-mode docker-for-desktop --exclude-tags r --spark-tgz $PWD/spark-*.tgz ... KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - Test basic decommissioning Run completed in 10 minutes, 23 seconds. Total number of tests run: 19 Suites: completed 2, aborted 0 Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28594 from dongjoon-hyun/SPARK-31780. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-20 18:33:38 -07:00
yi.wu	26bc690722	[SPARK-30689][CORE][FOLLOW-UP] Add @since annotation for ResourceDiscoveryScriptPlugin/ResourceInformation ### What changes were proposed in this pull request? Added `since 3.0.0` for `ResourceDiscoveryScriptPlugin` and `ResourceInformation`. ### Why are the changes needed? It's required for exposed APIs(https://github.com/apache/spark/pull/27689#discussion_r426488851). ### Does this PR introduce _any_ user-facing change? Yes, they can easily know when does Spark introduces the API. ### How was this patch tested? Pass Jenkins. Closes #28591 from Ngone51/followup-30689. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-21 03:11:20 +09:00
Ali Smesseim	d40ecfa3f7	[SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener ### What changes were proposed in this pull request? This is a recreation of #28155, which was reverted due to causing test failures. The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log. To improve robustness, we also make the following changes in HiveSessionImpl.close(): - Catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed. - Handle not being able to access the scratch directory. When closing, all `.pipeout` files are removed from the scratch directory, which would have resulted in an NPE if the directory does not exist. ### Why are the changes needed? The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this changes the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests Closes #28544 from alismess-db/hive-thriftserver-listener-update-safer-2. Authored-by: Ali Smesseim <ali.smesseim@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-20 10:30:17 -07:00
Kent Yao	7e2ed40d58	[SPARK-31759][DEPLOY] Support configurable max number of rotate logs for spark daemons ### What changes were proposed in this pull request? in `spark-daemon.sh`, `spark_rotate_log()` accepts `$2` as a custom setting for the number of maximum rotate log files, but this part of code is actually never used. This PR adds `SPARK_LOG_MAX_FILES` environment variable to represent the maximum log files of Spark daemons can rotate to. ### Why are the changes needed? the logs files that all spark daemons are hardcoded as 5, but it supposed to be configurable ### Does this PR introduce _any_ user-facing change? yes, SPARK_LOG_MAX_FILES is added to represent the maximum log files of Spark daemons can rotate to. ### How was this patch tested? verify locally for the added shell script: ```shell kentyaohulk  ~  SPARK_LOG_MAX_FILES=1 sh test.sh 1 kentyaohulk  ~  SPARK_LOG_MAX_FILES=a sh test.sh Error: SPARK_LOG_MAX_FILES must be a postive number ✘ kentyaohulk  ~  SPARK_LOG_MAX_FILES=b sh test.sh Error: SPARK_LOG_MAX_FILES must be a postive number ✘ kentyaohulk  ~  SPARK_LOG_MAX_FILES=-1 sh test.sh Error: SPARK_LOG_MAX_FILES must be a postive number ✘ kentyaohulk  ~  sh test.sh 5 ✘ kentyaohulk  ~  cat test.sh #!/bin/bash if [[ -z ${SPARK_LOG_MAX_FILES} ]] ; then num=5 elif [[ ${SPARK_LOG_MAX_FILES} -gt 0 ]]; then num=${SPARK_LOG_MAX_FILES} else echo "Error: SPARK_LOG_MAX_FILES must be a postive number" exit -1 fi ``` Closes #28580 from yaooqinn/SPARK-31759. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-20 19:18:05 +09:00
Izek Greenfield	eaf7a2a4ed	[SPARK-8981][CORE][TEST-HADOOP3.2][TEST-JAVA11] Add MDC support in Executor ### What changes were proposed in this pull request? Added MDC support in all thread pools. ThreaddUtils create new pools that pass over MDC. ### Why are the changes needed? In many cases, it is very hard to understand from which actions the logs in the executor come from. when you are doing multi-thread work in the driver and send actions in parallel. ### Does this PR introduce any user-facing change? No ### How was this patch tested? No test added because no new functionality added it is thread pull change and all current tests pass. Closes #26624 from igreenfield/master. Authored-by: Izek Greenfield <igreenfield@axiomsl.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-20 07:41:00 +00:00

... 2 3 4 5 6 ...

27437 commits