ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Maxim Gekk	d33ae2e9ed	[SPARK-26953][CORE][TEST] Disable result checking in the test: java.lang.ArrayIndexOutOfBoundsException in TimSort ## What changes were proposed in this pull request? I propose to disable (comment out) result checking in `SorterSuite`.`java.lang.ArrayIndexOutOfBoundsException in TimSort` because: 1. The check is optional, and the correctness of TimSort is checked by other tests. The purpose of this test is to check that TimSort doesn't fail with `ArrayIndexOutOfBoundsException`. 2. Disabling it significantly reduces the execution time of the test. Here are the timings from running the test locally: ``` Sort: 1.4 seconds Result checking: 15.6 seconds ``` ## How was this patch tested? By `SorterSuite`. Closes #24343 from MaxGekk/timsort-test-speedup. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-11 07:58:57 -07:00
Gengliang Wang	4177292dcd	[SPARK-27435][SQL] Support schema pruning in ORC V2 ## What changes were proposed in this pull request? Currently, the optimization rule `SchemaPruning` only works for Parquet/Orc V1. We should have the same optimization in ORC V2. ## How was this patch tested? Unit test Closes #24338 from gengliangwang/schemaPruningForV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-11 20:03:32 +08:00
chakravarthiT	074533334d	[SPARK-27088][SQL] Add a configuration to set log level for each batch at RuleExecutor ## What changes were proposed in this pull request? Similar to #22406, which made the log level for plan changes by each rule configurable, this PR makes the log level for plan changes by each batch configurable. It reuses the same configuration: "spark.sql.optimizer.planChangeLog.level". Config proposed in this PR: spark.sql.optimizer.planChangeLog.batches - enable plan change logging only for a set of specified batches, separated by commas. ## How was this patch tested? Added UT, also tested manually and attached screenshots below. 1) Setting spark.sql.optimizer.planChangeLog.level to warn. ![settingLogLevelToWarn](https://user-images.githubusercontent.com/45845595/54556730-8803dd00-49df-11e9-95ab-ebb0c8d735ef.png) 2) Setting spark.sql.optimizer.planChangeLog.batches to Resolution and Subquery. ![settingBatchestoLog](https://user-images.githubusercontent.com/45845595/54556740-8cc89100-49df-11e9-80ab-fbbbe1ff2cdf.png) 3) Plan change logging enabled only for a set of specified batches (Resolution and Subquery) ![batchloggingOp](https://user-images.githubusercontent.com/45845595/54556788-ab2e8c80-49df-11e9-9ae0-57815f552896.png) Closes #24136 from chakravarthiT/logBatches. Lead-authored-by: chakravarthiT <45845595+chakravarthiT@users.noreply.github.com> Co-authored-by: chakravarthiT <tcchakra@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-11 10:02:27 +09:00
ocaballero	181d190c60	[MINOR][SQL] Unnecessary access to externalCatalog Avoid accessing the external catalog when it is not needed ## What changes were proposed in this pull request? The existsFunction function has been changed because it unnecessarily accessed the externalCatalog to check whether the database exists, even when the function is in the functionRegistry. ## How was this patch tested? It has been tested through spark-shell and by inspecting the Hive metastore logs. Inside spark-shell we run `spark.table(%tableA%).selectExpr("trim(%columnA%)")`; in the current version the following appears every time: org.apache.hadoop.hive.metastore.HiveMetaStore.audit: cmd = get_database: default Once the change is made, no such record appears. Closes #24312 from OCaballero/master. Authored-by: ocaballero <oliver.caballero.alvarez@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-11 10:00:09 +09:00
Maxim Gekk	ab8710b579	[SPARK-27423][SQL] Cast DATE <-> TIMESTAMP according to the SQL standard ## What changes were proposed in this pull request? According to the SQL standard, a value of `DATE` type is a union of year, month, dayInMonth, and it is independent of any time zone. To convert it to Catalyst's `TIMESTAMP`, the `DATE` value should be "extended" by the time at midnight - `00:00:00`. The resulting local date+time should be considered as a timestamp in the session time zone, and converted to microseconds since epoch in `UTC` accordingly. The reverse cast from `TIMESTAMP` to `DATE` should be performed in a similar way. `TIMESTAMP` values should be represented as a local date+time in the session time zone, and the time component should simply be removed. For example, `TIMESTAMP 2019-04-10 00:10:12` -> `DATE 2019-04-10`. The resulting date is converted to days since epoch `1970-01-01`. ## How was this patch tested? The changes were tested by existing test suites - `DateFunctionsSuite`, `DateExpressionsSuite` and `CastSuite`. Closes #24332 from MaxGekk/cast-timestamp-to-date2. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 22:41:19 +08:00
Maxim Gekk	1470f23ec9	[SPARK-27422][SQL] current_date() should return current date in the session time zone ## What changes were proposed in this pull request? In the PR, I propose to revert 2 commits `06abd06112` and `61561c1c2d`, and take current date via `LocalDate.now` in the session time zone. The result is stored as days since epoch `1970-01-01`. ## How was this patch tested? It was tested by `DateExpressionsSuite`, `DateFunctionsSuite`, `DateTimeUtilsSuite`, and `ComputeCurrentTimeSuite`. Closes #24330 from MaxGekk/current-date2. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 21:54:50 +08:00
10129659	5ea4deec44	[SPARK-26012][SQL] Null and '' values should not cause dynamic partition failure of string types Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously. For example, the test below will fail before this PR: test("Null and '' values should not cause dynamic partition failure of string types") { withTable("t1", "t2") { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" + " from t1").write.partitionBy("p").saveAsTable("t2") checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null))) } } The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'. This PR converts empty strings to null for partition values. This is an alternative approach to the PR https://github.com/apache/spark/pull/23010. How was this patch tested? Newly added test. Closes #24334 from eatoncys/FileFormatWriter. Authored-by: 10129659 <chen.yanshan@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 19:54:19 +08:00
韩田田00222924	85e5d4f141	[SPARK-24872] Replace taking the $symbol with $sqlOperator in BinaryOperator's toString method ## What changes were proposed in this pull request? For BinaryOperator's toString method, it's better to use `$sqlOperator` instead of `$symbol`. ## How was this patch tested? We can test this patch with unit tests. Closes #21826 from httfighter/SPARK-24872. Authored-by: 韩田田00222924 <han.tiantian@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 16:58:01 +08:00
Wenchen Fan	2e90574dd0	[SPARK-27414][SQL] make it clear that date type is timezone independent ## What changes were proposed in this pull request? In the SQL standard, the date type is a union of the `year`, `month` and `day` fields. It's timezone independent, which means it does not represent a specific point in the timeline. Spark SQL follows the SQL standard. This PR makes it clear that the date type is timezone independent: 1. improve the doc to highlight that date is timezone independent. 2. when converting string to date, use the java time API that can directly parse a `LocalDate` from a string, instead of converting `LocalDate` to an `Instant` at UTC first. 3. when converting date to string, use the java time API that can directly format a `LocalDate` to a string, instead of converting `LocalDate` to an `Instant` at UTC first. 2 and 3 should not introduce any behavior changes. ## How was this patch tested? existing tests Closes #24325 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 16:39:28 +08:00
Ryan Blue	58674d54ba	[SPARK-27181][SQL] Add public transform API ## What changes were proposed in this pull request? This adds a public Expression API that can be used to pass partition transformations to data sources. ## How was this patch tested? Existing tests to validate no regressions. Added transform cases to DDL suite and v1 conversions suite. Closes #24117 from rdblue/add-public-transform-api. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 14:30:39 +08:00
Liang-Chi Hsieh	08858f6abc	[SPARK-27253][SQL][FOLLOW-UP] Update doc about parent-session configuration priority ## What changes were proposed in this pull request? The PR #24189 changes the behavior of merging SparkConf. The existing doc is not updated for it. This is a followup of it to update the doc. ## How was this patch tested? Doc only change. Closes #24326 from viirya/SPARK-27253-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-10 13:21:21 +09:00
Sean Owen	05f6b87e81	[SPARK-27410][MLLIB] Remove deprecated / no-op mllib.KMeans getRuns, setRuns ## What changes were proposed in this pull request? Remove deprecated / no-op mllib.KMeans getRuns, setRuns mllib.KMeans has getRuns, setRuns methods which haven't done anything since Spark 2.1. They're deprecated, and no-ops, and should be removed for Spark 3. ## How was this patch tested? Existing tests. Closes #24320 from srowen/SPARK-27410. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-09 19:13:35 -05:00
Bryan Cutler	f62f44f2a2	[SPARK-27387][PYTHON][TESTS] Replace sqlutils.assertPandasEqual with Pandas assert_frame_equals ## What changes were proposed in this pull request? Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types: `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this. This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs. Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times. ## How was this patch tested? Existing tests Closes #24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-10 07:50:25 +09:00
Maxim Gekk	63e4bf42c2	[SPARK-27401][SQL] Refactoring conversion of Timestamp to/from java.sql.Timestamp ## What changes were proposed in this pull request? In the PR, I propose a simpler implementation of `toJavaTimestamp()`/`fromJavaTimestamp()` by reusing existing functions of `DateTimeUtils`. This allows us to: - Simplify the implementation of `toJavaTimestamp()` and properly handle negative inputs. - Detect `Long` overflow in the conversion of milliseconds (`java.sql.Timestamp`) to microseconds (Catalyst's Timestamp). ## How was this patch tested? By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `CastSuite`. And by new benchmark for export/import timestamps added to `DateTimeBenchmark`: Before: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 290 335 49 17.2 58.0 1.0X Collect longs 1234 1681 487 4.1 246.8 0.2X Collect timestamps 1718 1755 63 2.9 343.7 0.2X ``` After: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 283 301 19 17.7 56.6 1.0X Collect longs 1048 1087 36 4.8 209.6 0.3X Collect timestamps 1425 1479 56 3.5 285.1 0.2X ``` Closes #24311 from MaxGekk/conv-java-sql-date-timestamp. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-09 15:42:27 -07:00
Gengliang Wang	3db117e43e	[SPARK-27407][SQL] File source V2: Invalidate cache data on overwrite/append ## What changes were proposed in this pull request? File source V2 currently incorrectly continues to use cached data even if the underlying data is overwritten. We should follow https://github.com/apache/spark/pull/13566 and fix it by invalidating and refreshing all the cached data (and the associated metadata) for any Dataframe that contains the given data source path. ## How was this patch tested? Unit test Closes #24318 from gengliangwang/invalidCache. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-09 09:25:37 -07:00
Shixiong Zhu	5ff39cd5ee	[SPARK-27394][WEBUI] Flush LiveEntity if necessary when receiving SparkListenerExecutorMetricsUpdate ## What changes were proposed in this pull request? This PR updates `AppStatusListener` to flush `LiveEntity` if necessary when receiving `SparkListenerExecutorMetricsUpdate`. This will ensure the staleness of Spark UI doesn't last more than the executor heartbeat interval. ## How was this patch tested? The new unit test. Closes #24303 from zsxwing/SPARK-27394. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-09 08:26:00 -07:00
francis0407	601fac2cb3	[SPARK-27411][SQL] DataSourceV2Strategy should not eliminate subquery ## What changes were proposed in this pull request? In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake after normalizing filters. We have a sql with a scalar subquery: ``` scala val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)") plan.explain(true) ``` And we get the log info of DataSourceV2Strategy: ``` Pushing operators to csv:examples/src/main/resources/t2.txt Pushed Filters: Post-Scan Filters: isnotnull(t2a#30) Output: t2a#30, t2b#31 ``` The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it by mistake. ``` == Parsed Logical Plan == 'Project [] +- 'Filter ('t2a > scalar-subquery#56 []) : +- 'Project [unresolvedalias('max('t1a), None)] : +- 'UnresolvedRelation `t1` +- 'UnresolvedRelation `t2` == Analyzed Logical Plan == t2a: string, t2b: string Project [t2a#30, t2b#31] +- Filter (t2a#30 > scalar-subquery#56 []) : +- Aggregate [max(t1a#13) AS max(t1a)#63] : +- SubqueryAlias `t1` : +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt +- SubqueryAlias `t2` +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt == Optimized Logical Plan == Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 [])) : +- Aggregate [max(t1a#13) AS max(t1a)#63] : +- Project [t1a#13] : +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt == Physical Plan == (1) Project [t2a#30, t2b#31] +- (1) Filter isnotnull(t2a#30) +- (1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan ``` ## How was this patch tested? ut Closes #24321 from francis0407/SPARK-27411. Authored-by: francis0407 <hanmingcong123@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 21:45:46 +08:00
mingbo_pb	3e4cfe9dbc	[SPARK-27406][SQL] UnsafeArrayData serialization breaks when two machines have different Oops size ## What changes were proposed in this pull request? ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize endpoints. When the UnsafeArrayData is serialized with Java serialization, the BYTE_ARRAY_OFFSET in memory can change if two machines have different pointer width (Oops in JVM). This PR fixes this issue in the same way as https://github.com/apache/spark/pull/9030 ## How was this patch tested? A manual test has been done in our tpcds environment, and a corresponding unit test case has been added as well. Closes #24317 from pengbo/SPARK-27406. Authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 15:41:42 +08:00
Hyukjin Kwon	f16dfb9129	[SPARK-27328][SQL] Add 'deprecated' in ExpressionDescription for extended usage and SQL doc ## What changes were proposed in this pull request? This PR proposes two things: 1. Add `deprecated` field to `ExpressionDescription` so that it can be shown in our SQL function documentation (https://spark.apache.org/docs/latest/api/sql/), and it can be shown via `DESCRIBE FUNCTION EXTENDED`. 2. While I am here, add some more restrictions for `note()` and `since()`. It looks like some documentation is broken due to malformed `note`: ![Screen Shot 2019-03-31 at 3 00 53 PM](https://user-images.githubusercontent.com/6477701/55285518-a3e88500-53c8-11e9-9e99-41d857794fbe.png) It should start with 4 spaces and end with a newline. I added some asserts, and fixed the instances together while I am here. This is technically a breaking change but I think it's too trivial to note somewhere (and we're in Spark 3.0.0). This PR adds the `deprecated` property to `from_utc_timestamp` and `to_utc_timestamp` (deprecated as of #24195) as examples of using this field. Now it shows the deprecation information as below: - SQL documentation is shown as below: ![Screen Shot 2019-03-31 at 3 07 31 PM](https://user-images.githubusercontent.com/6477701/55285537-2113fa00-53c9-11e9-9932-f5693a03332d.png) - `DESCRIBE FUNCTION EXTENDED from_utc_timestamp;`: ``` Function: from_utc_timestamp Class: org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp Usage: from_utc_timestamp(timestamp, timezone) - Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'. Extended Usage: Examples: > SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul'); 2016-08-31 09:00:00 Since: 1.5.0 Deprecated: Deprecated since 3.0.0. See SPARK-25496. ``` ## How was this patch tested? 
Manually tested via: - For documentation verification: ``` $ cd sql $ sh create-docs.sh ``` - For checking description: ``` $ ./bin/spark-sql ``` ``` spark-sql> DESCRIBE FUNCTION EXTENDED from_utc_timestamp; spark-sql> DESCRIBE FUNCTION EXTENDED to_utc_timestamp; ``` Closes #24259 from HyukjinKwon/SPARK-27328. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 13:49:42 +08:00
Wenchen Fan	051336d9dd	[SPARK-25496][SQL][FOLLOWUP] avoid using to_utc_timestamp ## What changes were proposed in this pull request? in https://github.com/apache/spark/pull/24195 , we deprecate `from/to_utc_timestamp`. This PR removes unnecessary use of `to_utc_timestamp` in the test. ## How was this patch tested? test only PR Closes #24319 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 10:13:38 +08:00
Rafael Renaudin	dfa2328e28	[SPARK-26881][MLLIB] Heuristic for tree aggregate depth Changes proposed: - Adding method to compute treeAggregate depth required to avoid exceeding driver max result size (first commit) - Using it in the computation of grammian of RowMatrix (second commit) Tests: - Unit Test wise, one unit test checking the behavior of the depth computation method - Tested at scale on hadoop cluster by doing PCA on a large dataset (needed depth 3 to succeed) Debatable choice: I'm not sure if RDD API is the right place to put the depth computation method. The advantage of it is that it allows to access driver max result size, and rdd number of partitions, to set default arguments for the method. Semantically, such a method might belong to something like org.apache.spark.util.Utils though. Closes #23983 from gagafunctor/Heuristic_for_treeAggregate_depth. Authored-by: Rafael Renaudin <renaudin.rafael@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-08 20:56:53 -05:00
Gengliang Wang	d50603a37c	[SPARK-27271][SQL] Migrate Text to File Data Source V2 ## What changes were proposed in this pull request? Migrate Text source to File Data Source V2 ## How was this patch tested? Unit test Closes #24207 from gengliangwang/textV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-08 10:15:22 -07:00
Maxim Gekk	00241733a6	[SPARK-27405][SQL][TEST] Restrict the range of generated random timestamps ## What changes were proposed in this pull request? In the PR, I propose to restrict the range of random timestamp literals generated in `LiteralGenerator.timestampLiteralGen`. The generator creates instances of `java.sql.Timestamp` by passing milliseconds since epoch as `Long` type. Converting the milliseconds to microseconds can cause arithmetic overflow of Long type because Catalyst's Timestamp type stores microseconds since epoch in `Long` type internally as well. Proposed interval of random milliseconds is `[Long.MinValue / 1000, Long.MaxValue / 1000]`. For example, generated timestamp `new java.sql.Timestamp(-3948373668011580000)` causes `Long` overflow at the method: ```scala def fromJavaTimestamp(t: Timestamp): SQLTimestamp = { ... MILLISECONDS.toMicros(t.getTime()) + NANOSECONDS.toMicros(t.getNanos()) % NANOS_PER_MICROS ... } ``` because `t.getTime()` returns `-3948373668011580000` which is multiplied by `1000` at `MILLISECONDS.toMicros`, and the result `-3948373668011580000000` is less than `Long.MinValue`. ## How was this patch tested? By `DateExpressionsSuite` in the PR https://github.com/apache/spark/pull/24311 Closes #24316 from MaxGekk/random-timestamps-gen. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-08 09:53:00 -07:00
LantaoJin	52838e74af	[SPARK-13704][CORE][YARN] Reduce rack resolution time ## What changes were proposed in this pull request? When you submit a stage on a large cluster, rack resolving takes a long time when initializing TaskSetManager because a script is invoked to resolve the rack of each host, one by one. With the current implementation, it takes 30~40 seconds to resolve the racks in our 5000-node cluster. After applying the patch, it decreased to less than 15 seconds. YARN-9332 has added an interface to handle multiple hosts in one invocation to save time. But before upgrading to the newest Hadoop, we could construct the same tool in Spark to resolve this issue. ## How was this patch tested? UT and manual testing on a 5000-node cluster. Closes #24245 from squito/SPARK-13704_update. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-04-08 10:47:06 -05:00
Yuming Wang	33f3c48cac	[SPARK-27176][SQL] Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4 ## What changes were proposed in this pull request? This PR mainly contains: 1. Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4. 2. Resolve compatibility issues between Hive 1.2.1 and Hive 2.3.4 in the `sql/hive` module. ## How was this patch tested? jenkins test hadoop-2.7 manual test hadoop-3: ```shell build/sbt clean package -Phadoop-3.2 -Phive export SPARK_PREPEND_CLASSES=true # rm -rf metastore_db cat <<EOF > test_hadoop3.scala spark.range(10).write.saveAsTable("test_hadoop3") spark.table("test_hadoop3").show EOF bin/spark-shell --conf spark.hadoop.hive.metastore.schema.verification=false --conf spark.hadoop.datanucleus.schema.autoCreateAll=true -i test_hadoop3.scala ``` Closes #23788 from wangyum/SPARK-23710-hadoop3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-08 08:42:21 -07:00
Michael Allman	215609def2	[SPARK-25407][SQL] Allow nested access for non-existent field for Parquet file when nested pruning is enabled ## What changes were proposed in this pull request? As part of schema clipping in `ParquetReadSupport.scala`, we add fields in the Catalyst requested schema which are missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires we ignore unrequested field data when reading from a Parquet file. Therefore we pass two schemas to `ParquetRecordMaterializer`: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two. Aside from checking whether schema pruning is enabled, there is an additional complication to constructing the Parquet requested schema. The manner in which Spark's two Parquet readers reconcile the differences between the Parquet requested schema and the Catalyst requested schema differs. Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema. It includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file. Spark's parquet-mr based reader supports reading Parquet files of any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, in the case where we use the parquet-mr reader we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that's set in the `ReadContext`. 
_Additional description (by HyukjinKwon):_ Let's suppose that we have a Parquet schema as below: ``` message spark_schema { required int32 id; optional group name { optional binary first (UTF8); optional binary last (UTF8); } optional binary address (UTF8); } ``` Currently, the clipped schema is as follows: ``` message spark_schema { optional group name { optional binary middle (UTF8); } optional binary address (UTF8); } ``` Parquet MR does not support access to the nested non-existent field (`name.middle`). To work around this, this PR removes the `name.middle` request to the Parquet reader entirely, as below: ``` Parquet requested schema: message spark_schema { optional binary address (UTF8); } ``` and produces the record (`name.middle`) properly as the requested Catalyst schema. ``` root |-- name: struct (nullable = true) | |-- middle: string (nullable = true) |-- address: string (nullable = true) ``` I think technically this is what the Parquet library should support, since the Parquet library made a design decision to produce `null` for non-existent fields IIRC. This PR aims to work around it. ## How was this patch tested? A previously ignored test case which exercises the failure scenario this PR addresses has been enabled. This closes #22880 Closes #24307 from dongjoon-hyun/SPARK-25407. Lead-authored-by: Michael Allman <msa@allman.ms> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-08 22:26:02 +09:00
Gengliang Wang	02e9f93309	[SPARK-27384][SQL] File source V2: Prune unnecessary partition columns ## What changes were proposed in this pull request? When scanning file sources, we can prune unnecessary partition columns on constructing input partitions, so that: 1. Reduce the data transformation from Driver to Executors 2. Make it easier to implement columnar batch readers, since the partition columns are already pruned. ## How was this patch tested? Existing unit tests. Closes #24296 from gengliangwang/prunePartitionValue. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-08 15:14:02 +08:00
HyukjinKwon	18b36ee5ba	[SPARK-27253][SQL][FOLLOW-UP] Add a note about parent-session configuration priority in migration guide ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/24189. It adds a note about parent-session configuration priority. ## How was this patch tested? Manually built the site and checked. Closes #24279 from HyukjinKwon/SPARK-27253. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-08 09:14:40 +09:00
Yuming Wang	017919b636	[SPARK-27383][SQL][TEST] Avoid using hard-coded jar names in Hive tests ## What changes were proposed in this pull request? This PR avoids using hard-coded jar names (`hive-contrib-0.13.1.jar` and `hive-hcatalog-core-0.13.1.jar`) in Hive tests. This makes them easy to update when upgrading the built-in Hive to 2.3.4. ## How was this patch tested? Existing test Closes #24294 from wangyum/SPARK-27383. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-06 18:06:52 -05:00
gengjiaan	53e31e2ca1	[SPARK-27399][STREAMING][KAFKA] Arrange scattered config and reduce hardcode for kafka 10. ## What changes were proposed in this pull request? I found a lot of scattered configs in `Kafka` streaming. I think these configs should be arranged in one unified place. ## How was this patch tested? No UT needed. Closes #24267 from beliefer/arrange-scattered-streaming-kafka-config. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-06 18:05:15 -05:00
cxzl25	6450c5948a	[SPARK-26992][STS] Fix STS scheduler pool correct delivery ## What changes were proposed in this pull request? The user sets the value of spark.sql.thriftserver.scheduler.pool. Spark thrift server saves this value in the LocalProperty of threadlocal type, but does not clean up after running, causing other sessions to run in the previously set pool name. ## How was this patch tested? manual tests Closes #23895 from cxzl25/thrift_server_scheduler_pool_pollute. Lead-authored-by: cxzl25 <cxzl25@users.noreply.github.com> Co-authored-by: sychen <sychen@ctrip.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-06 17:14:29 -05:00
Jose Torres	4a5768b2a2	[SPARK-27391][SS] Don't initialize a lazy val in ContinuousExecution job. ## What changes were proposed in this pull request? Fix a potential deadlock in ContinuousExecution by not initializing the toRDD lazy val. Closes #24301 from jose-torres/deadlock. Authored-by: Jose Torres <torres.joseph.f+github@gmail.com> Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>	2019-04-05 12:56:36 -07:00
gengjiaan	979bb905b7	[SPARK-26936][SQL] Fix bug of insert overwrite local dir can not create temporary path in local staging directory ## What changes were proposed in this pull request? The environment of my cluster is as follows: ``` OS:Linux version 2.6.32-220.7.1.el6.x86_64 (mockbuildc6b18n3.bsys.dev.centos.org) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 7 00:52:02 GMT 2012 Hadoop: 2.7.2 Spark: 2.3.0 or 3.0.0(master branch) Hive: 1.2.1 ``` My Spark runs in yarn-client deploy mode. If I execute the SQL `insert overwrite local directory '/home/test/call_center/' select * from call_center`, a HiveException appears as follows: `Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-10000/_temporary/0/_temporary/attempt_20190219173233_0002_m_000000_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_000011) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)` Currently, Spark SQL generates a local temporary path in the local staging directory. The scheme of the local temporary path starts with `file`, so the HiveException appears. This PR changes the local temporary path to an HDFS temporary path, and uses a DistributedFileSystem instance to copy the data from the HDFS temporary path to the local directory. If Spark runs in local deploy mode, 'insert overwrite local directory' works fine. ## How was this patch tested? UT cannot cover yarn-client mode. The patch was tested in my production environment. Closes #23841 from beliefer/fix-bug-of-insert-overwrite-local-dir. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-05 14:02:46 -05:00
liulijia	39f75b4588	[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores ## What changes were proposed in this pull request? Check `spark.task.cpus` before creating the TaskScheduler in SparkContext. ## How was this patch tested? UT Closes #24261 from liutang123/SPARK-27192. Authored-by: liulijia <liutang123@yeah.net> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-05 13:55:57 -05:00
Dongjoon Hyun	982c4c8e3c	[SPARK-27390][CORE][SQL][TEST] Fix package name mismatch ## What changes were proposed in this pull request? This PR aims to clean up package name mismatches. ## How was this patch tested? Pass the Jenkins. Closes #24300 from dongjoon-hyun/SPARK-27390. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-05 11:50:37 -07:00
Sean Owen	23bde44797	[SPARK-27358][UI] Update jquery to 1.12.x to pick up security fixes ## What changes were proposed in this pull request? Update jquery -> 1.12.4, datatables -> 1.10.18, mustache -> 2.3.12. Add missing mustache license ## How was this patch tested? I manually tested the UI locally with the javascript console open and didn't observe any problems or JS errors. The only 'risky' change seems to be mustache, but on reading its release notes, don't think the changes from 0.8.1 to 2.x would affect Spark's simple usage. Closes #24288 from srowen/SPARK-27358. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-05 12:54:01 -05:00
Jungtaek Lim (HeartSaVioR)	a840b99daf	[MINOR][DOC] Fix html tag broken in configuration.md ## What changes were proposed in this pull request? This patch fixes wrong HTML tag in configuration.md which breaks the table tag. This is originally reported in dev mailing list: https://lists.apache.org/thread.html/744bdc83b3935776c8d91bf48fdf80d9a3fed3858391e60e343206f9%3Cdev.spark.apache.org%3E ## How was this patch tested? This change is one-liner and pretty obvious so I guess we may be able to skip testing. Closes #24304 from HeartSaVioR/MINOR-configuration-doc-html-tag-error. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-05 08:41:19 -07:00
gatorsmile	5678e687c6	[SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused ## What changes were proposed in this pull request? With this change, we can easily identify the plan difference when subquery is reused. When the reuse is enabled, the plan looks like ``` == Physical Plan == CollectLimit 1 +- (1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253] : :- Subquery subquery240 : : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L]) : : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : : +- Scan[obj#12] : +- ReusedSubquery Subquery subquery240 +- (1) SerializeFromObject +- Scan[obj#12] ``` When the reuse is disabled, the plan looks like ``` == Physical Plan == CollectLimit 1 +- (1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299] : :- Subquery subquery286 : : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L]) : : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : : +- Scan[obj#12] : +- Subquery subquery287 : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298]) : +- Exchange SinglePartition : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L]) : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : +- Scan[obj#12] +- *(1) SerializeFromObject +- Scan[obj#12] ``` ## How was this patch tested? Modified the existing test. Closes #24258 from gatorsmile/followupSPARK-27279. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-05 08:31:41 -07:00
Gengliang Wang	568db94e0c	[SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema ## What changes were proposed in this pull request? In the current file source V2 framework, the schema of `FileScan` is not returned correctly if there are overlapping columns between `dataSchema` and `partitionSchema`. The actual schema should be `dataSchema - overlapSchema + partitionSchema`, which might have a different column order from the pushed-down `requiredSchema` in `SupportsPushDownRequiredColumns.pruneColumns`. For example, if the data schema is `[a: String, b: String, c: String]` and the partition schema is `[b: Int, d: Int]`, the result schema is `[a: String, b: Int, c: String, d: Int]` in the current `FileTable` and `HadoopFsRelation`, while the actual scan schema is `[a: String, c: String, b: Int, d: Int]` in `FileScan`. To fix the corner case, this PR proposes that the output schema of `FileTable` should be `dataSchema - overlapSchema + partitionSchema`, so that the column order is consistent with `FileScan`. Putting all the partition columns at the end of the table schema is more reasonable. ## How was this patch tested? Unit test. Closes #24284 from gengliangwang/FixReadSchema. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-05 13:34:46 +08:00
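The schema rule described in SPARK-27356 can be illustrated with a small sketch. This is not Spark code: `file_table_schema` and the `(name, type)` tuples are hypothetical stand-ins for `StructType` handling, and name matching is assumed to be case-sensitive for simplicity.

```python
from typing import List, Tuple

# (column name, type name); a simplified, hypothetical stand-in for a Spark StructField
Field = Tuple[str, str]


def file_table_schema(data_schema: List[Field], partition_schema: List[Field]) -> List[Field]:
    """Compute dataSchema - overlapSchema + partitionSchema, with partition columns last."""
    partition_names = {name for name, _ in partition_schema}
    # Drop data columns that also appear as partition columns; the partition column's type is kept.
    non_overlapping = [field for field in data_schema if field[0] not in partition_names]
    return non_overlapping + list(partition_schema)


# The example from the commit message above.
data_schema = [("a", "String"), ("b", "String"), ("c", "String")]
partition_schema = [("b", "Int"), ("d", "Int")]
result = file_table_schema(data_schema, partition_schema)
print(result)
```

With the inputs from the commit message, the overlapping column `b` moves to the end with its partition type, giving the column order `a, c, b, d` that the message attributes to `FileScan`.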
Aayushmaan Jain	04e53d2e3c	[SPARK-27342][SQL] Optimize Limit 0 queries ## What changes were proposed in this pull request? With this change, unnecessary file scans are avoided in case of Limit 0 queries. I added a case (rule) to `PropagateEmptyRelation` to replace `GlobalLimit 0` and `LocalLimit 0` nodes with an empty `LocalRelation`. This prunes the subtree under the Limit 0 node and further allows other rules of `PropagateEmptyRelation` to optimize the Logical Plan - while remaining semantically consistent with the Limit 0 query. For instance: Query: `SELECT * FROM table1 INNER JOIN (SELECT * FROM table2 LIMIT 0) AS table2 ON table1.id = table2.id` Optimized Plan without fix: ``` Join Inner, (id#79 = id#87) :- Filter isnotnull(id#79) : +- Relation[id#79,num1#80] parquet +- Filter isnotnull(id#87) +- GlobalLimit 0 +- LocalLimit 0 +- Relation[id#87,num2#88] parquet ``` Optimized Plan with fix: `LocalRelation <empty>, [id#75, num1#76, id#77, num2#78]` ## How was this patch tested? Added unit tests to verify Limit 0 optimization for: - Simple query containing Limit 0 - Inner Join, Left Outer Join, Right Outer Join, Full Outer Join queries containing Limit 0 as one of their children - Nested Inner Joins between 3 tables with one of them having a Limit 0 clause. - Intersect query wherein one of the subqueries was a Limit 0 query. Closes #24271 from aayushmaanjain/optimize-limit0. Authored-by: Aayushmaan Jain <aayushmaan.jain42@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-04 21:19:40 -07:00
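The idea behind the Limit 0 rule can be sketched on a toy plan tree. Everything below is hypothetical and much simpler than Catalyst: the node classes (`Relation`, `Empty`, `Limit`, `InnerJoin`) and `propagate_empty` are illustrative names, filters and join conditions are omitted, and only inner joins are handled.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    name: str
    output: Tuple[str, ...]


@dataclass(frozen=True)
class Empty:
    """Toy counterpart of an empty LocalRelation; it keeps only the output columns."""
    output: Tuple[str, ...]


@dataclass(frozen=True)
class Limit:
    n: int
    child: "Plan"


@dataclass(frozen=True)
class InnerJoin:
    left: "Plan"
    right: "Plan"


Plan = Union[Relation, Empty, Limit, InnerJoin]


def output_of(plan: Plan) -> Tuple[str, ...]:
    """Return the output column names of a toy plan node."""
    if isinstance(plan, (Relation, Empty)):
        return plan.output
    if isinstance(plan, Limit):
        return output_of(plan.child)
    return output_of(plan.left) + output_of(plan.right)


def propagate_empty(plan: Plan) -> Plan:
    """Replace Limit 0 with an empty relation and let emptiness flow up through inner joins."""
    if isinstance(plan, Limit):
        if plan.n == 0:
            # The whole subtree under Limit 0 is pruned; nothing below it needs to be scanned.
            return Empty(output_of(plan.child))
        child = propagate_empty(plan.child)
        return child if isinstance(child, Empty) else Limit(plan.n, child)
    if isinstance(plan, InnerJoin):
        left, right = propagate_empty(plan.left), propagate_empty(plan.right)
        # An inner join with an empty side produces no rows.
        if isinstance(left, Empty) or isinstance(right, Empty):
            return Empty(output_of(left) + output_of(right))
        return InnerJoin(left, right)
    return plan


query = InnerJoin(
    Relation("table1", ("id", "num1")),
    Limit(0, Relation("table2", ("id", "num2"))),
)
optimized = propagate_empty(query)
print(optimized)
```

For the join from the commit message, the toy rule collapses the whole plan to a single empty relation with columns `id, num1, id, num2`, mirroring the "Optimized Plan with fix" shown above.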
Ruben Fiszel	0e44a51f2e	[SPARK-24345][SQL] Improve ParseError stop location when offending symbol is a token In the case where the offending symbol is a CommonToken, this PR increases the accuracy of the start and stop origin by leveraging the start and stop index information from CommonToken. Closes #21334 from rubenfiszel/patch-1. Lead-authored-by: Ruben Fiszel <rubenfiszel@gmail.com> Co-authored-by: rubenfiszel <rfiszel@palantir.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-04 18:20:34 -05:00
Dongjoon Hyun	938d954375	[SPARK-27382][SQL][TEST] Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? Since Apache Spark 2.4.1 vote passed and is distributed into mirrors, we need to test 2.4.1. This should land on both `master` and `branch-2.4`. ## How was this patch tested? Pass the Jenkins. Closes #24292 from dongjoon-hyun/SPARK-27382. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-04 13:49:56 -07:00
Wenchen Fan	f7bd1ab586	[SPARK-26811][SQL][FOLLOWUP] some more document fixes ## What changes were proposed in this pull request? while working on https://github.com/apache/spark/pull/24129, I realized that I missed some document fixes in https://github.com/apache/spark/pull/24285. This PR covers all of them. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #24295 from cloud-fan/doc.	2019-04-05 01:07:08 +08:00
Yuming Wang	1d95dea307	[SPARK-27349][SQL] Dealing with TimeVars removed in Hive 2.x ## What changes were proposed in this pull request? `hive.stats.jdbc.timeout` and `hive.stats.retries.wait` were removed by [HIVE-12164](https://issues.apache.org/jira/browse/HIVE-12164). This PR deals with this change. ## How was this patch tested? unit tests Closes #24277 from wangyum/SPARK-27349. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-03 22:52:37 -07:00
Wenchen Fan	b56e433b54	[SPARK-27338][CORE][FOLLOWUP] remove trailing space ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it. ## How was this patch tested? N/A Closes #24289 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 11:43:20 +08:00
Wenchen Fan	5c50f68253	[SPARK-26811][SQL][FOLLOWUP] fix some documentation ## What changes were proposed in this pull request? It's a followup of https://github.com/apache/spark/pull/24012 , to fix 2 documentation: 1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now. 2. `Scan` should link the `BATCH_READ` instead of hardcoding it. ## How was this patch tested? N/A Closes #24285 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 10:31:27 +08:00
Venkata krishnan Sowrirajan	6c4552c650	[SPARK-27338][CORE] Fix deadlock in UnsafeExternalSorter.SpillableIterator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager ## What changes were proposed in this pull request? `UnsafeExternalSorter.SpillableIterator#loadNext()` takes a lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed, which needs to take a lock on `TaskMemoryManager`. At the same time, another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage`, which needs the lock on `TaskMemoryManager` and can cause a spill that in turn requires the lock on `UnsafeExternalSorter`, causing a deadlock. This is a classic deadlock situation similar to SPARK-26265. To fix this, we can move the `freePage` call in `loadNext` outside of the `synchronized` block, similar to the fix in SPARK-26265. ## How was this patch tested? Manual tests were done; will also try to add a test. Closes #24265 from venkata91/deadlock-sorter. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 09:58:05 +08:00
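The shape of the SPARK-27338 fix (do the bookkeeping under one lock, then release it before taking the second lock) can be sketched as follows. This is only an analogy in Python, not the Java code from the patch: `MemoryManager`, `SpillableIterator`, `free_page`, and `load_next` are hypothetical stand-ins for `TaskMemoryManager`, `UnsafeExternalSorter.SpillableIterator`, `freePage`, and `loadNext`.

```python
import threading


class MemoryManager:
    """Hypothetical stand-in for TaskMemoryManager; guards its state with its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.freed = []

    def free_page(self, page):
        with self.lock:
            self.freed.append(page)


class SpillableIterator:
    """Hypothetical stand-in for the iterator; `lock` plays the role of the sorter's monitor."""

    def __init__(self, manager, pages):
        self.lock = threading.Lock()
        self.manager = manager
        self.pages = list(pages)
        self.last_page = None

    def load_next(self):
        with self.lock:
            # Only bookkeeping happens while the iterator lock is held:
            # remember the consumed page and advance to the next one.
            page_to_free, self.last_page = self.last_page, None
            current = self.pages.pop(0) if self.pages else None
            self.last_page = current
        # The manager lock is taken only after the iterator lock is released,
        # so this thread never holds both locks at the same time.
        if page_to_free is not None:
            self.manager.free_page(page_to_free)
        return current


manager = MemoryManager()
iterator = SpillableIterator(manager, ["p1", "p2"])
loaded = [iterator.load_next() for _ in range(3)]
print(loaded, manager.freed)
```

Because `free_page` runs after the `with self.lock` block exits, a second thread that holds the manager lock and then needs the iterator lock cannot form a lock cycle with this method.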
LantaoJin	69dd44af19	[SPARK-27216][CORE] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue ## What changes were proposed in this pull request? HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap couldn't be ser/deser with unsafe KryoSerializer. It's a bug of RoaringBitmap-0.5.11 and fixed in latest version. This is an update of #24157 ## How was this patch tested? Add a UT Closes #24264 from LantaoJin/SPARK-27216. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Lantao Jin <jinlantao@gmail.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-04-03 20:09:50 -05:00
Dongjoon Hyun	b51763612a	Revert "[SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not" This reverts commit `5888b15d9c`.	2019-04-03 09:41:13 -07:00
Wenchen Fan	ffb362a705	[SPARK-19712][SQL][FOLLOW-UP] reduce code duplication ## What changes were proposed in this pull request? abstract some common code into a method. ## How was this patch tested? existing tests Closes #24281 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 00:37:57 +08:00

24450 commits