Commit graph

2537 commits

Author SHA1 Message Date
Kent Yao e21bb710e5 [SPARK-32991][SQL] Use conf in shared state as the original configuraion for RESET
### What changes were proposed in this pull request?

#### Case

The case here covers the behavior of static and dynamic SQL configs in `sharedState` and `sessionState`, and the specially handled config `spark.sql.warehouse.dir`.
the case can be found here - https://github.com/yaooqinn/sugar/blob/master/src/main/scala/com/netease/mammut/spark/training/sql/WarehouseSCBeforeSS.scala

```scala

import java.lang.reflect.Field

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object WarehouseSCBeforeSS extends App {

  val wh = "spark.sql.warehouse.dir"
  val td = "spark.sql.globalTempDatabase"
  val custom = "spark.sql.custom"

  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("SPARK-32991")
    .set(wh, "./data1")
    .set(td, "bob")

  val sc = new SparkContext(conf)

  val spark = SparkSession.builder()
    .config(wh, "./data2")
    .config(td, "alice")
    .config(custom, "kyao")
    .getOrCreate()

  val confField: Field = spark.sharedState.getClass.getDeclaredField("conf")
  confField.setAccessible(true)
  private val shared: SparkConf = confField.get(spark.sharedState).asInstanceOf[SparkConf]
  println()
  println(s"=====> SharedState: $wh=${shared.get(wh)}")
  println(s"=====> SharedState: $td=${shared.get(td)}")
  println(s"=====> SharedState: $custom=${shared.get(custom, "")}")

  println(s"=====> SessionState: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState: $custom=${spark.conf.get(custom, "")}")

  val spark2 = SparkSession.builder().config(td, "fred").getOrCreate()

  println(s"=====> SessionState 2: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 2: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 2: $custom=${spark2.conf.get(custom, "")}")

  SparkSession.setActiveSession(spark)
  spark.sql("RESET")

  println(s"=====> SessionState RESET: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState RESET: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState RESET: $custom=${spark.conf.get(custom, "")}")

  val spark3 = SparkSession.builder().getOrCreate()

  println(s"=====> SessionState 3: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 3: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 3: $custom=${spark2.conf.get(custom, "")}")
}
```

#### Outputs and analysis
```
// 1. Make the cloned spark conf in shared state respect the warehouse dir from the 1st SparkSession
//=====> SharedState: spark.sql.warehouse.dir=./data1
// 2. 
//=====> SharedState: spark.sql.globalTempDatabase=alice
//=====> SharedState: spark.sql.custom=kyao
//=====> SessionState: spark.sql.warehouse.dir=./data2
//=====> SessionState: spark.sql.globalTempDatabase=alice
//=====> SessionState: spark.sql.custom=kyao
//=====> SessionState 2: spark.sql.warehouse.dir=./data2
//=====> SessionState 2: spark.sql.globalTempDatabase=alice
//=====> SessionState 2: spark.sql.custom=kyao
// 2'.🔼 OK until here
// 3. Make the below 3 ones respect the cloned spark conf in shared state with issue 1 fixed
//=====> SessionState RESET: spark.sql.warehouse.dir=./data1
//=====> SessionState RESET: spark.sql.globalTempDatabase=bob
//=====> SessionState RESET: spark.sql.custom=
// 4. Then the SparkSessions created after RESET will be corrected.
//=====> SessionState 3: spark.sql.warehouse.dir=./data1
//=====> SessionState 3: spark.sql.globalTempDatabase=bob
//=====> SessionState 3: spark.sql.custom=
```

In this PR, we gather all valid configs into the cloned conf of `sharedState` while it is being constructed (in practice, only `spark.sql.warehouse.dir` was missing). Then we use this conf as the defaults for the `RESET` command.

`SparkSession.clearActiveSession`/`clearDefaultSession` will make the shared state invisible and unsharable. They will become internal-only soon (confirmed with Wenchen), so cases that call them will not be a problem.

### Why are the changes needed?

Bug fix for the programmatic API: calling RESET when users create a SparkContext first and configure the SparkSession later.

### Does this PR introduce _any_ user-facing change?

Yes. Before this change, when you use the programmatic API and call RESET, all configs are reset to `SparkContext.conf`; now they are reset to `SparkSession.sharedState.conf`.

### How was this patch tested?

new tests

Closes #30045 from yaooqinn/SPARK-32991.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-23 05:52:38 +00:00
Max Gekk a03d77d326 [SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key org.apache.spark.int96NoRebase by org.apache.spark.legacyINT96
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change the handling of the metadata key on read (sketched below):
    - If the key is absent from the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
    - If the parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebase mode for the INT96 type.
    - For files written by Spark >= 3.1.0, perform rebasing if `org.apache.spark.legacyINT96` is present in the metadata; otherwise don't.
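
Roughly, the read-side rule can be sketched as below. This is illustrative only, not the actual implementation; the parameters are hypothetical stand-ins for the writer metadata in the parquet footer and the SQL config, and the first bullet is interpreted here as "files carrying no Spark writer metadata at all".

```scala
// Hedged sketch of the read-side rebase decision described above (not Spark's code).
def int96RebaseModeInRead(
    writtenBySpark: Boolean,        // does the footer carry Spark writer metadata?
    writerOlderThan310: Boolean,    // was the writer Spark < 3.1.0?
    hasLegacyInt96Key: Boolean,     // is org.apache.spark.legacyINT96 present?
    confMode: String                // spark.sql.legacy.parquet.int96RebaseModeInRead
  ): String = {
  if (!writtenBySpark) confMode            // no Spark metadata: fall back to the SQL config
  else if (writerOlderThan310) "LEGACY"    // old writers always rebased INT96
  else if (hasLegacyInt96Key) "LEGACY"     // key saved => file written in LEGACY mode
  else "CORRECTED"                         // newer writer without the key: no rebasing
}
```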

### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified test in `ParquetIOSuite`

Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 15:57:03 +00:00
angerszhu a1629b4a57 [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location
### What changes were proposed in this pull request?
Support HDFS locations for `spark.sql.hive.metastore.jars`.

When users need a path-based setup for the Hive metastore jars, they should set `spark.sql.hive.metastore.jars=path` and put the real paths in `spark.sql.hive.metastore.jars.path`.
Since we use `File.pathSeparator` to split paths, and `File.pathSeparator` is `:` on Unix, it would split an HDFS location such as `hdfs://nameservice/xx`. So the new config `spark.sql.hive.metastore.jars.path` is added to take comma-separated paths, keeping both ways supported.
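
For illustration, a hedged sketch of the new setup (the metastore version and the HDFS jar paths below are made up):

```scala
import org.apache.spark.sql.SparkSession

// Set the main config to "path" and list the actual jar locations,
// comma separated, in the new spark.sql.hive.metastore.jars.path config.
val spark = SparkSession.builder()
  .appName("hdfs-metastore-jars-demo")
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path",
    "hdfs://nameservice/libs/hive/hive-metastore-2.3.7.jar," +
    "hdfs://nameservice/libs/hive/hive-exec-2.3.7.jar")
  .enableHiveSupport()
  .getOrCreate()
```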

### Why are the changes needed?
All Spark applications can fetch an internal version of the Hive jars from an HDFS location, with no need to distribute them to every node.

### Does this PR introduce _any_ user-facing change?
Users can use an HDFS location to store the Hive metastore jars.

### How was this patch tested?
Manually tested.

Closes #29881 from AngersZhuuuu/SPARK-32852.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 13:53:01 +00:00
Chao Sun cb3fa6c936 [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?

This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.

In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:

```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```

which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.

Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?

This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. The latest Hadoop versions have upgraded to use Guava 27+, and in order to adopt them in Spark we need to resolve the Guava conflicts. This PR takes the approach of switching to the shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.

### Does this PR introduce _any_ user-facing change?

When people use Spark with the `hadoop-provided` option, they should make sure the classpath contains the `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the classpath order. Otherwise, classes may be loaded from the other, non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?

Relying on existing tests.

Closes #29843 from sunchao/SPARK-29250.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-10-22 03:21:34 +00:00
Max Gekk ba13b94f6b [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.

### Why are the changes needed?
Current default value `LEGACY` may lead to shifting timestamps in read or in write. We should leave the decision about rebasing to users.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites like `ParquetIOSuite`.

Closes #30121 from MaxGekk/int96-exception-by-default.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 03:04:29 +00:00
Nan Zhu 35133901f7 [SPARK-32351][SQL] Show partially pushed down partition filters in explain()
### What changes were proposed in this pull request?

Currently, actual non-dynamic partition pruning is executed in the optimizer phase (PruneFileSourcePartitions) if an input relation has a catalog file index. The current code assumes the same partition filters are generated again in FileSourceStrategy and passed into FileSourceScanExec. FileSourceScanExec uses the partition filters when listing files, but these non-dynamic partition filters do nothing because unnecessary partitions are already pruned in advance, so the filters are mainly used for explain output in this case. If a WHERE clause has DNF-ed predicates, FileSourceStrategy cannot extract the same filters with PruneFileSourcePartitions and then PartitionFilters is not shown in explain output.

This patch proposes to extract partition filters in FileSourceStrategy and HiveStrategy with `extractPredicatesWithinOutputSet` (added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c) again, so that the partially pushed-down partition filters show up in explain().

### Why are the changes needed?

without the patch, the explained plan is inconsistent with what is actually executed

<b>Without the change</b>, the explained plans of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and Hive tables look like the following, respectively (missing the pushed-down partition filters):

```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int>
```

```
   == Physical Plan ==
   *(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1)))
   +- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32], Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]]
```

<b>With the change</b>, the plans look like the following (the actually executed partition filters are shown):

```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [], ReadSchema: struct<i:int>
```

```
== Physical Plan ==
*(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1)))
+- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36], Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1) OR (p#37 = 2))]
```

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
Unit test.

Closes #29831 from CodingCat/SPARK-32351.

Lead-authored-by: Nan Zhu <nanzhu@uber.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 11:13:16 +09:00
angerszhu f8277d3aa3 [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory
### What changes were proposed in this pull request?
Improve error message on reading unexpected directory

### Why are the changes needed?
Improve error message on reading unexpected directory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ut

Closes #30027 from AngersZhuuuu/SPARK-32069.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-18 19:02:21 -07:00
xuewei.linxuewei 306872eefa [SPARK-33139][SQL] protect setActiveSession and clearActiveSession
### What changes were proposed in this pull request?

This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and the SparkSession context by calling setActiveSession and clearActiveSession.

Changes in this PR (see the illustration after this list):

* add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users do need to call these two APIs
* by default, calling either API throws an exception
* add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage
* change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive
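
A hedged illustration of the behavior change (the config name comes from the list above; the exact exception type thrown is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("demo").getOrCreate()

// After this change, calling the public API directly is expected to throw by default.
// Setting spark.sql.legacy.allowModifyActiveSession=true (e.g. via --conf at launch)
// falls back to the old behavior.
SparkSession.setActiveSession(spark)
```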

### Why are the changes needed?

Make SQLConf.get reliable and stable.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test

Closes #30042 from leanken/leanken-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 06:05:17 +00:00
Dongjoon Hyun ec34a001ad [SPARK-33153][SQL][TESTS] Ignore Spark 2.4 in HiveExternalCatalogVersionsSuite on Python 3.8/3.9
### What changes were proposed in this pull request?

This PR aims to ignore Apache Spark 2.4.x distribution in HiveExternalCatalogVersionsSuite if Python version is 3.8 or 3.9.

### Why are the changes needed?

Currently, `HiveExternalCatalogVersionsSuite` is broken on the latest OS like `Ubuntu 20.04` because its default Python version is 3.8. PySpark 2.4.x doesn't work on Python 3.8 due to SPARK-29536.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.
```
$ python3 --version
Python 3.8.5

$ build/sbt "hive/testOnly *.HiveExternalCatalogVersionsSuite"
...
[info] All tests passed.
[info] Passed: Total 1, Failed 0, Errors 0, Passed 1
```

Closes #30044 from dongjoon-hyun/SPARK-33153.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-14 20:48:13 -07:00
xuewei.linxuewei dc697a8b59 [SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero
### What changes were proposed in this pull request?

As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer expects null, whereas SparkSQL returns Double.NaN, which causes the wrong result.

Add an extra legacy config to fall back to the NaN behavior, and return null by default to align with the TPCDS standard.

### Why are the changes needed?

SQL correctness issue.

### Does this PR introduce any user-facing change?
Yes. See sql-migration-guide

In Spark 3.1, the statistical aggregation functions, including `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, and `corr`, return `NULL` instead of `Double.NaN` when a divide-by-zero occurs during expression evaluation, for example when `stddev_samp` is applied to a single-element set. In Spark 3.0 and earlier, they return `Double.NaN` in such cases. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
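
A minimal illustration of the new default, assuming a SparkSession named `spark`:

```scala
// stddev_samp over a single-element group: with this change the result is expected
// to be NULL; with spark.sql.legacy.statisticalAggregate=true it is Double.NaN as before.
spark.sql("SELECT stddev_samp(col) FROM VALUES (1.0) AS t(col)").show()
```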

### How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both the default and the legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R cases to the default return-null behavior.

Closes #29983 from leanken/leanken-SPARK-13860.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 13:21:45 +00:00
Yuming Wang 543d59dfbf [SPARK-33107][BUILD][FOLLOW-UP] Remove com.twitter:parquet-hadoop-bundle:1.6.0 and orc.classifier
### What changes were proposed in this pull request?

This pr removes `com.twitter:parquet-hadoop-bundle:1.6.0` and `orc.classifier`.

### Why are the changes needed?

To make code more clear and readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test.

Closes #30005 from wangyum/SPARK-33107.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-11 21:54:56 -07:00
Yuming Wang 5e170140b0 [SPARK-33107][SQL] Remove hive-2.3 workaround code
### What changes were proposed in this pull request?

This PR removes the `hive-2.3` workaround code.

### Why are the changes needed?

Make code more clear and readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #29996 from wangyum/SPARK-33107.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-10 16:41:42 -07:00
HyukjinKwon 2e07ed3041 [SPARK-33082][SPARK-20202][BUILD][SQL][FOLLOW-UP] Remove Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script
### What changes were proposed in this pull request?

This PR removes the leftover of Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script.

- `test-hive1.2` title is not used anymore in Jenkins
- Remove some comments related to Hive 1.2
- Remove unused codes in `OrcFilters.scala`  Hive
- Test `spark.sql.hive.convertMetastoreOrc` disabled case for the tests added at SPARK-19809 and SPARK-22267

### Why are the changes needed?

To remove unused codes & improve test coverage

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually ran the unit tests. Also It will be tested in CI in this PR.

Closes #29973 from HyukjinKwon/SPARK-33082-SPARK-20202.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-09 03:04:26 -07:00
Max Gekk c5f6af9f17 [SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system
### What changes were proposed in this pull request?
Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource.

### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("orc").options(conf).load(path)
```
The underlying file system will not receive the conf options.
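
For example (a hedged sketch; the Hadoop option key and path are only illustrative, and `spark` is assumed to be an existing SparkSession):

```scala
// With the fix, Hadoop settings passed as ORC datasource options are propagated
// to the underlying FileSystem used to read the files.
val df = spark.read
  .format("orc")
  .option("fs.s3a.connection.maximum", "100")   // a Hadoop conf key, not an ORC option
  .load("s3a://my-bucket/warehouse/events")
df.show()
```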

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `OrcSourceSuite`.

Closes #29976 from MaxGekk/orc-option-propagation.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-08 11:59:30 -07:00
Dongjoon Hyun a127387a53 [SPARK-33082][SQL] Remove hive-1.2 workaround code
### What changes were proposed in this pull request?

This PR removes old Hive-1.2 profile related workaround code.

### Why are the changes needed?

To simplify the code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #29961 from dongjoon-hyun/SPARK-HIVE12.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-07 12:27:23 -07:00
Terry Kim 7e99fcd64e [SPARK-33004][SQL] Migrate DESCRIBE column to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `DESCRIBE tbl colname` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE testcat.ns")
sql("DESCRIBE t i") // 't' is resolved to testcat.ns.t

Describing columns is not supported for v2 tables.;
org.apache.spark.sql.AnalysisException: Describing columns is not supported for v2 tables.;
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE spark_catalog.test")
sql("DESCRIBE t i").show // 't' is resolved to a temp view

+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|         i|
|data_type|       int|
|  comment|      NULL|
+---------+----------+
```

### Does this PR introduce _any_ user-facing change?

After this PR, `DESCRIBE t i` is resolved to a temp view `t` instead of `testcat.ns.t`.

### How was this patch tested?

Added a new test

Closes #29880 from imback82/describe_column_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-07 06:33:20 +00:00
Wenchen Fan ec6fccb922 [SPARK-32243][SQL][FOLLOWUP] Fix compilation in HiveSessionCatalog
Fix a mistake when merging https://github.com/apache/spark/pull/29054

Closes #29955 from cloud-fan/hot-fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 14:33:34 +00:00
angerszhu ddc7012b3d [SPARK-32243][SQL] HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number error
### What changes were proposed in this pull request?
When we create a UDAF using a class that extends `UserDefinedAggregateFunction` and then call the function in Hive support mode, `HiveSessionCatalog` calls `super.makeFunctionExpression`.

However, it catches errors such as "the function needs 2 parameters but only 1 was given", and the thrown exception only shows:
```
No handler for UDF/UDAF/UDTF xxxxxxxx
```
This is confusing for developers; we should show the error thrown by the super method too.

For this PR's UT:
Before the change, the thrown exception looks like:
```
No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```
After this PR, the thrown exception is:
```
Spark UDAF Error: Invalid number of arguments for function longProductSum. Expected: 2; Found: 1;
Hive UDF/UDAF/UDTF Error: No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```
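
For context, a hedged sketch of the scenario (the UDAF class name is taken from the error messages above; the registration DDL and the query are illustrative):

```scala
// Register a Spark UDAF class as a permanent function in Hive support mode,
// then call it with the wrong number of arguments.
spark.sql(
  "CREATE FUNCTION longProductSum AS 'org.apache.spark.sql.hive.execution.LongProductSum'")

// Previously only the generic "No handler for UDF/UDAF/UDTF" message was shown;
// with this change the Spark UDAF arity error is reported as well.
spark.sql("SELECT longProductSum(a) FROM VALUES (1L, 2L) AS t(a, b)").show()
```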

### Why are the changes needed?
Show a more detailed error message when defining a UDAF.

### Does this PR introduce _any_ user-facing change?
Users will see a more detailed error message when using Spark SQL UDAFs in Hive support mode.

### How was this patch tested?
Added UT

Closes #29054 from AngersZhuuuu/SPARK-32243.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 09:09:19 +00:00
Dongjoon Hyun 008a2ad1f8 [SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1)
### What changes were proposed in this pull request?

As of today,
- SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removes the direct dependency to the forked Hive 1.2.1 in maven repository.
- SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions.

This PR(SPARK-20202) aims to remove the following usage of unofficial Apache Hive fork completely from Apache Spark master for Apache Spark 3.1.0.
```
<hive.group>org.spark-project.hive</hive.group>
<hive.version>1.2.1.spark2</hive.version>
```

For the forked Hive 1.2.1.spark2 users, Apache Spark 2.4(LTS) and 3.0 (~ 2021.12) will provide it.

### Why are the changes needed?

- First, Apache Spark community should not use the unofficial forked release of another Apache project.
- Second, Apache Hive 1.2.1 was released on 2015-06-26, and the forked Hive `1.2.1.spark2` exposed many unfixable bugs in Apache Spark because the fork is not maintained at all. Apache Hive 2.3.0 was released on 2017-07-19 and has been used with fewer bugs compared with `1.2.1.spark2`. Many bugs still exist in the `hive-1.2` profile, and new Apache Spark unit tests have been added with the `HiveUtils.isHive23` condition so far.

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`.

### How was this patch tested?

1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366)
2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382)
3. SBT/Hadoop 3.2/Hive 1.2 (This has not been supported already due to Hive 1.2 doesn't work with Hadoop 3.2.)
4. SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, This is rejected)

Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-05 15:29:56 -07:00
Cheng Su d6f3138352 [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically
### What changes were proposed in this pull request?

This PR adds support for deciding bucketed table scan dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan we use a bucketed table scan (all input files of a bucket are read by the same task). The drawback is that if the bucketed table scan brings no benefit at all (no join/group-by/etc. in the query), we don't need it, since it restricts the # of tasks to the # of buckets and might hurt parallelism.

The feature is to add a physical plan rule right after `EnsureRequirements`:

The rule goes through the plan nodes. For every operator which has an "interesting partition" (i.e., requires `ClusteredDistribution` or `HashClusteredDistribution`), check whether the sub-plan for the operator has an `Exchange` and a bucketed table scan (and only allow certain operators in the plan, i.e. `Scan/Filter/Project/Sort/PartialAgg/etc.`; see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, disable the bucketed table scan in the sub-plan. In addition, the bucketed table scan is also disabled if there is no operator with an interesting partition along its sub-plan at all.

Why the algorithm works is that if there's a shuffle between the bucketed table scan and operator with interesting partition, then bucketed table scan partitioning will be destroyed by the shuffle operator in the middle, and we don't need bucketed table scan for sure.

The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan .

### Why are the changes needed?

To avoid unnecessary bucketed scan in the query, and this is prerequisite for https://github.com/apache/spark/pull/29625 (decide bucketed sorted scan dynamically will be added later in that PR).

### Does this PR introduce _any_ user-facing change?

A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, which is set to false by default (the rule is disabled by default as it can regress cached bucketed table queries, see discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). Users can opt in/out by enabling/disabling the config. As we found in prod, some users rely on the assumption that # of tasks == # of buckets when reading a bucketed table to precisely control the # of tasks. This is a bad assumption but it does happen on our side, so we leave a config here to let them opt out of the feature.
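
For reference, a hedged example of opting in to the rule (the table name is made up and `spark` is assumed to be an existing SparkSession):

```scala
// Enable the new physical rule for this session.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")

// A plain scan/filter over a bucketed table has no join/aggregation that benefits
// from bucketing, so the rule is expected to fall back to a regular (non-bucketed)
// scan instead of pinning the number of tasks to the number of buckets.
spark.table("events_bucketed").where("id > 0").explain()
```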

### How was this patch tested?

Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`

Closes #29804 from c21/bucket-rule.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-02 09:01:15 +09:00
Yuanjian Li 9e6882feca [SPARK-32885][SS] Add DataStreamReader.table API
### What changes were proposed in this pull request?
This pr aims to add a new `table` API in DataStreamReader, which is similar to the table API in DataFrameReader.

### Why are the changes needed?
Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:

Application 1 for initializing and starting the streaming job:

```
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"

// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)

// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
```

Application 2 for appending new data:

```
// Append new data into the path
spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save()
```

Check result:
```
// The destination directory should contain all written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
```

### Does this PR introduce _any_ user-facing change?
Yes, a new API added.

### How was this patch tested?
New UT added and integrated testing.

Closes #29756 from xuanyuanking/SPARK-32885.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 06:50:24 +00:00
ulysses f2fc966674 [SPARK-32877][SQL][TEST] Add test for Hive UDF complex decimal type
### What changes were proposed in this pull request?

Add test to cover Hive UDF whose input contains complex decimal type.
Add comment to explain why we can't make `HiveSimpleUDF` extend `ImplicitTypeCasts`.

### Why are the changes needed?

For better test coverage of the Hive behavior we are, or are not, compatible with.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #29863 from ulysses-you/SPARK-32877-test.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-24 22:16:05 -07:00
Dongjoon Hyun d7aa3b56e8 [SPARK-32889][SQL][TESTS][FOLLOWUP] Skip special column names test in Hive 1.2
### What changes were proposed in this pull request?

This PR is a followup of SPARK-32889 in order to ignore the special column names test in `hive-1.2` profile.

### Why are the changes needed?

Hive 1.2 is too old to support special column names because it doesn't use Apache ORC. This will recover our `hive-1.2` Jenkins job.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the test with Hive 1.2 profile.

Closes #29867 from dongjoon-hyun/SPARK-32889-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-24 16:22:08 -07:00
angerszhu c336ddfdb8 [SPARK-32867][SQL] When explain, HiveTableRelation show limited message
### What changes were proposed in this pull request?
Currently, when explaining a SQL plan with a HiveTableRelation, it shows a lot of information about the HiveTableRelation's pruned partitions, which makes the plan hard to read. This PR makes that information simpler.

Before:
![image](https://user-images.githubusercontent.com/46485123/93012078-aeeca080-f5cf-11ea-9286-f5c15eadbee3.png)

For UT
```
 test("Make HiveTableScanExec message simple") {
  withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
      withTable("df") {
        spark.range(30)
          .select(col("id"), col("id").as("k"))
          .write
          .partitionBy("k")
          .format("hive")
          .mode("overwrite")
          .saveAsTable("df")

        val df = sql("SELECT df.id, df.k FROM df WHERE df.k < 2")
        df.explain(true)
      }
    }
  }
```

After this PR, the plan will show:
```
== Parsed Logical Plan ==
'Project ['df.id, 'df.k]
+- 'Filter ('df.k < 2)
   +- 'UnresolvedRelation [df], []

== Analyzed Logical Plan ==
id: bigint, k: bigint
Project [id#11L, k#12L]
+- Filter (k#12L < cast(2 as bigint))
   +- SubqueryAlias spark_catalog.default.df
      +- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L]]

== Optimized Logical Plan ==
Filter (isnotnull(k#12L) AND (k#12L < 2))
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]]

== Physical Plan ==
Scan hive default.df [id#11L, k#12L], HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]], [isnotnull(k#12L), (k#12L < 2)]

```

In this PR, I implement `HiveTableRelation`'s `simpleString` method to avoid showing too much unnecessary info in the explain plan. Compared to what we had before, I drop the detailed metadata of each partition and only retain the partition spec to show which partitions were pruned. For detailed information we don't rely on the plan anyway, but use the DESC EXTENDED statement.

### Why are the changes needed?
Make plan about HiveTableRelation more readable

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #29739 from AngersZhuuuu/HiveTableScan-meta-location-info.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-21 09:15:12 +00:00
William Hyun 7892887981 [SPARK-32930][CORE] Replace deprecated isFile/isDirectory methods
### What changes were proposed in this pull request?

This PR aims to replace deprecated `isFile` and `isDirectory` methods.

```diff
- fs.isDirectory(hadoopPath)
+ fs.getFileStatus(hadoopPath).isDirectory
```

```diff
- fs.isFile(new Path(inProgressLog))
+ fs.getFileStatus(new Path(inProgressLog)).isFile
```

### Why are the changes needed?

It shows deprecation warnings.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/1244/consoleFull

```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:815: method isFile in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn]             if (!fs.isFile(new Path(inProgressLog))) {
```

```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/SparkContext.scala:1884: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn]           if (fs.isDirectory(hadoopPath)) {
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #29796 from williamhyun/filesystem.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-18 18:13:11 +09:00
jzc ea3b979e95 [SPARK-32889][SQL] orc table column name supports special characters
### What changes were proposed in this pull request?
Make ORC table column names support special characters like `$`.

### Why are the changes needed?
Special characters like `$` are allowed in ORC table column names by Hive.
But executing the command "CREATE TABLE tbl(`$` INT, b INT) using orc" in Spark raises an error, so it's not compatible with Hive.

`Column name "$" contains invalid character(s). Please use alias to rename it.;Column name "$" contains invalid character(s). Please use alias to rename it.;org.apache.spark.sql.AnalysisException: Column name "$" contains invalid character(s). Please use alias to rename it.;
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.checkFieldName(OrcFileFormat.scala:51)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1(OrcFileFormat.scala:59)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1$adapted(OrcFileFormat.scala:59)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) `
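
With the change, the DDL from the description is expected to succeed instead of raising the exception above, for example:

```scala
// Previously failed with: Column name "$" contains invalid character(s).
spark.sql("CREATE TABLE tbl(`$` INT, b INT) USING orc")
```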

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add unit test

Closes #29761 from jzc928/orcColSpecialChar.

Authored-by: jzc <jzc@jzcMacBookPro.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 14:50:47 -07:00
sychen 92b75dc260 [SPARK-32508][SQL] Disallow empty part col values in partition spec before static partition writing
### What changes were proposed in this pull request?
When writing to a static partition, check up front whether any partition field value is empty.

### Why are the changes needed?
Currently, when writing to a static partition whose partition field value is empty, the error is only reported after all tasks have completed.
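
A hypothetical illustration of the case the check targets (table and column names are made up, and whether every table format is covered is an assumption):

```scala
// A partitioned table and an INSERT into a static partition whose value is empty.
spark.sql("CREATE TABLE logs (id INT, p STRING) USING parquet PARTITIONED BY (p)")

// With this change, the empty static partition value is expected to be rejected
// up front, instead of erroring only after all write tasks have completed.
spark.sql("INSERT OVERWRITE TABLE logs PARTITION (p = '') SELECT 1")
```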

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add ut

Closes #29316 from cxzl25/SPARK-32508.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-17 06:50:30 +00:00
yangjie01 7fdb571963 [SPARK-32890][SQL] Pass all sql/hive module UTs in Scala 2.13
### What changes were proposed in this pull request?
This PR fixes the failed cases in the sql/hive module in Scala 2.13, as follows:

- HiveSchemaInferenceSuite (1 FAILED -> PASS)
- HiveSparkSubmitSuite (1 FAILED -> PASS)
- StatisticsSuite (1 FAILED -> PASS)
- HiveDDLSuite (1 FAILED -> PASS)

After this patch, all tests in the sql/hive module pass in Scala 2.13.

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: All tests passed.

Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/hive -am -Pscala-2.13 -Phive
mvn clean test -pl sql/hive -Pscala-2.13 -Phive
```

**Before**

```
Tests: succeeded 3662, failed 4, canceled 0, ignored 601, pending 0
*** 4 TESTS FAILED ***
```

**After**

```
Tests: succeeded 3666, failed 0, canceled 0, ignored 601, pending 0
All tests passed.
```

Closes #29760 from LuciferYang/sql-hive-test.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-16 13:42:04 -05:00
herman c8baab1a1f [SPARK-32879][SQL] Refactor SparkSession initial options
### What changes were proposed in this pull request?
This PR refactors the way we propagate the options from the `SparkSession.Builder` to the `SessionState`. This is currently done via a mutable map inside the SparkSession, and the settings are then applied **after** the `SessionState` has been constructed. This is a bit confusing when you expect something to be set while constructing the `SessionState`. This PR passes the options as a constructor parameter to the `SessionStateBuilder`, which sets the options when the configuration is created.

### Why are the changes needed?
It makes it easier to reason about the configurations set in a SessionState than before. We recently had an incident where someone was using `SparkSessionExtensions` to create a planner rule that relied on a conf to be set. While this is in itself probably incorrect usage, it still illustrated this somewhat funky behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #29752 from hvanhovell/SPARK-32879.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-15 06:24:54 +00:00
HyukjinKwon 0696f04672 [SPARK-32876][SQL] Change default fallback versions to 3.0.1 and 2.4.7 in HiveExternalCatalogVersionsSuite
### What changes were proposed in this pull request?

The Jenkins job fails to get the versions. This was fixed by adding temporary fallbacks at https://github.com/apache/spark/pull/28536.
This still doesn't work without the temporary fallbacks. See https://github.com/apache/spark/pull/29694

This PR adds new fallbacks since 2.3 is EOL and Spark 3.0.1 and 2.4.7 are released.

### Why are the changes needed?

To test correctly in Jenkins.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Jenkins and GitHub Actions builds should test.

Closes #29748 from HyukjinKwon/SPARK-32876.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-14 13:54:21 -07:00
sandeep.katta 2009f95340 [SPARK-32779][SQL][FOLLOW-UP] Delete Unused code
### What changes were proposed in this pull request?
Follow-up PR as per the review comments in [29649](8d45542e91 (r487140171))

### Why are the changes needed?
Delete the unused code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #29736 from sandeep-katta/deadlockfollowup.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-12 13:22:54 -07:00
Takeshi Yamamuro 4269c2c252 [SPARK-32851][SQL][TEST] Tests should fail if errors happen when generating projection code
### What changes were proposed in this pull request?

This PR intends to set `CODEGEN_ONLY` at `CODEGEN_FACTORY_MODE` in test spark context so that tests can fail if errors happen when generating expr code.

### Why are the changes needed?

I noticed that the code generation of `SafeProjection` failed in an existing test (https://issues.apache.org/jira/browse/SPARK-32828), but the test passed because `FALLBACK` was set at `CODEGEN_FACTORY_MODE` (by default) in `SharedSparkSession`. To become aware of these failures quickly, I think it's worth setting `CODEGEN_ONLY` at `CODEGEN_FACTORY_MODE`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29721 from maropu/ExprCodegenTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-12 08:42:07 +09:00
Wenchen Fan f7995c576a Revert "[SPARK-32677][SQL] Load function resource before create"
This reverts commit 05fcf26b79.
2020-09-09 18:15:22 +00:00
sandeep.katta b0322bf05a [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock
### What changes were proposed in this pull request?

There is no need to use the database name in the `loadPartition` API of `Shim_v3_0` to get the Hive table; Hive has an overloaded method that returns the table by table name alone. By using this API, the dependency on `SessionCatalog` can be removed from the shim layer.

### Why are the changes needed?
To avoid deadlock when communicating with Hive metastore 3.1.x
```
Found one Java-level deadlock:
=============================
"worker3":
  waiting to lock monitor 0x00007faf0be602b8 (object 0x00000007858f85f0, a org.apache.spark.sql.hive.HiveSessionCatalog),
  which is held by "worker0"
"worker0":
  waiting to lock monitor 0x00007faf0be5fc88 (object 0x0000000785c15c80, a org.apache.spark.sql.hive.HiveExternalCatalog),
  which is held by "worker3"

Java stack information for the threads listed above:
===================================================
"worker3":
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCurrentDatabase(SessionCatalog.scala:256)
  - waiting to lock <0x00000007858f85f0> (a org.apache.spark.sql.hive.HiveSessionCatalog)
  at org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1332)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:870)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$4459/1387095575.apply$mcV$sp(Unknown Source)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$2227/313239499.apply(Unknown Source)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
  - locked <0x0000000785ef9d78> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:860)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:911)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$4457/2037578495.apply$mcV$sp(Unknown Source)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  - locked <0x0000000785c15c80> (a org.apache.spark.sql.hive.HiveExternalCatalog)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:512)
  at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:383)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  - locked <0x00000007b1690ff8> (a org.apache.spark.sql.execution.command.ExecutedCommandExec)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$$Lambda$2084/428667685.apply(Unknown Source)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset$$Lambda$2085/559530590.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2093/139449177.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2086/1088974677.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.Dataset$$$Lambda$1959/1977822284.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
  at org.apache.spark.sql.SparkSession$$Lambda$1899/424830920.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
  at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1.run(<console>:45)
  at java.lang.Thread.run(Thread.java:748)
"worker0":
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  - waiting to lock <0x0000000785c15c80> (a org.apache.spark.sql.hive.HiveExternalCatalog)
  at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:851)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:146)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:432)
  - locked <0x00000007858f85f0> (a org.apache.spark.sql.hive.HiveSessionCatalog)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:509)
  at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:383)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  - locked <0x00000007b529af58> (a org.apache.spark.sql.execution.command.ExecutedCommandExec)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$$Lambda$2084/428667685.apply(Unknown Source)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset$$Lambda$2085/559530590.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2093/139449177.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2086/1088974677.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.Dataset$$$Lambda$1959/1977822284.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
  at org.apache.spark.sql.SparkSession$$Lambda$1899/424830920.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
  at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1.run(<console>:45)
  at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by executing the script below in spark-shell; no deadlock was found.

launch spark-shell using ./bin/spark-shell --conf "spark.sql.hive.metastore.jars=maven" --conf spark.sql.hive.metastore.version=3.1 --conf spark.hadoop.datanucleus.schema.autoCreateAll=true

**code**
```
def testHiveDeadLock = {
      import scala.collection.mutable.ArrayBuffer
      import scala.util.Random
      println("test hive DeadLock")
      spark.sql("drop database if exists testDeadLock cascade")
      spark.sql("create database testDeadLock")
      spark.sql("use testDeadLock")
      val tableCount = 100
      val tableNamePrefix = "testdeadlock"
      for (i <- 0 until tableCount) {
        val tableName = s"$tableNamePrefix${i + 1}"
        spark.sql(s"drop table if exists $tableName")
        spark.sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored as orc")
      }

      val threads = new ArrayBuffer[Thread]
      for (i <- 0 until tableCount) {
        threads.append(new Thread( new Runnable {
          override def run: Unit = {
            val tableName = s"$tableNamePrefix${i + 1}"
            val rand = Random
            val df = spark.range(0, 20000).toDF("a")
            val location = s"/tmp/${rand.nextLong.abs}"
            df.write.mode("overwrite").orc(location)
            spark.sql(
              s"""
        LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition (b=$i)""")
          }
        }, s"worker$i"))
        threads(i).start()
      }

      for (i <- 0 until tableCount) {
        println(s"Joining with thread $i")
        threads(i).join()
      }
      for (i <- 0 until tableCount) {
        val tableName = s"$tableNamePrefix${i + 1}"
        spark.sql(s"select count(*) from $tableName").show(false)
      }
      println("All done")
    }

    for(i <- 0 until 100) {
      testHiveDeadLock
      println(s"completed {$i}th iteration")
    }
```

Closes #29649 from sandeep-katta/metastore3.1DeadLock.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-07 15:10:33 +09:00
ulysses 05fcf26b79 [SPARK-32677][SQL] Load function resource before create
### What changes were proposed in this pull request?

Change `CreateFunctionCommand` to add a class check before creating the function.

### Why are the changes needed?

We have different behavior between creating a permanent function and a temporary function when the function class is invalid, e.g.:
```
create function f as 'test.non.exists.udf';
-- Time taken: 0.104 seconds

create temporary function f as 'test.non.exists.udf'
-- Error in query: Can not load class 'test.non.exists.udf' when registering the function 'f', please make sure it is on the classpath;
```

And Hive also fails both of them.

### Does this PR introduce _any_ user-facing change?

Yes, users will get an exception when creating a function with an invalid class.

### How was this patch tested?

New test.

Closes #29502 from ulysses-you/function.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 06:00:23 +00:00
angerszhu 55ce49ed28 [SPARK-32400][SQL][TEST][FOLLOWUP][TEST-MAVEN] Fix resource loading error in HiveScripTransformationSuite
### What changes were proposed in this pull request?
#29401 moved `test_script.py` from the sql/hive module to the sql/core module, which causes a resource loading issue in `HiveScriptTransformationSuite`.

### Why are the changes needed?
This issue causes the Jenkins Maven tests to fail:

spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/
spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/
spark-master-test-maven-hadoop-3.2-hive-2.3:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3/
![image](https://user-images.githubusercontent.com/46485123/91681585-71285a80-eb81-11ea-8519-99fc9783d6b9.png)

![image](https://user-images.githubusercontent.com/46485123/91681010-aaf86180-eb7f-11ea-8dbb-61365a3b0ab4.png)

The error is as follows:
```
 Exception thrown while executing Spark plan:
 HiveScriptTransformation [a#349299, b#349300, c#349301, d#349302, e#349303], python /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py, [a#349309, b#349310, c#349311, d#349312, e#349313], ScriptTransformationIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
+- Project [_1#349288 AS a#349299, _2#349289 AS b#349300, _3#349290 AS c#349301, _4#349291 AS d#349302, _5#349292 AS e#349303]
   +- LocalTableScan [_1#349288, _2#349289, _3#349290, _4#349291, _5#349292]

 == Exception ==
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18021.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18021.0 (TID 37324) (192.168.10.31 executor driver): org.apache.spark.SparkException: Subprocess exited with status 2. Error: python: can't open file '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py': [Errno 2] No such file or directory

 at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate(BaseScriptTransformationExec.scala:180)
 at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate$(BaseScriptTransformationExec.scala:157)
 at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec.checkFailureAndPropagate(HiveScriptTransformationExec.scala:49)
 at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec$$anon$1.hasNext(HiveScriptTransformationExec.scala:110)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
 at o
```
### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Existing UTs.

Closes #29588 from AngersZhuuuu/SPARK-32400-FOLLOWUP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-02 18:27:29 +09:00
Huaxin Gao 806140de40 [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
### What changes were proposed in this pull request?
Pass the options specified in `DataFrameReader.table` to `JDBCTableCatalog.loadTable`.

### Why are the changes needed?
Currently, `DataFrameReader.table` ignores the specified options. Options specified like the following are lost:
```
    val df = spark.read
      .option("partitionColumn", "id")
      .option("lowerBound", "0")
      .option("upperBound", "3")
      .option("numPartitions", "2")
      .table("h2.test.people")
```
We need to make `DataFrameReader.table` take the specified options.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested for now. A test will be added after V2 JDBC read is implemented.

Closes #29535 from huaxingao/table_options.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-31 13:21:15 +00:00
ulysses 339eec5f32 [SPARK-20680][SQL][FOLLOW-UP] Add HiveVoidType in HiveClientImpl
### What changes were proposed in this pull request?

Discussion with [comment](https://github.com/apache/spark/pull/29244#issuecomment-671746329).

Add a `HiveVoidType` class in `HiveClientImpl` so that we can replace `NullType` with `HiveVoidType` before calling the Hive client.
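
A minimal sketch of the idea (illustrative only; the real `HiveVoidType` is a `DataType` defined inside `HiveClientImpl`):

```scala
import org.apache.spark.sql.types.{DataType, NullType}

// Hypothetical helper: Hive names the null type "void", so render NullType
// as "void" instead of its Catalyst catalog string when talking to Hive.
def toHiveTypeName(dt: DataType): String = dt match {
  case NullType => "void"
  case other    => other.catalogString
}
```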

### Why are the changes needed?

Better compatibility with Hive.

More details in [#29244](https://github.com/apache/spark/pull/29244).

### Does this PR introduce _any_ user-facing change?

Yes, users can create a view with a null-type column in Hive.

### How was this patch tested?

New test.

Closes #29423 from ulysses-you/add-HiveVoidType.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-14 06:59:15 +00:00
Max Gekk f664aaaab1 [SPARK-32599][SQL][TESTS] Check the TEXTFILE file format in HiveSerDeReadWriteSuite
### What changes were proposed in this pull request?
- Test TEXTFILE together with the PARQUET and ORC file formats in `HiveSerDeReadWriteSuite`
- Remove the "SPARK-32594: insert dates to a Hive table" added by #29409

### Why are the changes needed?
- To improve test coverage and to test another row SerDe, `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`.
- The removed test is not needed anymore because the bug reported in SPARK-32594 is triggered by the TEXTFILE file format too.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite `HiveSerDeReadWriteSuite`.

Closes #29417 from MaxGekk/textfile-HiveSerDeReadWriteSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-12 20:59:59 +09:00
angerszhu 4cf8c1d07d [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec
### What changes were proposed in this pull request?

1. Extract the common (no-serde) test cases into BasicScriptTransformationExecSuite
2. Add more test cases for no-serde mode covering the supported data types and behavior in `BasicScriptTransformationExecSuite`
3. Add more test cases for Hive serde mode covering the supported types and behavior in `HiveScriptTransformationExecSuite`

### Why are the changes needed?
Improve test coverage of Script Transformation

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #29401 from AngersZhuuuu/SPARK-32400.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-12 06:02:42 +00:00
Max Gekk 0477d23467 [SPARK-32594][SQL] Fix serialization of dates inserted to Hive tables
### What changes were proposed in this pull request?
Fix `DaysWritable` by overriding the parent method `def get(doesTimeMatter: Boolean): Date` from `DateWritable` instead of only `Date get()`, because the former is the overload that is actually used. The bug occurs because `HiveOutputWriter.write()` transitively calls `def get(doesTimeMatter: Boolean): Date` with the default implementation from the parent class `DateWritable`, which doesn't respect date rebases and uses an uninitialized `daysSinceEpoch` (0, which is `1970-01-01`).
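
A sketch of the shape of the fix (illustrative only; the real `DaysWritable` lives in Spark's Hive module and also handles Julian/Gregorian rebasing):

```scala
import java.sql.Date
import org.apache.hadoop.hive.serde2.io.DateWritable

// Hypothetical, simplified wrapper: the essential point is overriding
// get(doesTimeMatter), the overload reached from HiveOutputWriter.write(),
// so that it uses this wrapper's day count rather than DateWritable's
// uninitialized internal state.
class DaysWritableSketch(var days: Int) extends DateWritable {
  override def getDays: Int = days
  override def get(doesTimeMatter: Boolean): Date =
    new Date(DateWritable.daysToMillis(days, doesTimeMatter))
}
```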

### Why are the changes needed?
The changes fix the bug:
```sql
spark-sql> CREATE TABLE table1 (d date);
spark-sql> INSERT INTO table1 VALUES (date '2020-08-11');
spark-sql> SELECT * FROM table1;
1970-01-01
```
The expected result of the last SQL statement must be **2020-08-11** but got **1970-01-01**.

### Does this PR introduce _any_ user-facing change?
Yes. After the fix, `INSERT` work correctly:
```sql
spark-sql> SELECT * FROM table1;
2020-08-11
```

### How was this patch tested?
Add new test to `HiveSerDeReadWriteSuite`

Closes #29409 from MaxGekk/insert-date-into-hive-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-12 13:32:16 +09:00
angerszhu d251443a02 [SPARK-32403][SQL] Refactor current ScriptTransformationExec
### What changes were proposed in this pull request?

This PR comes from the comment: #29085 (comment)

- Extract a common script IOSchema, `ScriptTransformationIOSchema`
- Avoid repeated checks by extracting the output-row processing methods `createOutputIteratorWithoutSerde` and `createOutputIteratorWithSerde`
- Add a default no-serde IO schema, `ScriptTransformationIOSchema.defaultIOSchema`

### Why are the changes needed?
Refactor code

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
NO

Closes #29199 from AngersZhuuuu/spark-32105-followup.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-10 16:37:31 +00:00
GuoPhilipse aa4d3c19fe [SPARK-32560][SQL] Improve exception message at InsertIntoHiveTable.processInsert
### What changes were proposed in this pull request?
Improve the exception message.

### Why are the changes needed?
The previous message lacked single quotes; we improve it to keep the format consistent.
![image](https://user-images.githubusercontent.com/46367746/89595808-15bbc300-d888-11ea-9914-b05ea7b66461.png)

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
No, it only improves the message.

Closes #29376 from GuoPhilipse/improve-exception-message.

Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Co-authored-by: GuoPhilipse <guofei_ok@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-07 14:29:32 +09:00
Max Gekk 6664e282f6 [SPARK-32546][SQL][FOLLOWUP] Add .toSeq to tableNames in HiveClientImpl.listTablesByType
### What changes were proposed in this pull request?
Explicitly convert `tableNames` to `Seq` in `HiveClientImpl.listTablesByType` as it was done by c28a6fa511 (diff-6fd847124f8eae45ba2de1cf7d6296feR769)

### Why are the changes needed?
See PR https://github.com/apache/spark/pull/29111; the conversion is needed to compile with Scala 2.13. The change was accidentally discarded by https://github.com/apache/spark/pull/29363.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Compiling with Scala 2.13.

Closes #29379 from MaxGekk/fix-listTablesByType-for-views-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-06 14:23:56 +00:00
Max Gekk dc96f2f8d6 [SPARK-32546][SQL] Get table names directly from Hive tables
### What changes were proposed in this pull request?
Get table names directly from a sequence of Hive tables in `HiveClientImpl.listTablesByType()`, skipping the conversion of Hive tables to catalog tables.

### Why are the changes needed?
A Hive metastore can be shared across many clients. A client can create tables using a SerDe which is not available on other clients, for instance `ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"`. In the current implementation, other clients get the following exception while getting views:
```
java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
```
when `com.ibm.spss.hive.serde2.xml.XmlSerDe` is not available.

### Does this PR introduce _any_ user-facing change?
Yes. For example, `SHOW VIEWS` returns a list of views instead of throwing an exception.

### How was this patch tested?
- By existing test suites like:
```
$ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"
```
- And manually:

1. Build Spark with Hive 1.2: `./build/sbt package -Phive-1.2 -Phive -Dhadoop.version=2.8.5`

2. Run spark-shell with a custom Hive SerDe, for instance download [json-serde-1.3.8-jar-with-dependencies.jar](https://github.com/cdamak/Twitter-Hive/blob/master/json-serde-1.3.8-jar-with-dependencies.jar) from https://github.com/cdamak/Twitter-Hive:
```
$ ./bin/spark-shell --jars ../Downloads/json-serde-1.3.8-jar-with-dependencies.jar
```

3. Create a Hive table using this SerDe:
```scala
scala> :paste
// Entering paste mode (ctrl-D to finish)

sql(s"""
  |CREATE TABLE json_table2(page_id INT NOT NULL)
  |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  |""".stripMargin)

// Exiting paste mode, now interpreting.
res0: org.apache.spark.sql.DataFrame = []

scala> sql("SHOW TABLES").show
+--------+-----------+-----------+
|database|  tableName|isTemporary|
+--------+-----------+-----------+
| default|json_table2|      false|
+--------+-----------+-----------+

scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
```

4. Quit from the current `spark-shell` and run it without jars:
```
$ ./bin/spark-shell
```

5. Show views. Without the fix, it throws the exception:
```scala
scala> sql("SHOW VIEWS").show
20/08/06 10:53:36 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
	at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
```

After the fix:
```scala
scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
```

Closes #29363 from MaxGekk/fix-listTablesByType-for-views.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-06 08:35:59 +00:00
Terry Kim 45b7212fd3 [SPARK-32401][SQL] Migrate function related commands to use UnresolvedFunc to resolve function identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following function related commands to use `UnresolvedFunc` to resolve function identifier:
- DROP FUNCTION
- DESCRIBE FUNCTION
- SHOW FUNCTIONS

`DropFunctionStatement`, `DescribeFunctionStatement` and `ShowFunctionsStatement` logical plans are replaced with `DropFunction`, `DescribeFunction` and `ShowFunctions` logical plans respectively, and each contains `UnresolvedFunc` as its child so that it can be resolved in `Analyzer`.

### Why are the changes needed?

Migrating to the new resolution framework, which resolves `UnresolvedFunc` in `Analyzer`.

### Does this PR introduce _any_ user-facing change?

The message of exception thrown when a catalog is resolved to v2 has been merged to:
`function is only supported in v1 catalog`

Previously, it printed out the command used. E.g.,:
`CREATE FUNCTION is only supported in v1 catalog`

### How was this patch tested?

Updated existing tests.

Closes #29198 from imback82/function_framework.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-29 03:55:48 +00:00
HyukjinKwon c1140661bf [SPARK-32443][CORE] Use POSIX-compatible command -v in testCommandAvailable
### What changes were proposed in this pull request?

This PR aims to use `command -v` on non-Windows operating systems instead of executing the given command.
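
A sketch of the check along these lines (simplified; the real method is `TestUtils.testCommandAvailable` and may differ in details such as OS detection):

```scala
import scala.sys.process.{Process, ProcessLogger}
import scala.util.Try

// Resolve the command instead of running it: `command -v` only looks the name
// up, so it is fast, quiet, and works for commands like `rm` or `cat` that
// would otherwise print usage or block on stdin.
def testCommandAvailable(command: String): Boolean = {
  val isWindows = System.getProperty("os.name").toLowerCase.contains("win")
  val probe =
    if (isWindows) Seq("cmd.exe", "/C", s"where $command")
    else Seq("sh", "-c", s"command -v $command")
  val attempt = Try(Process(probe).run(ProcessLogger(_ => ())).exitValue())
  attempt.isSuccess && attempt.get == 0
}
```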

### Why are the changes needed?

1. `command` is POSIX-compatible
    - **POSIX.1-2017**:  https://pubs.opengroup.org/onlinepubs/9699919799/utilities/command.html
2. `command` is faster and safer than the direct execution
    - `command` doesn't invoke another process.
```scala
scala> sys.process.Process("ls").run().exitValue()
LICENSE
NOTICE
bin
doc
lib
man
res1: Int = 0
```

3. The existing way behaves inconsistently.
    - `rm` cannot be checked.

**AS-IS**
```scala
scala> sys.process.Process("rm").run().exitValue()
usage: rm [-f | -i] [-dPRrvW] file ...
       unlink file
res0: Int = 64
```

**TO-BE**
```
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
Type in expressions for evaluation. Or try :help.
scala> sys.process.Process(Seq("sh", "-c", s"command -v ls")).run().exitValue()
/bin/ls
val res1: Int = 0
```

4. The existing logic is already broken in a Scala 2.13 environment because it hangs, as shown below.
```scala
$ bin/scala
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
Type in expressions for evaluation. Or try :help.

scala> sys.process.Process("cat").run().exitValue() // hang here.
```

### Does this PR introduce _any_ user-facing change?

No. Although this is inside the `main` source directory, it is used for testing purposes.

```
$ git grep testCommandAvailable | grep -v 'def testCommandAvailable'
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("wc"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable(envCommand))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(!TestUtils.testCommandAvailable("some_nonexistent_command"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable(envCommand))
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:  private lazy val isPythonAvailable: Boolean = TestUtils.testCommandAvailable(pythonExec)
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:    if (TestUtils.testCommandAvailable(pythonExec)) {
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("python"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("echo | sed"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
```

### How was this patch tested?

- **Scala 2.12**: Pass the Jenkins with the existing tests and one modified test.
- **Scala 2.13**: Do the following manually. It should pass instead of `hang`.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.PipedRDDSuite
...
Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29241 from dongjoon-hyun/SPARK-32443.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-27 12:02:43 -07:00
Sean Owen be2eca22e9 [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+
### What changes were proposed in this pull request?

Updates to scalatest 3.2.0. Though the diff looks large, 99% of the changes are updates to the new locations of the scalatest classes.

### Why are the changes needed?

3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.

### Does this PR introduce _any_ user-facing change?

No, only affects tests.

### How was this patch tested?

Existing tests.

Closes #29196 from srowen/SPARK-32398.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 16:20:17 -07:00
Gengliang Wang d0c83f372b [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partitions pruning.

It's a good improvement; however, converting all the predicates to CNF can lead to a very long result, even with grouping of expressions. For example, the following predicate
```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```
will be converted into a very long query (130K characters) against the Hive metastore, and there will be an error:
```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```

Essentially, we just need to traverse the predicate and extract the convertible sub-predicates, as was done in https://github.com/apache/spark/pull/24598. There is no need to maintain the CNF result set.
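
A sketch of that traversal, under simplifying assumptions and not Spark's actual rule, to illustrate how a weaker but convertible predicate can be extracted without materializing the full CNF:

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Or}

// Extract a predicate implied by `cond` that references only the given
// (e.g. partition) columns. For AND, either convertible side is enough;
// for OR, both sides must be convertible, otherwise nothing is extracted.
def extractConvertible(cond: Expression, allowed: Set[String]): Option[Expression] = cond match {
  case And(left, right) =>
    val l = extractConvertible(left, allowed)
    val r = extractConvertible(right, allowed)
    (l ++ r).reduceOption(And(_, _))
  case Or(left, right) =>
    for {
      l <- extractConvertible(left, allowed)
      r <- extractConvertible(right, allowed)
    } yield Or(l, r)
  case other if other.references.forall(a => allowed.contains(a.name)) =>
    Some(other)
  case _ => None
}
```

For the example above, each disjunct like `(p0 = '1' AND p1 = '1')` collapses to its partition-only part, so the pushed-down predicate stays linear in the number of disjuncts instead of exploding like the CNF form.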

### Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed-down predicates are always equal to or shorter than the CNF result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-20 14:17:31 +00:00
SaurabhChawla 6be8b935a4 [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
### What changes were proposed in this pull request?
Spark SQL commands are failing when selecting from ORC tables.
Steps to reproduce:
Example 1 -
Prerequisite: /Users/test/tpcds_scale5data/date_dim is the location of ORC data generated by Hive.
```
val table = """CREATE TABLE `date_dim` (
  `d_date_sk` INT,
  `d_date_id` STRING,
  `d_date` TIMESTAMP,
  `d_month_seq` INT,
  `d_week_seq` INT,
  `d_quarter_seq` INT,
  `d_year` INT,
  `d_dow` INT,
  `d_moy` INT,
  `d_dom` INT,
  `d_qoy` INT,
  `d_fy_year` INT,
  `d_fy_quarter_seq` INT,
  `d_fy_week_seq` INT,
  `d_day_name` STRING,
  `d_quarter_name` STRING,
  `d_holiday` STRING,
  `d_weekend` STRING,
  `d_following_holiday` STRING,
  `d_first_dom` INT,
  `d_last_dom` INT,
  `d_same_day_ly` INT,
  `d_same_day_lq` INT,
  `d_current_day` STRING,
  `d_current_week` STRING,
  `d_current_month` STRING,
  `d_current_quarter` STRING,
  `d_current_year` STRING)
USING orc
LOCATION '/Users/test/tpcds_scale5data/date_dim'"""

spark.sql(table).collect

val u = """select date_dim.d_date_id from date_dim limit 5"""

spark.sql(u).collect
```
Example 2

```
  val table = """CREATE TABLE `test_orc_data` (
  `_col1` INT,
  `_col2` STRING,
  `_col3` INT)
  USING orc"""

spark.sql(table).collect

spark.sql("insert into test_orc_data values(13, '155', 2020)").collect

val df = """select _col2 from test_orc_data limit 5"""
spark.sql(df).collect

```

It fails with the error below:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:133)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)`
```

The reason is that `initBatch` is not getting the schema that is needed to find the column values in OrcFileFormat.scala:
```
batchReader.initBatch(
 TypeDescription.fromString(resultSchemaString)
```

### Why are the changes needed?
Spark SQL queries on ORC tables are failing.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A unit test is added for this. The failing queries were also tested through spark-shell and spark-submit.

Closes #29045 from SaurabhChawla100/SPARK-32234.

Lead-authored-by: SaurabhChawla <saurabhc@qubole.com>
Co-authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-16 13:11:47 +00:00
Sean Owen c28a6fa511 [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation
### What changes were proposed in this pull request?

Same as https://github.com/apache/spark/pull/29078 and https://github.com/apache/spark/pull/28971. This makes the rest of the default modules (i.e. those you get without specifying `-Pyarn` etc.) compile under Scala 2.13. As a result, it does not close the JIRA; it also, of course, does not demonstrate that tests pass under 2.13 yet.

Note, this does not fix the `repl` module; that's separate.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29111 from srowen/SPARK-29292.3.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-15 13:26:28 -07:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Drop support for EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy code and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
Holden Karau 90ac9f975b [SPARK-32004][ALL] Drop references to slave
### What changes were proposed in this pull request?

This change replaces the word "slave" with alternatives matching the context.

### Why are the changes needed?

There is no need to call things "slave"; we might as well use better, clearer names.

### Does this PR introduce _any_ user-facing change?

Yes, the output JSON does change. To allow backwards compatibility, this is an additive change.
The shell scripts for starting & stopping workers are renamed, and for backwards compatibility old scripts are added to call through to the new ones while printing a deprecation message to stderr.

### How was this patch tested?

Existing tests.

Closes #28864 from holdenk/SPARK-32004-drop-references-to-slave.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-07-13 14:05:33 -07:00
angerszhu 6d499647b3 [SPARK-32105][SQL] Refactor current ScriptTransformationExec code
### What changes were proposed in this pull request?

 * Renamed the Hive transform script class `hive/execution/ScriptTransformationExec` to `hive/execution/HiveScriptTransformationExec` (the file is not renamed)
 * Extracted the class `BaseScriptTransformationExec` for common code shared by `SparkScriptTransformationExec` (added in the next PR) and `HiveScriptTransformationExec`
 * Extracted the class `BaseScriptTransformationWriterThread` for the data-writing thread shared by `SparkScriptTransformationWriterThread` (added next to support transform in sql/core) and `HiveScriptTransformationWriterThread`
 * `HiveScriptTransformationWriterThread` additionally supports the Hive serde format
 * Renamed the current `Script` strategies in the hive module to `HiveScript`; the next PR will add `SparkScript` strategies to support transform in sql/core.

Todo List;

- Support transform in sql/core based on `BaseScriptTransformationExec`, which would run the script operator in SQL mode (without Hive).
The output of the script would be read as a string, and column values are extracted using a delimiter (default: tab character)
- For Hive, serdes must be used by default; without Hive we can run without a serde
- Cleanup past hacks that are observed (and people suggest / report), such as
       - [Solve string value error about Date/Timestamp in ScriptTransform](https://issues.apache.org/jira/browse/SPARK-31947)
       - [support use transform with aggregation](https://issues.apache.org/jira/browse/SPARK-28227)
       - [support array/map as transform's input](https://issues.apache.org/jira/browse/SPARK-22435)
- Use a code-gen projection to serialize rows to the output stream

### Why are the changes needed?
Support running transform in SQL mode without Hive.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added UT

Closes #27983 from AngersZhuuuu/follow_spark_15694.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-13 08:58:25 +00:00
HyukjinKwon b84ed4146d [SPARK-32245][INFRA] Run Spark tests in Github Actions
### What changes were proposed in this pull request?

This PR aims to run the Spark tests in Github Actions.

To briefly explain the main idea:

- Reuse `dev/run-tests.py` with SBT build
- Reuse the modules in `dev/sparktestsupport/modules.py` to test each module
- Pass the modules to test into `dev/run-tests.py` directly via `TEST_ONLY_MODULES` environment variable. For example, `pyspark-sql,core,sql,hive`.
- `dev/run-tests.py` _does not_ take the dependent modules into account but solely the specified modules to test.

Another thing to note is the `SlowHiveTest` annotation. Running the tests in the Hive module takes too long, so the slow tests are extracted and run as a separate job. They were identified based on the actual elapsed time in Jenkins:

![Screen Shot 2020-07-09 at 7 48 13 PM](https://user-images.githubusercontent.com/6477701/87050238-f6098e80-c238-11ea-9c4a-ab505af61381.png)

So, Hive tests are separated into two jobs: one runs the slow test cases, and the other runs the remaining test cases.

_Note that_ the current GitHub Actions build virtually copies what the default PR builder on Jenkins does (without other profiles such as JDK 11, Hadoop 2, etc.). The only exception is Kinesis https://github.com/apache/spark/pull/29057/files#diff-04eb107ee163a50b61281ca08f4e4c7bR23

### Why are the changes needed?

Last week and onwards, the Jenkins machines became very unstable for many reasons:
  - Apparently, the machines became extremely slow. Almost all tests can't pass.
  - One machine (worker 4) started to have the corrupt `.m2` which fails the build.
  - Documentation build fails time to time for an unknown reason in Jenkins machine specifically. This is disabled for now at https://github.com/apache/spark/pull/29017.
  - Almost all PRs are basically blocked by this instability currently.

The advantages of using Github Actions:
  - To avoid depending on few persons who can access to the cluster.
  - To reduce the elapsed time in the build - we could split the tests (e.g., SQL, ML, CORE) and run them in parallel so the total build time is significantly reduced.
  - To control the environment more flexibly.
  - Other contributors can test and propose to fix Github Actions configurations so we can distribute this build management cost.

Note that:
- The current build in Jenkins takes _more than 7 hours_. With Github actions it takes _less than 2 hours_
- We can now control the environments especially for Python easily.
- The test and build look more stable than the Jenkins'.

### Does this PR introduce _any_ user-facing change?

No, dev-only change.

### How was this patch tested?

Tested at https://github.com/HyukjinKwon/spark/pull/4

Closes #29057 from HyukjinKwon/migrate-to-github-actions.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 13:09:06 -07:00
Shixiong Zhu c8779d9dfc [SPARK-32256][SQL][TEST-HADOOP2.7] Force to initialize Hadoop VersionInfo in HiveExternalCatalog
### What changes were proposed in this pull request?

Force the initialization of Hadoop `VersionInfo` in `HiveExternalCatalog` to make sure Hive can get the Hadoop version when using the isolated classloader.
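
The essence of the fix as a sketch (the real change lives in `HiveExternalCatalog`; the point is simply to touch `VersionInfo` while the regular classloader, which can see hadoop-common, is still the thread context classloader):

```scala
// Sketch: eagerly trigger VersionInfo's static initialization before any code
// switches the thread context classloader to the isolated Hive classloader.
// After this, Hive's ShimLoader sees a real "x.y.z" version string instead of
// "Unknown", even when running under the isolated metastore classloader.
val hadoopVersion = org.apache.hadoop.util.VersionInfo.getVersion()
```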

### Why are the changes needed?

This is a regression in Spark 3.0.0 because we switched the default Hive execution version from 1.2.1 to 2.3.7.

Spark allows the user to set `spark.sql.hive.metastore.jars` to specify the jars used to access the Hive metastore. These jars are loaded by the isolated classloader. Because we also share Hadoop classes with the isolated classloader, the user doesn't need to add Hadoop jars to `spark.sql.hive.metastore.jars`, which means the hadoop-common jar is not available to the isolated classloader. If Hadoop `VersionInfo` is not initialized before we switch to the isolated classloader, and we then try to initialize it using the isolated classloader (the current thread context classloader), it fails and reports `Unknown`, which causes Hive to throw the following exception:

```
java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format)
	at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147)
	at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122)
	at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88)
	at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:219)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:67)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108)
	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3349)
	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:217)
	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:204)
	at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:331)
	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:292)
	at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:262)
	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:247)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:511)
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:175)
	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:128)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:301)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:431)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:324)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:72)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:71)
	at org.apache.spark.sql.hive.client.HadoopVersionInfoSuite.$anonfun$new$1(HadoopVersionInfoSuite.scala:63)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
```

Technically, this is indeed an issue of Hadoop `VersionInfo`, which has been fixed: https://issues.apache.org/jira/browse/HADOOP-14067. But since we still support old Hadoop versions, we should fix it on the Spark side.

Why this issue starts to happen in Spark 3.0.0?

In Spark 2.4.x, we use Hive 1.2.1 by default. It triggers `VersionInfo` initialization in the static code of the `Hive` class. This happens when we load the `HiveClientImpl` class, because the `HiveClientImpl.client` method refers to the `Hive` class. At this moment, the thread context classloader is not yet the isolated classloader, so it can access the hadoop-common jar on the classpath and initialize `VersionInfo` correctly.

In Spark 3.0.0, we use Hive 2.3.7. The static codes of `Hive` class are not accessing `VersionInfo` because of the change in https://issues.apache.org/jira/browse/HIVE-11657. Instead, accessing `VersionInfo` happens when creating a `Hive` object (See the above stack trace). This happens here https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L260. But we switch to the isolated classloader before calling `HiveClientImpl.client` (See https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L283). This is exactly what I mentioned above: `If Hadoop VersionInfo is not initialized before we switch to the isolated classloader, and we try to initialize it using the isolated classloader (the current thread context classloader), it will fail`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The new regression test added in this PR.

Note that the new UT doesn't fail with the default profiles (-Phadoop-3.2) because it's already fixed at Hadoop 3.1. Please use the following to verify this.
```
build/sbt -Phadoop-2.7 -Phive "hive/testOnly *.HadoopVersionInfoSuite"
```

Closes #29059 from zsxwing/SPARK-32256.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-10 21:14:29 +09:00
HyukjinKwon 01e9dd9050 [SPARK-20680][SQL][FOLLOW-UP] Revert NullType.simpleString from 'unknown' to 'null'
### What changes were proposed in this pull request?

This PR proposes to partially revert https://github.com/apache/spark/pull/28833: `NullType.simpleString` goes back from `unknown` to `null`.

### Why are the changes needed?

- Technically speaking, it's orthogonal with the issue itself, SPARK-20680.
- It needs some more discussion, see https://github.com/apache/spark/pull/28833#issuecomment-655277714

### Does this PR introduce _any_ user-facing change?

It reverts back the user-facing changes at https://github.com/apache/spark/pull/28833.
The simple string of `NullType` is back to `null`.

### How was this patch tested?

I just logically reverted. Jenkins should test it out.

Closes #29041 from HyukjinKwon/SPARK-20680.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 19:44:08 -07:00
xiepengjie 523e238d2a [SPARK-32192][SQL] Print column name when throws ClassCastException
### What changes were proposed in this pull request?

When somebody changes the type of a partitioned table's field, Spark throws a ClassCastException. For example, we have a table like this:
```
drop table if exists cast_exception_test;

create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc;

insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1');
```

If you change the field's type in Hive and then query the old partition, Spark throws a ClassCastException, but Hive does not:
```
-- change the field's type using hive
alter table cast_exception_test change column c1 c1 string;
-- hive correct,  but spark throws ClassCastException
select * from cast_exception_test where dt='2020-04-08';
```

### Why are the changes needed?

When the table has many fields, we don't know which field has been changed. If we log the field name for this exception, it is very helpful for troubleshooting.

### Does this PR introduce _any_ user-facing change?

When the ClassCastException is caused by a changed field type, you can find which field has the problem in the executor logs:
```
20/04/09 17:22:05 ERROR hive.HadoopTableReader: Exception thrown in field <c1>
```
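
A sketch of the idea behind that log line (illustrative only; the actual change wraps the field-filling logic in `HadoopTableReader`):

```scala
// Hypothetical helper: wrap the per-column conversion so that a
// ClassCastException also reports which column failed before rethrowing.
def fillField(row: Array[Any], ordinal: Int, fieldName: String)(convert: => Any): Unit = {
  try {
    row(ordinal) = convert
  } catch {
    case e: ClassCastException =>
      System.err.println(s"Exception thrown in field <$fieldName>")
      throw e
  }
}
```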

### How was this patch tested?

First, prepare the test data, the table is partitioned and stored as orc:
```
drop table if exists cast_exception_test;
create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc;
insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1');
```

Then, change the field's type in Hive:
```
alter table cast_exception_test change column c1 c1 string;
```

Now the metadata of the table has been modified, but the partition's metadata, which is stored in the ORC file and in the Hive metastore's MySQL database, is still old. So the query throws a ClassCastException in Spark, because Spark uses the table's metadata, which differs from the ORC file's metadata, whereas Hive uses the partition's metadata, which matches the ORC file's metadata.

If you query the old partition, Spark throws a ClassCastException, but Hive does not:
```
select * from cast_exception_test where dt='2020-04-08';
```

Closes #29010 from StefanXiepj/SPARK-32192.

Authored-by: xiepengjie <xiepengjie@didiglobal.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-09 09:33:54 -05:00
Erik Erlandson 1cb5bfc47a [SPARK-32159][SQL] Fix integration between Aggregator[Array[_], _, _] and UnresolvedMapObjects
Context: The fix for SPARK-27296 introduced by #25024 allows `Aggregator` objects to appear in queries. This works fine for aggregators with atomic input types, e.g. `Aggregator[Double, _, _]`.

However it can cause a null pointer exception if the input type is `Array[_]`.  This was historically considered an ignorable case for serialization of `UnresolvedMapObjects`, but the new ScalaAggregator class causes these expressions to be serialized over to executors because the resolve-and-bind is being deferred.

### What changes were proposed in this pull request?
A new rule `ResolveEncodersInScalaAgg` that performs the resolution of the expressions contained in the encoders so that properly resolved expressions are serialized over to executors.

### Why are the changes needed?
Applying an aggregator of the form `Aggregator[Array[_], _, _]` using `functions.udaf()` currently causes a null pointer error in Catalyst.
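
For illustration, a hedged example of the pattern that previously hit the null pointer error (the aggregator and names below are made up; the fix makes this kind of registration work):

```scala
import org.apache.spark.sql.{Encoder, Encoders, functions}
import org.apache.spark.sql.expressions.Aggregator

// An aggregator whose input type is an array, registered via functions.udaf.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Array[Double]): Double = b + a.sum
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(r: Double): Double = r
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Hypothetical usage in a session with a table t(k, arr):
// spark.udf.register("array_sum", functions.udaf(ArraySum))
// spark.sql("SELECT k, array_sum(arr) FROM t GROUP BY k")
```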

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A unit test has been added that does aggregation with array types for input, buffer, and output. I have done additional testing with my own custom aggregators in the spark REPL.

Closes #28983 from erikerlandson/fix-spark-32159.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-09 08:42:20 +00:00
LantaoJin b5297c43b0 [SPARK-20680][SQL] Spark-sql do not support for creating table with void column datatype
### What changes were proposed in this pull request?

This is the new PR which to address the close one #17953

1. support "void" primitive data type in the `AstBuilder`, point it to `NullType`
2. forbid creating tables with VOID/NULL column type

### Why are the changes needed?

1. Spark is incompatible with Hive's void type. When a Hive table schema contains the void type, DESC TABLE throws an exception in Spark.

>hive> create table bad as select 1 x, null z from dual;
>hive> describe bad;
OK
x	int
z	void

In Spark 2.0.x, describing this table works as expected:
>spark-sql> describe bad;
x       int     NULL
z       void    NULL
Time taken: 4.431 seconds, Fetched 2 row(s)

But in the latest Spark version, it fails with `SparkException: Cannot recognize hive type string: void`:

>spark-sql> describe bad;
17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
org.apache.spark.SparkException: Cannot recognize hive type string: void
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
DataType void() is not supported.(line 1, pos 0)
== SQL ==
void
^^^
        ... 61 more
org.apache.spark.SparkException: Cannot recognize hive type string: void

2. Hive CTAS statements throw an error when the select clause has a NULL/VOID type column, since HIVE-11217.
In Spark, creating a table with a VOID/NULL column should throw a readable exception message, including:

- create data source table (using parquet, json, ...)
- create hive table (with or without stored as)
- CTAS

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Add unit tests

Closes #28833 from LantaoJin/SPARK-20680_COPY.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-07 18:58:01 -07:00
TJX2014 0acad589e1 [SPARK-32156][SPARK-31061][TESTS][SQL] Refactor two similar test cases from in HiveExternalCatalogSuite
### What changes were proposed in this pull request?
1. Merge two similar tests for SPARK-31061 and clean up the code.
2. Fix a table alter issue caused by a lost path.

### Why are the changes needed?
Because these two tests for SPARK-31061 are very similar and can be merged.
Also, the first test case should use `rawTable` instead of `parquetTable` to alter.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #28980 from TJX2014/master-follow-merge-spark-31061-test-case.

Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-02 10:15:10 -07:00
Wenchen Fan 6edb20df83 [SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/28760 to fix the remaining issues (a usage sketch follows the list):
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2.
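
A hedged usage sketch (the option key is a standard Hadoop configuration key; the address and path are hypothetical):

```scala
// Hadoop file system settings passed as data source options should now also be honored
// when inferring the schema, resolving the qualified path, and refreshing the cache,
// not only when the file system is first initialized.
val df = spark.read
  .option("fs.defaultFS", "hdfs://nn.example.com:8020")
  .parquet("/warehouse/events")
```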

### Why are the changes needed?

We didn't catch these issues in https://github.com/apache/spark/pull/28760 because the test case only checks the error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only covers the first time.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

rewrite the test to make sure the entire data source read/write action can succeed.

Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-07-02 06:09:54 +08:00
Wenchen Fan 7dbd90b68d [SPARK-31797][SQL][FOLLOWUP] TIMESTAMP_SECONDS supports fractional input
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/28534 , to make `TIMESTAMP_SECONDS` function support fractional input as well.

### Why are the changes needed?

Previously, the cast function could cast fractional values to timestamp. Now that we suggest users use these new functions instead, we need to cover all the cast use cases.

### Does this PR introduce _any_ user-facing change?

Yes, now `TIMESTAMP_SECONDS` function accepts fractional input.
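
A small illustrative query (a sketch; the fractional part becomes sub-second precision of the resulting timestamp):

```scala
// Previously only integral input was accepted; fractional seconds now work too.
spark.sql("SELECT TIMESTAMP_SECONDS(1230219000.123) AS ts").show(false)
```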

### How was this patch tested?

new tests

Closes #28956 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-01 08:03:46 -07:00
angerszhu 15fb5d7677 [SPARK-28169][SQL] Convert scan predicate condition to CNF
### What changes were proposed in this pull request?
Spark can't push down a scan predicate condition containing **Or**.
For example, say I have a table `default.test` whose partition column is `dt`,
and I use the query:
```
select * from default.test
where dt=20190625 or (dt = 20190626 and id in (1,2,3) )
```

In this case, Spark resolves the **Or** condition as a single expression, and since this expression references `id`, it can't be pushed down.

Based on PR https://github.com/apache/spark/pull/28733, in my PR, for SQL like
`select * from default.test`
`where dt = 20190626 or (dt = 20190627 and xxx="a")`

this condition, `dt = 20190626 or (dt = 20190627 and xxx="a")`, will be converted to CNF:
```
(dt = 20190626 or dt = 20190627) and (dt = 20190626 or xxx = "a" )
```
Then the condition `dt = 20190626 or dt = 20190627` can be pushed down for partition pruning.

### Why are the changes needed?
Optimize partition pruning

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #28805 from AngersZhuuuu/cnf-for-partition-pruning.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-01 12:00:15 +00:00
GuoPhilipse f0e6d0ec13 [SPARK-31710][SQL] Fail casting numeric to timestamp by default
## What changes were proposed in this pull request?
we fail casting from numeric to timestamp by default.

## Why are the changes needed?
Casting from numeric to timestamp is non-standard; meanwhile, it may generate different results between Spark and other systems, for example Hive.

## Does this PR introduce any user-facing change?
Yes. Users cannot cast numeric to timestamp directly; they have to use the following functions to achieve the same effect: TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS.
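
A hedged sketch of the migration path described above:

```scala
// Casting numerics to timestamp now fails by default:
// spark.sql("SELECT CAST(1 AS TIMESTAMP)")
// Use the dedicated constructor functions instead:
spark.sql("SELECT TIMESTAMP_SECONDS(1), TIMESTAMP_MILLIS(1000), TIMESTAMP_MICROS(1000000)").show()
```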

## How was this patch tested?
unit test added

Closes #28593 from GuoPhilipse/31710-fix-compatibility.

Lead-authored-by: GuoPhilipse <guofei_ok@126.com>
Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-16 08:35:35 +00:00
GuoPhilipse 912d45df7c [SPARK-31954][SQL] Delete duplicate testcase in HiveQuerySuite
### What changes were proposed in this pull request?
remove duplicate test cases

### Why are the changes needed?
improve test quality

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
No  test

Closes #28782 from GuoPhilipse/31954-delete-duplicate-testcase.

Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Co-authored-by: GuoPhilipse <guofei_ok@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-11 22:03:40 +09:00
turbofei 717ec5e9e3 [SPARK-29295][SQL][FOLLOWUP] Dynamic partition map parsed from partition path should be case insensitive
### What changes were proposed in this pull request?

This is a follow up of https://github.com/apache/spark/pull/25979.
When we insert overwrite into an external Hive partitioned table with an upper-case dynamic partition key, an exception is thrown, like:
```
org.apache.spark.SparkException: Dynamic partition key P1 is not among written partition paths.
```
The root cause is that the Hive metastore is not case-preserving and keeps partition columns with lower-cased names; see details in:

ddd8d5f5a0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (L895-L901)
e28914095a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala (L228-L234)

In this PR, we convert the dynamic partition map to a case-insensitive map.
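
A simplified, hedged sketch of the idea (not the exact code in `InsertIntoHiveTable`; the values are illustrative):

```scala
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// The metastore keeps partition columns lower-cased, so look the written partition
// values up case-insensitively instead of failing on an upper-case key like `P1`.
val writtenPartitions = Map("p1" -> "2020")   // as kept by the metastore
val lookup = CaseInsensitiveMap(writtenPartitions)
assert(lookup.contains("P1"))                 // the user-specified casing now matches
```
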
### Why are the changes needed?

To fix the issue when inserting overwrite into an external Hive partitioned table with an upper-case dynamic partition key.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #28765 from turboFei/SPARK-29295-follow-up.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 15:57:18 +00:00
LantaoJin 5079831106 [SPARK-31904][SQL] Fix case sensitive problem of char and varchar partition columns
### What changes were proposed in this pull request?
```sql
CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
SELECT * FROM t2 WHERE b = 'A';
```
The above SQL throws a MetaException:

> Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810)
	... 114 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string, or integral types)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583)
	at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768)
	at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232)
	at org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
	at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
	at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679)
	... 119 more

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add a unit test.

Closes #28724 from LantaoJin/SPARK-31904.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-06 07:35:25 +09:00
lipzhu d79a8a88b1 [SPARK-31834][SQL] Improve error message for incompatible data types
### What changes were proposed in this pull request?
We should use `dataType.catalogString` to unify the data type mismatch message.
Before:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': StringType to IntegerType;
```

After:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': string to int;
```

### How was this patch tested?
UT.

Closes #28654 from lipzhu/SPARK-31834.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-02 21:07:10 +09:00
Maryann Xue 45864faaf2 [SPARK-31862][SQL] Remove exception wrapping in AQE
### What changes were proposed in this pull request?

This PR removes the excessive exception wrapping in AQE so that error messages are less verbose and mostly consistent with non-AQE execution. Exceptions from stage materialization are now only wrapped with `SparkException` if there are multiple stage failures. Also, stage-cancelling errors will no longer be included as part of the exception thrown, but will just be logged as errors.

### Why are the changes needed?

This will make the AQE error reporting more readable and debuggable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #28668 from maryannxue/spark-31862.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-29 04:23:38 +00:00
Kent Yao 1f29f1ba58 [SPARK-31684][SQL] Overwrite partition failed with 'WRONG FS' when the target partition is not belong to the filesystem as same as the table
### What changes were proposed in this pull request?

With SPARK-18107, we disable the underlying replace (overwrite) and instead do the delete on the Spark side and only the copy on the Hive side, to bypass the performance issue - [HIVE-11940](https://issues.apache.org/jira/browse/HIVE-11940).

However, if the table location and the partition location do not belong to the same `FileSystem`, we should not disable the Hive overwrite. Otherwise, Hive will use the `FileSystem` instance belonging to the table location to copy files, which will fail in `FileSystem#checkPath`:
https://github.com/apache/hive/blob/rel/release-2.3.7/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1657

In this PR, for Hive 2.0.0 and onwards, as [HIVE-11940](https://issues.apache.org/jira/browse/HIVE-11940) has been fixed and there is no performance issue anymore, we leave the overwrite logic to Hive to avoid the failure in `FileSystem#checkPath`.

**NOTE THAT** for Hive 2.2.0 and earlier, if the table and partition locations do not belong to the same file system, we will still get the same error thrown by the Hive encryption check due to [HIVE-14380](https://issues.apache.org/jira/browse/HIVE-14380), which needs to be fixed in another ticket, SPARK-31675.
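
A hedged sketch of the guarding condition described above (names are illustrative, not the actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Only keep the Spark-side delete + Hive-side copy trick when both locations live on
// the same FileSystem; otherwise leave the whole overwrite to Hive.
def sameFileSystem(tableLocation: Path, partitionLocation: Path, conf: Configuration): Boolean =
  tableLocation.getFileSystem(conf).getUri == partitionLocation.getFileSystem(conf).getUri
```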

### Why are the changes needed?

Bugfix. A logical table can be decoupled from the storage layer and may contain data from remote storage systems.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Currently verified manually. Added benchmark tests:

```sql
-INSERT INTO DYNAMIC                                7742           7918         248          0.0      756044.0       1.0X
-INSERT INTO HYBRID                                 1289           1307          26          0.0      125866.3       6.0X
-INSERT INTO STATIC                                  371            393          38          0.0       36219.4      20.9X
-INSERT OVERWRITE DYNAMIC                           8456           8554         138          0.0      825790.3       0.9X
-INSERT OVERWRITE HYBRID                            1303           1311          12          0.0      127198.4       5.9X
-INSERT OVERWRITE STATIC                             434            447          13          0.0       42373.8      17.8X
+INSERT INTO DYNAMIC                                7382           7456         105          0.0      720904.8       1.0X
+INSERT INTO HYBRID                                 1128           1129           1          0.0      110169.4       6.5X
+INSERT INTO STATIC                                  349            370          39          0.0       34095.4      21.1X
+INSERT OVERWRITE DYNAMIC                           8149           8362         301          0.0      795821.8       0.9X
+INSERT OVERWRITE HYBRID                            1317           1318           2          0.0      128616.7       5.6X
+INSERT OVERWRITE STATIC                             387            408          37          0.0       37804.1      19.1X
```

`+` for master, `-` for this PR; both using Hive 2.3.7.

Closes #28511 from yaooqinn/SPARK-31684.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-19 14:08:51 +00:00
Jungtaek Lim (HeartSaVioR) d2bec5e265 [SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

This patch effectively reverts SPARK-30098 via below changes:

* Removed the config
* Removed the changes done in parser rule
* Removed the usage of config in tests
  * Removed tests which depend on the config
  * Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098
* Reflect the change into docs (migration doc, create table syntax)

### Why are the changes needed?

SPARK-30098 brought confusion and frustration around the CREATE TABLE DDL query, and we agreed that the change had a bad effect.

Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details.

### Does this PR introduce _any_ user-facing change?

No, compared to Spark 2.4.x. End users who experimented with the Spark 3.0.0 previews will see the behavior go back to that of Spark 2.4.x, but I believe we don't guarantee compatibility in preview releases.

### How was this patch tested?

Existing UTs.

Closes #28517 from HeartSaVioR/revert-SPARK-30098.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:27:23 +00:00
Max Gekk 5539ecfdac [SPARK-31725][CORE][SQL][TESTS] Set America/Los_Angeles time zone and Locale.US in tests by default
### What changes were proposed in this pull request?
Set the default time zone and locale in the default constructor of `SparkFunSuite` (a sketch follows the list):
- Default time zone to `America/Los_Angeles`
- Default locale to `Locale.US`
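
A hedged sketch of what such defaults look like (illustrative, not the exact diff):

```scala
import java.util.{Locale, TimeZone}

// Applied once in the test base class so every suite runs with the same defaults.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
Locale.setDefault(Locale.US)
```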

### Why are the changes needed?
1. To deduplicate code by moving common time zone and locale settings to one place, `SparkFunSuite`.
2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running all affected test suites

Closes #28548 from MaxGekk/timezone-settings-SparkFunSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:26:00 +00:00
Dongjoon Hyun 5d90886523
[SPARK-31716][SQL] Use fallback versions in HiveExternalCatalogVersionsSuite
### What changes were proposed in this pull request?

This PR aims to provide fallback versions instead of `Nil` in `HiveExternalCatalogVersionsSuite`. The provided fallback Spark versions recover the Jenkins jobs instead of failing.

### Why are the changes needed?

Currently, `HiveExternalCatalogVersionsSuite` is aborted in all Jenkins jobs except the JDK11 Jenkins jobs, which don't have old Spark releases supporting JDK11.
```
HiveExternalCatalogVersionsSuite:
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
  Exception encountered when invoking run on a nested suite - Fail to get the lates Spark versions to test. (HiveExternalCatalogVersionsSuite.scala:180)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins

Closes #28536 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-15 00:30:25 -07:00
sunke.03 ddbce4edee [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
### What changes were proposed in this pull request?

This PR tries to fix a bug in `org.apache.spark.sql.hive.execution.ScriptTransformationExec`. The bug appeared in our online cluster. `ScriptTransformationExec` should throw an exception when a user uses a Python script that contains a parse error, but the current implementation may miss this failure case.

### Why are the changes needed?

When a user uses a Python script that contains a parse error, there is no output. So `scriptOutputReader.next(scriptOutputWritable) <= 0` matches, and we use `checkFailureAndPropagate()` to check the `proc`. But the `proc` may still be alive and `writerThread.exception` is not defined, so `checkFailureAndPropagate` cannot detect this failure case. In the end, the Spark SQL job runs successfully and returns no result. In fact, the Spark SQL job should fail and show the exception properly.

For example, an erroneous Python script is shown below.
``` python
# encoding: utf8
import unknow_module
import sys

for line in sys.stdin:
    print line
```
The bug can be reproduced by running the following code in our cluster.
```
spark.range(100*100).toDF("index").createOrReplaceTempView("test")
spark.sql("select TRANSFORM(index) USING 'python error_python.py' as new_index from test").collect.foreach(println)
```
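
A hedged sketch of the fix idea in the title, i.e. waiting for the child process and surfacing a non-zero exit code instead of silently returning an empty result (illustrative, not the actual `ScriptTransformationExec` code):

```scala
// Launch the transformation script; the script name comes from the example above.
val proc = new ProcessBuilder("python", "error_python.py").start()
// ... feed stdin and drain stdout as usual ...
val exitCode = proc.waitFor()
if (exitCode != 0) {
  throw new IllegalStateException(s"Transformation script exited with status $exitCode")
}
```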

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing UT

Closes #27724 from slamke/transformation.

Authored-by: sunke.03 <sunke.03@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-14 13:55:24 +00:00
Wenchen Fan fd2d55c991 [SPARK-31405][SQL] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files
### What changes were proposed in this pull request?

When reading/writing datetime values that are before the rebase switch day from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly choose whether to rebase or not.

### Why are the changes needed?

Rebasing or not rebasing leads to different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception as it only affects ancient datetime values.

### Does this PR introduce _any_ user-facing change?

Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message asking them to set a config.

### How was this patch tested?

updated tests

Closes #28477 from cloud-fan/rebase.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-14 12:32:40 +09:00
Max Gekk a3fafddf39 [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
### What changes were proposed in this pull request?
Generates java.time.LocalDate/java.time.Instant for DateType/TimestampType in `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`.

### Why are the changes needed?
To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`.

Closes #28502 from MaxGekk/random-java8-datetime.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 14:05:31 +00:00
Max Gekk 32a5398b65 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps
### What changes were proposed in this pull request?
Modified `RandomDataGenerator.forType` for DateType and TimestampType to generate special date/timestamp values with 0.5 probability. This triggers dictionary encoding in the Parquet datasource test HadoopFsRelationTest "test all data types". Currently, dictionary encoding is tested only for numeric types like ShortType.
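
A hedged, simplified sketch of the generation strategy (illustrative, not the `RandomDataGenerator` code; the special values below are assumptions):

```scala
import java.sql.Date
import scala.util.Random

// Pick one of a few special dates half of the time so repeated values show up in the
// generated column and Parquet can apply dictionary encoding.
val specialDates = Seq(
  Date.valueOf("0001-01-01"), Date.valueOf("1582-10-15"),
  Date.valueOf("1970-01-01"), Date.valueOf("9999-12-31"))

def nextDate(rand: Random): Date =
  if (rand.nextBoolean()) specialDates(rand.nextInt(specialDates.length))
  else Date.valueOf(java.time.LocalDate.ofEpochDay(rand.nextInt(365 * 200).toLong))
```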

### Why are the changes needed?
To extend test coverage. Currently, the probability of exercising dictionary encoding in the test HadoopFsRelationTest "test all data types" for DateType and TimestampType is close to zero, because dates/timestamps are uniformly distributed over a wide range and the chance of generating the same value twice is pretty low. As a result, the Parquet datasource cannot apply dictionary encoding for such column types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`.

Closes #28481 from MaxGekk/test-random-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 12:59:41 +00:00
turbofei 8d1f7d2a4a [SPARK-31467][SQL][TEST] Refactor the sql tests to prevent TableAlreadyExistsException
### What changes were proposed in this pull request?
If we add a UT in hive/SQLQuerySuite or other SQL test suites and use a table named `test`,
we may hit a TableAlreadyExistsException.

```
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default'
```

The reason is that some tests do not clean up their tables/views.
In this PR, I add `withTempViews` for these tests.
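
A hedged sketch of the pattern, written in a test-suite context (the helper name follows the description above; treat the exact signature, the JIRA placeholder, and the helpers `sql`/`checkAnswer`/`Row` as assumptions about the suite's mixins):

```scala
// Inside a suite that mixes in SQLTestUtils: the view is dropped even if the body fails,
// so a later test can safely reuse the name `test`.
test("SPARK-XXXXX: query against a temp view named test") {
  withTempView("test") {
    sql("CREATE TEMPORARY VIEW test AS SELECT 1 AS id")
    checkAnswer(sql("SELECT id FROM test"), Row(1))
  }
}
```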

### Why are the changes needed?
To fix the TableAlreadyExistsException issue when adding a UT that uses a table named `test` (or similar) in some SQL test suites, such as hive/SQLQuerySuite.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existed UT.

Closes #28239 from turboFei/SPARK-31467.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-05 15:14:33 +09:00
Terry Kim 36803031e8 [SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views
### What changes were proposed in this pull request?

This PR addresses two things:
- `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921)
- `SHOW TBLPROPERTIES` on a temporary view should return an empty result (the 2.4 behavior) instead of throwing `AnalysisException`

### Why are the changes needed?

It's a bug.

### Does this PR introduce any user-facing change?

Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key                              |value        |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2            |
|view.query.out.col.0             |c1           |
|view.query.out.numCols           |1            |
|p2                               |v2           |
|view.catalogAndNamespace.part.0  |spark_catalog|
|p1                               |v1           |
|view.catalogAndNamespace.part.1  |default      |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```

### How was this patch tested?

Added tests.

Closes #28375 from imback82/show_tblproperties_followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:06:45 +00:00
Dongjoon Hyun 79eaaaf6da
[SPARK-31580][BUILD] Upgrade Apache ORC to 1.5.10
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.5.10.

### Why are the changes needed?

Apache ORC 1.5.10 is a maintenance release with the following patches.

- [ORC-621](https://issues.apache.org/jira/browse/ORC-621) Need reader fix for ORC-569
- [ORC-616](https://issues.apache.org/jira/browse/ORC-616) In Patched Base encoding, the value of headerThirdByte goes beyond the range of byte
- [ORC-613](https://issues.apache.org/jira/browse/ORC-613) OrcMapredRecordReader mis-reuse struct object when actual children schema differs
- [ORC-610](https://issues.apache.org/jira/browse/ORC-610) Updated Copyright year in the NOTICE file

The following is release note.
- https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12346912

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing ORC tests and a newly added test case.

- The first commit is already tested in `hive-2.3` profile with both native ORC implementation and Hive 2.3 ORC implementation. (https://github.com/apache/spark/pull/28373#issuecomment-620265114)
- The latest run disables the test case in the `hive-1.2` profile, which doesn't use Apache ORC.
  - `hive-1.2`: https://github.com/apache/spark/pull/28373#issuecomment-620325906

Closes #28373 from dongjoon-hyun/SPARK-ORC-1.5.10.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-27 18:56:30 -07:00
Kent Yao 8dc2c0247b [SPARK-31522][SQL] Hive metastore client initialization related configurations should be static
### What changes were proposed in this pull request?
The HiveClient instance is cross-session, so the following configurations, which are defined in HiveUtils and used to create it, should be considered static:

1. spark.sql.hive.metastore.version - used to determine the Hive version in Spark
2. spark.sql.hive.metastore.jars - the location of the Hive metastore related jars, which Spark uses to create the Hive client
3. spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes - package names of classes that are shared or separated between SparkContextLoader and the Hive client class loader

These are used only once, when creating the Hive metastore client. They should be static in SQLConf so they are retrieved correctly, and we should prevent users from changing them with the SET/RESET commands.

As for spark.sql.hive.version, a fake of spark.sql.hive.metastore.version, it is used by the JDBC/Thrift client for backward compatibility.
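
A hedged illustration of the intended behavior after this change (exact error text omitted):

```scala
// Reading the value is still fine:
spark.conf.get("spark.sql.hive.metastore.version")
// But modifying it at runtime is now rejected, because the config is static:
// spark.sql("SET spark.sql.hive.metastore.version=2.3.7")      // expected to fail
// spark.conf.set("spark.sql.hive.metastore.version", "2.3.7")  // expected to fail
```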

### Why are the changes needed?

Bugfix: these configurations should not be changeable.

### Does this PR introduce any user-facing change?

Yes, the following set of configs is no longer allowed to be changed:
```
Seq("spark.sql.hive.metastore.version ",
      "spark.sql.hive.metastore.jars",
      "spark.sql.hive.metastore.sharedPrefixes",
      "spark.sql.hive.metastore.barrierPrefixes")
```
### How was this patch tested?

add unit test

Closes #28302 from yaooqinn/SPARK-31522.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-23 15:07:44 +00:00
Yuming Wang b11e42663b
[SPARK-31381][SPARK-29245][SQL] Upgrade built-in Hive 2.3.6 to 2.3.7
### What changes were proposed in this pull request?

**Hive 2.3.7** fixed these issues:
- HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 or newer
- HIVE-21980:Parsing time can be high in case of deeply nested subqueries
- HIVE-22249: Support Parquet through HCatalog

### Why are the changes needed?
Fix CCE during creating HiveMetaStoreClient in JDK11 environment: [SPARK-29245](https://issues.apache.org/jira/browse/SPARK-29245).

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

- [x] Test Jenkins with Hadoop 2.7 (https://github.com/apache/spark/pull/28148#issuecomment-616757840)
- [x] Test Jenkins with Hadoop 3.2 on JDK11 (https://github.com/apache/spark/pull/28148#issuecomment-616294353)
- [x] Manual test with remote hive metastore.

Hive side:

```
export JAVA_HOME=/usr/lib/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH
cd /usr/lib/hive-2.3.6 # Start Hive metastore with Hive 2.3.6
bin/schematool -dbType derby -initSchema --verbose
bin/hive --service metastore
```

Spark side:

```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
bin/spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083
```

Closes #28148 from wangyum/SPARK-31381.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-20 13:38:24 -07:00
yi.wu 40f9dbb628 [SPARK-31425][SQL][CORE] UnsafeKVExternalSorter/VariableLengthRowBasedKeyValueBatch should also respect UnsafeAlignedOffset
### What changes were proposed in this pull request?

Make `UnsafeKVExternalSorter` / `VariableLengthRowBasedKeyValueBatch` also respect `UnsafeAlignedOffset` when reading the record, and update some out-of-date comments.

### Why are the changes needed?

Since `BytesToBytesMap` respects `UnsafeAlignedOffset` when writing the record, `UnsafeKVExternalSorter` should also respect `UnsafeAlignedOffset` when reading the record from `BytesToBytesMap`; otherwise it causes a data correctness issue.

Unlike `UnsafeKVExternalSorter`, which may read records from `BytesToBytesMap`, `VariableLengthRowBasedKeyValueBatch` writes and reads records by itself. Thus, similar to #22053 and the [comment](https://github.com/apache/spark/pull/22053#issuecomment-411975239) there, the fix for `VariableLengthRowBasedKeyValueBatch` is more likely an improvement for the support of the SPARC platform.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested `HashAggregationQueryWithControlledFallbackSuite` with `UAO_SIZE=8` to simulate the SPARC platform. The tests only pass with this fix.

Closes #28195 from Ngone51/fix_uao.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 04:48:27 +00:00
yi.wu bbb3cd9c5e [SPARK-31391][SQL][TEST] Add AdaptiveTestUtils to ease the test of AQE
### What changes were proposed in this pull request?

This PR adds `AdaptiveTestUtils` to make AQE tests simpler; it includes:

`DisableAdaptiveExecution` - a test tag to skip a single test case if AQE is enabled.
`EnableAdaptiveExecutionSuite` - a helper trait to enable AQE for all tests except those tagged with `DisableAdaptiveExecution`.
`DisableAdaptiveExecutionSuite` - a helper trait to disable AQE for all tests.
`assertExceptionMessage` - a method to handle message of normal or AQE exception in a consistent way.
`assertExceptionCause` - a method to handle cause of normal or AQE exception in a consistent way.

### Why are the changes needed?

With these utils, we can:
- remove a lot of duplicated code;
- handle normal or AQE exceptions in a consistent way;
- improve the stability of AQE tests.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Updated tests with the util.

Closes #28162 from Ngone51/add_aqe_test_utils.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-13 14:40:53 +00:00
yi.wu 4de8ae1a0f [SPARK-31407][SQL][TEST] TestHiveQueryExecution should respect database when creating table
### What changes were proposed in this pull request?

In `TestHiveQueryExecution`, if we detect a database in the referenced table, we should create the table under that database.

### Why are the changes needed?

This fixes the test `Fix hive/SQLQuerySuite.derived from Hive query file: drop_database_removes_partition_dirs.q`, which currently only passes when we run it with the whole test suite but fails when run separately.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Run the test separately and together with the whole test suite.

Closes #28177 from Ngone51/fix_derived.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-13 19:04:36 +09:00
Eric Wu a28ed86a38
[SPARK-31113][SQL] Add SHOW VIEWS command
### What changes were proposed in this pull request?
Previously, users could issue `SHOW TABLES` to get info on both tables and views.
This PR (SPARK-31113) implements a `SHOW VIEWS` SQL command, similar to Hive's, to get views only (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).

**Hive** -- Only show view names
```
hive> SHOW VIEWS;
OK
view_1
view_2
...
```

**Spark(Hive-Compatible)** -- Only show view names, used in tests and `SparkSQLDriver` for CLI applications
```
SHOW VIEWS IN showdb;
view_1
view_2
...
```

**Spark** -- Show more information database/viewName/isTemporary
```
spark-sql> SHOW VIEWS;
userdb	view_1	false
userdb	view_2	false
...
```

### Why are the changes needed?
The `SHOW VIEWS` command provides better granularity for getting information about views only.

### Does this PR introduce any user-facing change?
Add new `SHOW VIEWS` SQL command

### How was this patch tested?
Add new test `show-views.sql` and pass existing tests

Closes #27897 from Eric5553/ShowViews.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-07 09:25:01 -07:00
Liang-Chi Hsieh d782a1c456 [SPARK-31224][SQL] Add view support to SHOW CREATE TABLE
### What changes were proposed in this pull request?

For now, the `SHOW CREATE TABLE` command doesn't support views, but `SHOW CREATE TABLE AS SERDE` does. Since the view syntax is largely the same between Hive DDL and Spark DDL, we should be able to support views in both commands.

This is Hive syntax for creating views:

```
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;
```

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateView

This is Spark syntax for creating views:

```
CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [IF NOT EXISTS [db_name.]view_name
    create_view_clauses
    AS query;
```
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-create-view.html

Looks like it is the same. We could support views in both commands.

This patch proposes to add views support to `SHOW CREATE TABLE`.

### Why are the changes needed?

To extend the view support of `SHOW CREATE TABLE`, so users can use `SHOW CREATE TABLE` to show Spark DDL for views.

### Does this PR introduce any user-facing change?

Yes. `SHOW CREATE TABLE` can be used to show Spark DDL for views.

### How was this patch tested?

Unit tests.

Closes #27984 from viirya/spark-view.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-06 05:34:59 +00:00
Jungtaek Lim (HeartSaVioR) 2a6aa8e87b [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper
### What changes were proposed in this pull request?

This patch proposes to cache the Class instance for the UDF instance in HiveFunctionWrapper, to fix the case where a Hive simple UDF is somehow transformed (the expression is copied) and evaluated later with another classloader (for the case where the current thread context classloader is somehow changed). In this case, Spark currently throws a ClassNotFoundException.

This only occurs for Hive simple UDFs, as HiveFunctionWrapper caches the UDF instance whereas it doesn't for the `UDF` type. The comment says Spark has to create an instance every time for UDF, so we cannot simply do the same. Instead, this patch caches the Class instance and switches the current thread context classloader to the one that loads the Class instance.
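
A hedged, simplified sketch of the caching idea described above (illustrative only, not the actual `HiveFunctionWrapper` implementation; the class and field names are made up):

```scala
class CachedUdfWrapper(functionClassName: String) extends Serializable {
  @transient private var udfClass: Class[_] = _

  def createFunction[T <: AnyRef](): T = {
    if (udfClass == null) {
      // Load once with the current context classloader and cache the Class instance.
      udfClass = Thread.currentThread().getContextClassLoader.loadClass(functionClassName)
    }
    // Make later instantiations use the classloader that actually loaded the class.
    Thread.currentThread().setContextClassLoader(udfClass.getClassLoader)
    udfClass.getConstructor().newInstance().asInstanceOf[T]
  }
}
```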

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regressions for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the test boundary.

Credit to cloud-fan as he discovered the problem and proposed the solution.

### Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UTs added.

Closes #28079 from HeartSaVioR/SPARK-31312.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 16:17:26 +00:00
beliefer 47c810f8ae [SPARK-31279][SQL][DOC] Add version information to the configuration of Hive
### What changes were proposed in this pull request?
Add version information to the configuration of `Hive`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.hive.metastore.version | 1.4.0 | SPARK-6908 | 05454fd8aef75b129cbbd0288f5089c5259f4a15#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.version | 1.1.1 | SPARK-3971 | 64945f868443fbc59cb34b34c16d782dda0fb63d#diff-12fa2178364a810b3262b30d8d48aa2d |  
spark.sql.hive.metastore.jars | 1.4.0 | SPARK-6908 | 05454fd8aef75b129cbbd0288f5089c5259f4a15#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertMetastoreParquet | 1.1.1 | SPARK-2406 | cc4015d2fa3785b92e6ab079b3abcf17627f7c56#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertMetastoreParquet.mergeSchema | 1.3.1 | SPARK-6575 | 778c87686af0c04df9dfe144b8f744f271a988ad#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertMetastoreOrc | 2.0.0 | SPARK-14070 | 1e886159849e3918445d3fdc3c4cef86c6c1a236#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertInsertingPartitionedTable | 3.0.0 | SPARK-28573 | d5688dc732890923c326f272b0c18c329a69459a#diff-842e3447fc453de26c706db1cac8f2c4 |  
spark.sql.hive.convertMetastoreCtas | 3.0.0 | SPARK-25271 | 5ad03607d1487e7ab3e3b6d00eef9c4028ed4975#diff-842e3447fc453de26c706db1cac8f2c4 |  
spark.sql.hive.metastore.sharedPrefixes | 1.4.0 | SPARK-7491 | a8556086d33cb993fab0ae2751e31455e6c664ab#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.metastore.barrierPrefixes | 1.4.0 | SPARK-7491 | a8556086d33cb993fab0ae2751e31455e6c664ab#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.thriftServer.async | 1.5.0 | SPARK-6964 | eb19d3f75cbd002f7e72ce02017a8de67f562792#diff-ff50aea397a607b79df9bec6f2a841db |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Exists UT

Closes #28042 from beliefer/add-version-to-hive-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 12:35:01 +09:00
gatorsmile b9eafcb526 [SPARK-31088][SQL] Add back HiveContext and createExternalTable
### What changes were proposed in this pull request?
Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html), this PR is to add back the following APIs, whose maintenance costs are relatively small.

- HiveContext
- createExternalTable APIs

### Why are the changes needed?

Avoid breaking the APIs that are commonly used.

### Does this PR introduce any user-facing change?
Adding back the APIs that were removed in 3.0 branch does not introduce the user-facing changes, because Spark 3.0 has not been released.

### How was this patch tested?

add a new test suite for createExternalTable APIs.

Closes #27815 from gatorsmile/addAPIsBack.

Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-03-26 23:51:15 -07:00
DB Tsai cb0db21373 [SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL][TEST-HIVE1.2] Nested Column Predicate Pushdown for Parquet
### What changes were proposed in this pull request?
1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions.
2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet.

### Why are the changes needed?
Better performance for handling nested predicate pushdown.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New tests are added.

Closes #27728 from dbtsai/SPARK-17636.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-27 14:28:57 +08:00
Kent Yao 8be16907c2 [SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
### What changes were proposed in this pull request?

In the Spark CLI, we create a Hive `CliSessionState` and it does not load `hive-site.xml`. So the configurations in `hive-site.xml` will not take effect as they do in other Spark-Hive integration apps.

Also, the warehouse directory is not correctly picked. If the `default` database does not exist, the `CliSessionState` will create one the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehouse.dir` nor the user-specified value of `hive.metastore.warehouse.dir`, but the default value of `hive.metastore.warehouse.dir`, which is always `/user/hive/warehouse`.

This PR fixes CLiSuite failure with the hive-1.2 profile in https://github.com/apache/spark/pull/27933.

In https://github.com/apache/spark/pull/27933, we fixed the issue in the JIRA by deciding the warehouse dir using all properties from the Spark conf and Hadoop conf, but properties from `--hiveconf` were not included; they are applied to the `CliSessionState` instance after it is initialized. When this command-line option key is `hive.metastore.warehouse.dir`, the actual warehouse dir is overridden. Because the logic in Hive for creating the non-existing default database changed, that test passed with `Hive 2.3.6` but failed with `1.2`. So in this PR, Hadoop/Hive configurations are ordered by:
`spark.hive.xxx > spark.hadoop.xxx > --hiveconf xxx > hive-site.xml` through `SharedState.loadHiveConfFile` before the session state starts.

### Why are the changes needed?

Bugfix for the Spark SQL CLI to pick up the right confs.

### Does this PR introduce any user-facing change?

Yes:
1. The non-existent default database will be created in the location specified by the user via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or in the default value of `spark.sql.warehouse.dir` if neither of them is specified.

2. Configurations from `hive-site.xml` will not override command-line options or the properties defined with the `spark.hadoop.`/`spark.hive.` prefix in the Spark conf.

### How was this patch tested?

add cli ut

Closes #27969 from yaooqinn/SPARK-31170-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-27 12:05:45 +08:00
Maxim Gekk d72ec85741
[SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource
### What changes were proposed in this pull request?

This PR (SPARK-31238) aims at the following:
1. Modified the ORC vectorized reader, in particular OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils.rebaseJulianToGregorianDays()` added by https://github.com/apache/spark/pull/27915. The method rebases days from the hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts the date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulting local date.
2. Introduced rebasing of dates while saving ORC files; in particular, I modified `OrcShimUtils.getDateWritable` v1.2 and v2.3 to return `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR https://github.com/apache/spark/pull/27890 (and fixed by https://github.com/apache/spark/pull/27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in the ORC datasource.

### Why are the changes needed?
For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by a previous version and get the same result.

### Does this PR introduce any user-facing change?
Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following:
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-08|
+----------+
```
After the changes
```scala
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```

### How was this patch tested?
- By running `OrcSourceSuite` and `HiveOrcSourceSuite`.
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc")
scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false)
+----------+
|dt        |
+----------+
|1200-01-01|
+----------+
```
- Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write.

Closes #28016 from MaxGekk/rebase-date-orc.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-26 13:14:28 -07:00
Wenchen Fan 4f274a4de9
[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables
### What changes were proposed in this pull request?

Spark introduced the CHAR type for Hive compatibility, but it only works for Hive tables. The CHAR type was never documented and is treated as the STRING type for non-Hive tables.

However, this leads to confusing behaviors:

**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a 	2
```

**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a  	3
```

According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING, Spark doesn't guarantee it.

This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

### Why are the changes needed?

avoid confusing/wrong behavior

### Does this PR introduce any user-facing change?

yes, now users can't create/alter non-Hive tables with CHAR type.

### How was this patch tested?

new tests

Closes #27902 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-25 09:25:55 -07:00
Maxim Gekk 1fd4607d84 [SPARK-31221][SQL] Rebase any date-times in conversions to/from Java types
### What changes were proposed in this pull request?
In the PR, I propose to apply rebasing for all dates/timestamps in the conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed by building a local date-time in the original calendar, extracting date-time fields from the result, and creating a new local date-time in the target calendar.
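
A heavily simplified, hedged sketch of that rebasing idea for dates (it ignores eras/BCE years and time zones other than UTC, and is not the actual `DateTimeUtils` code):

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Interpret the day count in the hybrid Julian+Gregorian calendar (what java.sql.Date
// uses), pull out the local year/month/day, and rebuild the same local date in the
// Proleptic Gregorian calendar used by Catalyst.
def rebaseJulianToGregorianDays(julianDays: Int): Int = {
  val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  cal.setTimeInMillis(julianDays.toLong * 24L * 60L * 60L * 1000L)
  val rebased = LocalDate.of(
    cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1, cal.get(Calendar.DAY_OF_MONTH))
  Math.toIntExact(rebased.toEpochDay)
}
```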

### Why are the changes needed?
The changes are needed to be compatible with previous Spark versions (2.4.5 and earlier), not only for dates before the Gregorian cutover date `1582-10-15` but also for dates after it. For instance, the Gregorian calendar implementation in Java 7, `java.util.GregorianCalendar`, is not as accurate in resolving time zone offsets as the Gregorian calendar introduced in Java 8.

### Does this PR introduce any user-facing change?
Yes, this PR can introduce behavior changes for dates after `1582-10-15`; in particular, conversions of zone ids to zone offsets will be much more accurate.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.

Closes #27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 21:14:25 +08:00
LantaoJin 929b794e25
[SPARK-30494][SQL] Fix cached data leakage during replacing an existing view
### What changes were proposed in this pull request?

The cached RDD for the plan "select 1" stays in memory forever until the session closes. This cached data cannot be used since the view temp1 has been replaced by another plan. It's a memory leak.

We can reproduce it with the commands below:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create or replace temporary view temp1 as select 1")
scala> spark.sql("cache table temp1")
scala> spark.sql("create or replace temporary view temp1 as select 1, 2")
scala> spark.sql("cache table temp1")
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined)
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isDefined)
```

### Why are the changes needed?
Fix the memory leak, especially for long-running mode.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Add an unit test.

Closes #27185 from LantaoJin/SPARK-30494.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-22 22:22:13 -07:00
Eric Wu 3a48ea1fe0
[SPARK-31184][SQL] Support getTablesByType API of Hive Client
### What changes were proposed in this pull request?
Hive 2.3+ supports the `getTablesByType` API, which provides an efficient way to get Hive tables of a specific type. Currently, we have the following mappings when using `HiveExternalCatalog`:
```
CatalogTableType.EXTERNAL  =>  HiveTableType.EXTERNAL_TABLE
CatalogTableType.MANAGED => HiveTableType.MANAGED_TABLE
CatalogTableType.VIEW => HiveTableType.VIRTUAL_VIEW
```
Without this API, we need to achieve the goal by `getTables` + `getTablesByName` + `filter with type`.

This PR adds `getTablesByType` in `HiveShim`. For Hive versions that don't support this API, an `UnsupportedOperationException` will be thrown, and the upper-level logic should catch the exception and fall back to the filter solution mentioned above.
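
A hedged, pseudo-Scala sketch of that fallback (`client` stands for a Hive client handle; the method names follow the description above, but treat the exact signatures as assumptions):

```scala
// Prefer the new API; on older Hive versions, list everything and filter by type.
def listHiveViews(db: String, pattern: String): Seq[CatalogTable] =
  try {
    client.getTablesByType(db, pattern, CatalogTableType.VIEW)
  } catch {
    case _: UnsupportedOperationException =>
      val names = client.listTables(db, pattern)
      client.getTablesByName(db, names).filter(_.tableType == CatalogTableType.VIEW)
  }
```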

Since the JDK11-related fix in `Hive` has not been released yet, manual tests against Hive 2.3.7-SNAPSHOT were done by following the instructions of SPARK-29245.

### Why are the changes needed?
This API provides better usability and performance when we want to get a list of Hive tables of a specific type, for example `HiveTableType.VIRTUAL_VIEW` corresponding to `CatalogTableType.VIEW`.

### Does this PR introduce any user-facing change?
No, this is a support function.

### How was this patch tested?
Add tests in VersionsSuite and manually run JDK11 test with following settings:

- Hive 2.3.6 Metastore on JDK8
- Hive 2.3.7-SNAPSHOT library build from source of Hive 2.3 branch
- Spark build with Hive 2.3.7-SNAPSHOT on jdk-11.0.6

Closes #27952 from Eric5553/GetTableByType.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 17:41:23 -07:00
Maxim Gekk 6a668763b8 [SPARK-31195][SQL] Correct and reuse days rebase functions of DateTimeUtils in DaysWritable
### What changes were proposed in this pull request?
In the PR, I propose to correct and re-use functions from `DateTimeUtils` for rebasing days before the cutover day `1582-10-15` in `org.apache.spark.sql.hive.DaysWritable`.

### Why are the changes needed?
0. Existing rebasing of days in `DaysWritable` is not correct.
1. To deduplicate code in `DaysWritable`
2. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5

### Does this PR introduce any user-facing change?
This PR can introduce a behavior change because the replaced code differs from the re-used code from `DateTimeUtils`.

### How was this patch tested?
By existing test suite, for instance `HiveOrcHadoopFsRelationSuite`.

Closes #27962 from MaxGekk/reuse-rebase-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-20 15:57:21 +09:00
Dongjoon Hyun f1cc86792f [SPARK-31181][SQL][TESTS] Remove the default value assumption on CREATE TABLE test cases
### What changes were proposed in this pull request?

A few `CREATE TABLE` test cases make some assumptions about the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit on the test-case side.

The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR.

### Why are the changes needed?

This makes our test cases more robust with respect to the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even if we switch the conf value, that will be a one-liner with no test case changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 12:28:57 +08:00
Dongjoon Hyun c6a6d5e006
Revert "[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir"
This reverts commit 5bc0d76591.

Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-19 16:08:51 -07:00
Kent Yao 5bc0d76591 [SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
### What changes were proposed in this pull request?

In the Spark CLI, we create a Hive `CliSessionState` that does not load `hive-site.xml`. So the configurations in `hive-site.xml` will not take effect as they do in other Spark/Hive integration apps.

Also, the warehouse directory is not correctly picked up. If the `default` database does not exist, the `CliSessionState` will create one the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehouse.dir` nor the user-specified value of `hive.metastore.warehouse.dir`, but the default value of `hive.metastore.warehouse.dir`, which is always `/user/hive/warehouse`.

### Why are the changes needed?

Fix a bug so that the Spark SQL CLI picks up the right configurations.

### Does this PR introduce any user-facing change?

Yes, the non-existent default database will be created in the location specified by the user via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or at the default value of `spark.sql.warehouse.dir` if neither is specified.

### How was this patch tested?

Added a CLI unit test.

Closes #27933 from yaooqinn/SPARK-31170.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 23:03:18 +08:00
Maxim Gekk 57854c736c [SPARK-31076][SQL][FOLLOWUP] Encapsulate date rebasing to DaysWritable
### What changes were proposed in this pull request?
Move the code related to days rebasing from/to Julian calendar from `HiveInspectors` to new class `DaysWritable`.

### Why are the changes needed?
To improve maintainability of the `HiveInspectors` trait which is already pretty complex.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `HiveOrcHadoopFsRelationSuite`.

Closes #27890 from MaxGekk/replace-DateWritable-by-DaysWritable.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-16 17:06:15 +08:00
Kent Yao 18f2730874 [SPARK-31066][SQL][TEST-HIVE1.2] Disable useless and uncleaned hive SessionState initialization parts
### What changes were proposed in this pull request?

As a common usage and according to the Spark docs, users often just copy their `hive-site.xml` to Spark directly from Hive projects. Sometimes, the config file is not that clean for Spark and may cause side effects.

For example, `hive.session.history.enabled` will create a log file for Hive jobs that is useless for Spark, and the file is not deleted on JVM exit.

This PR:
1) disables `hive.session.history.enabled` explicitly so that no `hive_job_log` file is created, e.g.
```
Hive history file=/var/folders/01/h81cs4sn3dq2dd_k4j6fhrmc0000gn/T//kentyao/hive_job_log_79c63b29-95a4-4935-a9eb-2d89844dfe4f_493861201.txt
```
2) sets `hive.execution.engine` to `spark` explicitly in case the config is `tez`, which would otherwise cause unnecessary problems like the one below (a sketch of both overrides follows the stack trace):

```
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
```
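
A sketch of the two overrides applied to a raw `HiveConf`; where exactly Spark applies them is assumed here for illustration only.

```scala
import org.apache.hadoop.hive.conf.HiveConf

def disableUselessHiveSettings(hiveConf: HiveConf): Unit = {
  hiveConf.setBoolean("hive.session.history.enabled", false) // no hive_job_log_* file is created
  hiveConf.set("hive.execution.engine", "spark")             // avoid touching Tez classes
}
```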

### Why are the changes needed?

Reduce internal complexity and the Hive-related cognitive load on users running Spark.

### Does this PR introduce any user-facing change?

Yes, the `hive_job_log` file will not be created even if the config is enabled, and Spark will not try to initialize Tez-related machinery.

### How was this patch tested?

Added a unit test and verified manually.

Closes #27827 from yaooqinn/SPARK-31066.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-12 18:13:52 +08:00
Maxim Gekk 3d3e366aa8 [SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
### What changes were proposed in this pull request?
In the PR, I propose to change the conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before the cutover day `1582-10-15` of the Gregorian calendar. I propose to construct a local date-time from the microseconds/days since the epoch, take each date-time component (`year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`), and construct java.sql.Timestamp/Date from the extracted components.
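
A minimal sketch of the idea for the DATE case (the actual change also covers timestamps): take the local date of the Catalyst value and build `java.sql.Date` from its components rather than converting the raw offset directly.

```scala
import java.sql.Date
import java.time.LocalDate

def daysToSqlDate(daysSinceEpoch: Int): Date = {
  val localDate = LocalDate.ofEpochDay(daysSinceEpoch) // Proleptic Gregorian local date
  Date.valueOf(localDate)                              // same year/month/day in the hybrid calendar
}
```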

### Why are the changes needed?
This rebases the underlying time/date offset so that collected java.sql.Timestamp/Date values have the same local date-time components as the original values in the Gregorian calendar.

Here is the example which demonstrates the issue:
```scala
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```

### Does this PR introduce any user-facing change?
Yes, after the changes:
```scala
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```

### How was this patch tested?
By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27807 from MaxGekk/rebase-timestamp-before-1582.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 20:53:56 +08:00
Burak Yavuz 2e3adadc6a [SPARK-31061][SQL] Provide ability to alter the provider of a table
### What changes were proposed in this pull request?

This PR adds functionality to HiveExternalCatalog to be able to change the provider of a table.

### Why are the changes needed?

This is useful for catalogs in Spark 3.0 to be able to use alterTable to change the provider of a table as part of an atomic REPLACE TABLE function.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes #27822 from brkyvz/externalCat.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 23:42:07 -08:00
Terry Kim 66b4fd040e [SPARK-31024][SQL] Allow specifying session catalog name spark_catalog in qualified column names for v1 tables
### What changes were proposed in this pull request?

Currently, the user cannot specify the session catalog name (`spark_catalog`) in qualified column names for v1 tables:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```
fails with `cannot resolve 'spark_catalog.default.t.i'`.

This is inconsistent with v2 table behavior where catalog name can be used:
```
SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl
```

This PR proposes to fix the inconsistency and allow the user to specify session catalog name in column names for v1 tables.

### Why are the changes needed?

Fixing an inconsistent behavior.

### Does this PR introduce any user-facing change?

Yes, now the following query works:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```

### How was this patch tested?

Added new tests.

Closes #27776 from imback82/spark_catalog_col_name_resolution.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 18:33:59 +08:00
Kent Yao 1fac06c430 Revert "[SPARK-30808][SQL] Enable Java 8 time API in Thrift server"
This reverts commit afaeb29599.

### What changes were proposed in this pull request?

Based on the result and comment from https://github.com/apache/spark/pull/27552#discussion_r385531744

In the hive module, the server side produces datetime values simply with `value.toString`, and the client side regenerates the results in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`.
There will be an inconsistency between client and server if we use the Java 8 APIs.

### Why are the changes needed?

The impact of the reverted change is still not clear enough.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Nah

Closes #27733 from yaooqinn/SPARK-30808.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-03 14:21:20 +08:00
Josh Rosen f0010c81e2 [SPARK-31003][TESTS] Fix incorrect uses of assume() in tests
### What changes were proposed in this pull request?

This patch fixes several incorrect uses of `assume()` in our tests.

If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable).

In contrast, `assert(condition)` will fail the test if the condition doesn't hold.

If `assume()` is accidentally substituted for `assert()`, then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test.

This patch fixes several such cases, replacing certain `assume()` calls with `assert()`.
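
For illustration, the difference in ScalaTest looks roughly like this; the suite, the prerequisite check and the computation are made up, and the `FunSuite` import name depends on the ScalaTest version in use.

```scala
import org.scalatest.funsuite.AnyFunSuite

class PrerequisiteSuite extends AnyFunSuite {
  def pythonAvailable: Boolean = sys.env.contains("PYSPARK_PYTHON") // hypothetical prerequisite
  def compute(): Int = 42                                           // hypothetical computation

  test("assume skips, assert fails") {
    assume(pythonAvailable)  // false => the test is reported as canceled/skipped
    assert(compute() == 42)  // false => the test is reported as failed
  }
}
```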

Credit to ahirreddy for spotting this problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27754 from JoshRosen/fix-assume-vs-assert.

Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-02 15:20:45 -08:00
yi.wu a1d2ce90b0 [SPARK-30972][SQL] PruneHiveTablePartitions should be executed as earlyScanPushDownRules
### What changes were proposed in this pull request?

Make the rule `PruneHiveTablePartitions` execute as one of the `earlyScanPushDownRules`.

### Why are the changes needed?

Similar to the rule `PruneFileSourcePartitions`, `PruneHiveTablePartitions` should also be executed as an earlyScanPushDownRule so that it does not affect the statistics computed later.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27723 from Ngone51/early_hive_prune.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 11:50:20 +08:00
Takeshi Yamamuro 314442a3d5 [SQL][MINOR][TESTS] Remove GivenWhenThen trait from HiveComparisonTest
### What changes were proposed in this pull request?

This PR removes the unused trait `GivenWhenThen` from `HiveComparisonTest`.

### Why are the changes needed?

For better code.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #27726 from maropu/MINOR-20200228.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-27 17:19:27 -08:00
yi.wu 22dfd15a45 [SPARK-30937][DOC] Group Hive upgrade guides together
### What changes were proposed in this pull request?

This PR groups all hive upgrade related migration guides inside Spark 3.0 together.

Also adds another behavior change of `ScriptTransform` to the new Hive section.

### Why are the changes needed?

Make the doc clearer to users.

### Does this PR introduce any user-facing change?

No, new doc for Spark 3.0.

### How was this patch tested?

N/A.

Closes #27670 from Ngone51/hive_migration.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 21:29:42 +08:00
HyukjinKwon cd3ef2249f [SPARK-30906][SQL][TESTS][FOLLOW-UP] Set the configuration against TestHive explicitly in HiveSerDeSuite
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/27659 (see https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/253/), the test below fails consistently, specifically in one Jenkins job, https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/:

```
org.apache.spark.sql.hive.execution.HiveSerDeSuite.Test the default fileformat for Hive-serde tables
```

The profile is the same as the PR builder's, but it seems to fail specifically on this machine.

Several configurations used in `HiveSerDeSuite` are not being set, presumably due to the inconsistency between `SQLConf.get` and the active Spark session described in https://github.com/apache/spark/pull/27387, and as a side effect of the cloned session introduced in https://github.com/apache/spark/pull/27659.

This PR proposes to explicitly set the configuration against `TestHive` by using `withExistingConf` in `withSQLConf`.

### Why are the changes needed?

To make `spark-master-test-sbt-hadoop-2.7-hive-2.3` job pass.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Cannot reproduce locally, and presumably it cannot be reproduced in the PR builder either. We should see if the tests pass in the `spark-master-test-sbt-hadoop-2.7-hive-2.3` job after this PR is merged.

Closes #27705 from HyukjinKwon/SPARK-30906.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-26 18:01:26 -08:00
HyukjinKwon 020b2622e5 [SPARK-30798][SQL][TESTS][FOLLOW-UP] Set the configuration against the current session explicitly in HiveShowCreateTableSuite
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/27387 (see https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/202/), the tests below fail consistently, specifically in one Jenkins job, https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/:

```
org.apache.spark.sql.hive.HiveShowCreateTableSuite.simple hive table
org.apache.spark.sql.hive.HiveShowCreateTableSuite.simple external hive table
org.apache.spark.sql.hive.HiveShowCreateTableSuite.hive bucketing is supported
```

The profile is the same as the PR builder's, but it seems to fail specifically on this machine. It seems the legacy configuration `spark.sql.legacy.createHiveTableByDefault.enabled` is not being set due to the inconsistency between `SQLConf.get` and the active Spark session as described in https://github.com/apache/spark/pull/27387.

This PR proposes to explicitly set the configuration against the session used instead of `SQLConf.get`.

### Why are the changes needed?

To make `spark-master-test-sbt-hadoop-2.7-hive-2.3` job pass.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Cannot reproduce locally, and presumably it cannot be reproduced in the PR builder either. We should see if the tests pass in the `spark-master-test-sbt-hadoop-2.7-hive-2.3` job after this PR is merged.

Closes #27703 from HyukjinKwon/SPARK-30798-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-26 20:48:43 +09:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
Terry Kim 0fd4fa70c8 [SPARK-30885][SQL] V1 table name should be fully qualified if catalog name is provided
### What changes were proposed in this pull request?

For the following:
```
CREATE TABLE t USING json AS SELECT 1 AS i
SELECT * FROM spark_catalog.t
```
`spark_catalog.t` is resolved to `spark_catalog.default.t` assuming the current namespace is `default`. However, this is not consistent with V2 behavior where the namespace must be specified if the catalog name is provided. This PR proposes to fix this inconsistency.

### Why are the changes needed?

To be consistent with V2 table naming scheme in SQL commands.

### Does this PR introduce any user-facing change?

Yes, now the user has to specify the namespace if the catalog name is provided. For example,
```
SELECT * FROM spark_catalog.t # Will throw AnalysisException with 'Session catalog cannot have an empty namespace: spark_catalog.t'
SELECT * FROM spark_catalog.default.t # OK
```

### How was this patch tested?

Added new tests

Closes #27642 from imback82/disallow_spark_catalog_wihtout_db.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-25 13:04:28 +08:00
lijunqing 31bfbaeb90 [SPARK-30868][SQL] Throw Exception if runHive(sql) failed
### Why are the changes needed?
At present, HiveClientImpl.runHive does not throw an exception when a statement fails, so errors are not reported back properly.
Example
```scala
spark.sql("add jar file:///tmp/not_exists.jar")
spark.sql("show databases").show()
```
/tmp/not_exists.jar doesn't exist, so the ADD JAR command fails. However, this code runs to completion without causing an application failure.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added new suite tests.

Closes #27644 from stczwd/SPARK-30868.

Authored-by: lijunqing <lijunqing@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 15:42:45 +08:00
yi.wu 25f5bfaa6e [SPARK-30903][SQL] Fail fast on duplicate columns when analyze columns

### What changes were proposed in this pull request?

Add a new `CommandCheck` rule and fail fast when duplicate columns are detected in `AnalyzeColumnCommand`.
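
A hedged example of the new behavior; the table and column names are made up.

```scala
spark.sql("CREATE TABLE t (id INT, name STRING) USING parquet")
// With the new check, the duplicate column fails fast at analysis time instead of having
// its statistics computed twice:
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS id, id")
```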

### Why are the changes needed?

To avoid duplicate statistics computation for the same column in `AnalyzeColumnCommand`.

### Does this PR introduce any user-facing change?

Yes. Users now get an exception when they pass duplicate columns.

### How was this patch tested?

Added new test.

Closes #27651 from Ngone51/fail_on_dup_cols.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-23 09:52:54 +09:00
LantaoJin c0715221b2 [SPARK-30785][SQL] Create table like should keep tracksPartitionsInCatalog same with source table
### What changes were proposed in this pull request?
A table generated by `CREATE TABLE LIKE` on a partitioned table is itself a partitioned table. But running `ALTER TABLE ADD PARTITION` on it throws `AnalysisException: ALTER TABLE ADD PARTITION is not allowed`. That's because `tracksPartitionsInCatalog` is always false for tables created via `CREATE TABLE LIKE`.
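
A hedged reproduction of the scenario, assuming a Hive-enabled session; the table names and the partition value are illustrative.

```scala
spark.sql("CREATE TABLE src (id INT) PARTITIONED BY (dt STRING) STORED AS PARQUET")
spark.sql("CREATE TABLE dst LIKE src")
// Before this fix the next statement failed with
//   AnalysisException: ALTER TABLE ADD PARTITION is not allowed
// because dst did not track its partitions in the catalog.
spark.sql("ALTER TABLE dst ADD PARTITION (dt = '20200219')")
```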

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add a unit test.

Closes #27538 from LantaoJin/SPARK-30785.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-19 15:05:34 +08:00
Maxim Gekk afaeb29599 [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
### What changes were proposed in this pull request?
- Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call.
- Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting.

### Why are the changes needed?
Because the textual representation of timestamps/dates before the year 1582 is incorrect:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
It must be 1001-01-01 00:**00:00**.

### Does this PR introduce any user-facing change?
Yes. After the changes:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```

### How was this patch tested?
By running the hive-thriftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```

Closes #27552 from MaxGekk/hive-thriftserver-java8-time-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 02:15:44 +08:00
yi.wu a1d536cb3e [SPARK-15616][FOLLOW-UP][SQL] Sub Optimizer should include super.postHocOptimizationBatches
### What changes were proposed in this pull request?

Let the sub optimizer's `postHocOptimizationBatches` also include the super class's `postHocOptimizationBatches`.
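
The contract can be illustrated with a generic, simplified sketch in which plain strings stand in for the optimizer's `Batch` objects.

```scala
class BaseOptimizer {
  def postHocOptimizationBatches: Seq[String] = Seq("base post-hoc batch")
}

class SubOptimizer extends BaseOptimizer {
  // Include the parent's post-hoc batches instead of silently dropping them.
  override def postHocOptimizationBatches: Seq[String] =
    super.postHocOptimizationBatches :+ "sub-optimizer post-hoc batch"
}
```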

### Why are the changes needed?

It's necessary according to the design of the Catalyst optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass jenkins.

Closes #27607 from Ngone51/spark_15616_followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 01:52:23 +08:00
Yuming Wang 76ddb6d835 [SPARK-30755][SQL] Update migration guide and add actionable exception for HIVE-15167
### What changes were proposed in this pull request?
[HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) removed the `SerDe` interface. This may break custom `SerDe` builds for Hive 1.2. This PR updates the migration guide for this change.

### Why are the changes needed?

Otherwise:
```
2020-01-27 05:11:20.446 - stderr> 20/01/27 05:11:20 INFO DAGScheduler: ResultStage 2 (main at NativeMethodAccessorImpl.java:0) failed in 1.000 s due to Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 13, 10.110.21.210, executor 1): java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/SerDe
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.defineClass1(Native Method)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
  2020-01-27 05:11:20.446 - stderr>  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
  2020-01-27 05:11:20.446 - stderr>  at java.security.AccessController.doPrivileged(Native Method)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  2020-01-27 05:11:20.446 - stderr>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.loadClass(ClassLoader.java:405)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.Class.forName0(Native Method)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.Class.forName(Class.java:348)
  2020-01-27 05:11:20.446 - stderr>  at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:76)
.....
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manual test

Closes #27492 from wangyum/SPARK-30755.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-17 09:26:56 -08:00
Wenchen Fan ab07c6300c [SPARK-30799][SQL] "spark_catalog.t" should not be resolved to temp view
### What changes were proposed in this pull request?

No v2 command supports temp views and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption.

However, `ResolveSessionCatalog` needs to fallback to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we add a hack in `CatalogAndIdentifier`, which does not expand the given identifier with current namespace if the catalog is session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to temp views correctly, or resolve it to table "default.t" if temp view doesn't exist.

However, if users write `spark_catalog.t`, it shouldn't be resolved to temp views as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.

This PR proposes to fix this issue by
1. remove the hack in `CatalogAndIdentifier`, and clearly document that this shouldn't be used to resolve temp views.
2. update `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views.

### Why are the changes needed?

To avoid releasing a behavior that we should not support.

Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

### Does this PR introduce any user-facing change?

yes, now it's not allowed to refer to a temp view with `spark_catalog` prefix.

### How was this patch tested?

new tests

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 12:07:46 +08:00
yi.wu 99b8136a86 [SPARK-25990][SQL] ScriptTransformation should handle different data types correctly
### What changes were proposed in this pull request?

We should convert Spark InternalRows to hive data via `HiveInspectors.wrapperFor`.

### Why are the changes needed?

We may hit the exception below without this change:

```
[info]    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 192.168.1.6, executor driver): java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[info]   	at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:55)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:321)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:292)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:247)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:231)
[info]   	at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:55)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$2(ScriptTransformationExec.scala:300)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$2$adapted(ScriptTransformationExec.scala:281)
[info]   	at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$1(ScriptTransformationExec.scala:281)
[info]   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:270)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new test. Note that this test returns different results on Hive 1.2 and Hive 2.3 due to a `HiveDecimal` or `SerDe` difference (the root cause is not known yet).

Closes #27556 from Ngone51/script_transform.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-14 16:52:28 +08:00
Ali Afroozeh e2d3983de7 [SPARK-30798][SQL] Scope Session.active in QueryExecution
### What changes were proposed in this pull request?

This PR scopes `SparkSession.active` to prevent problems with processing queries with possibly different spark sessions (and different configs). A new method, `withActive` is introduced on `SparkSession` that restores the previous spark session after the block of code is executed.
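
A minimal sketch of the idea behind `withActive`; the shape is assumed for illustration, and the real method lives on `SparkSession` itself.

```scala
import org.apache.spark.sql.SparkSession

def withActive[T](session: SparkSession)(block: => T): T = {
  val previous = SparkSession.getActiveSession // remember the current thread's active session
  SparkSession.setActiveSession(session)
  try block
  finally previous match {
    case Some(s) => SparkSession.setActiveSession(s)
    case None    => SparkSession.clearActiveSession()
  }
}
```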

### Why are the changes needed?
`SparkSession.active` is a thread local variable that points to the current thread's spark session. It is important to note that the `SQLConf.get` method depends on `SparkSession.active`. In the current implementation it is possible that `SparkSession.active` points to a different session which causes various problems. Most of these problems arise because part of the query processing is done using the configurations of a different session. For example, when creating a data frame using a new session, i.e., `session.sql("...")`, part of the data frame is constructed using the currently active spark session, which can be a different session from the one used later for processing the query.

### Does this PR introduce any user-facing change?
The `withActive` method is introduced on `SparkSession`.

### How was this patch tested?
Unit tests (to be added)

Closes #27387 from dbaliafroozeh/UseWithActiveSessionInQueryExecution.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-02-13 23:58:55 +01:00
fuwhu f1d0dce484 [MINOR][DOC] Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions
### What changes were proposed in this pull request?
Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions.

### Why are the changes needed?
To describe these two classes.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
no

Closes #27535 from fuwhu/SPARK-15616-FOLLOW-UP.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-11 22:16:44 +08:00
Terry Kim c27a616450 [SPARK-30612][SQL] Resolve qualified column name with v2 tables
### What changes were proposed in this pull request?

This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables.

This PR would allow qualified column names in query as following:
```SQL
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
```

### Why are the changes needed?

This is a bug because you cannot qualify column names in queries.

### Does this PR introduce any user-facing change?

Yes, now users can qualify column names for v2 tables.

### How was this patch tested?

Added new tests.

Closes #27391 from imback82/qualified_col.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 13:54:17 +08:00
fuwhu 47659a0675 [SPARK-30525][SQL] HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
### What changes were proposed in this pull request?
HiveTableScanExec no longer prunes partitions again after SessionCatalog.listPartitionsByFilter is called.

### Why are the changes needed?
HiveTableScanExec pushes partition pruning down to the Hive metastore if spark.sql.hive.metastorePartitionPruning is true, and then prunes the returned partitions again using the partition filters, because some predicates, e.g. "b like 'xyz'", are not supported by the Hive metastore. This problem is already fixed in HiveExternalCatalog.listPartitionsByFilter, which now returns exactly what we want, so it is no longer necessary to prune twice in HiveTableScanExec.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Existing unit tests.

Closes #27232 from fuwhu/SPARK-30525.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-04 01:24:53 +08:00
Liang-Chi Hsieh 8eecc20b11 [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
## What changes were proposed in this pull request?

This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table.

For the original `SHOW CREATE TABLE`, it now always shows Spark DDL. If given a Hive table, it tries to generate the corresponding Spark DDL.
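
A hedged usage example, assuming a Hive-enabled session and an illustrative table name.

```scala
spark.sql("CREATE TABLE hive_t (id INT) STORED AS ORC")
spark.sql("SHOW CREATE TABLE hive_t").show(truncate = false)           // Spark DDL
spark.sql("SHOW CREATE TABLE hive_t AS SERDE").show(truncate = false)  // Hive DDL for the Hive table
```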

For Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If it can't find a mapping there, it throws an analysis exception for the unsupported serde configuration.

Arguably, some Hive fileformat + row serde combinations could be mapped to a Spark data source, e.g., CSV. That is not included in this PR; to be conservative, it may not be supported.

As for Hive serde properties, this PR doesn't save them to Spark DDL for now, because it may not be useful to keep Hive serde properties in a Spark table.

## How was this patch tested?

Added test.

Closes #24938 from viirya/SPARK-27946.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-31 19:55:25 -08:00
Maxim Gekk 2d4b5eaee4 [SPARK-30676][CORE][TESTS] Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double
### What changes were proposed in this pull request?
- Replace `new Integer(0)` by a serializable instance in RDD.scala
- Use `.valueOf()` instead of the constructors of `java.lang.Integer` and `java.lang.Double`, because the constructors have been deprecated (see https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html), as illustrated below
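
The replacement in a nutshell, shown in Scala; the affected Java and Scala files change the same way.

```scala
val oldWay: java.lang.Integer = new java.lang.Integer(0)      // deprecated constructor, warns
val newWay: java.lang.Integer = java.lang.Integer.valueOf(0)  // preferred factory, may reuse cached values
val newDbl: java.lang.Double  = java.lang.Double.valueOf(0.0)
```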

### Why are the changes needed?
This fixes the following warnings:
1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.
5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By RDDSuite, MutableProjectionSuite, UDFSuite and HiveUserDefinedTypeSuite

Closes #27399 from MaxGekk/eliminate-warning-part4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-31 15:03:16 -06:00
Liang-Chi Hsieh 5916c7d0d0 [SPARK-30673][SQL][TESTS] Test cases in HiveShowCreateTableSuite should create Hive table
### What changes were proposed in this pull request?

This patch makes the test cases in HiveShowCreateTableSuite create Hive table instead of data source table.

### Why are the changes needed?

Because Spark SQL now creates a data source table when no provider is specified in the SQL command, some test cases in HiveShowCreateTableSuite create a data source table rather than a Hive table.

It is confusing and not good for the purpose of this test suite.

### Does this PR introduce any user-facing change?

No, only test case.

### How was this patch tested?

Unit test.

Closes #27393 from viirya/SPARK-30673.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2020-01-30 13:23:58 -08:00
Burak Yavuz db528e4fe1 [SPARK-30535][SQL] Revert "[] Migrate ALTER TABLE commands to the new framework
### What changes were proposed in this pull request?

This reverts commit b5cb9abdd5.

### Why are the changes needed?

The merged commit (#27243) was too risky for several reasons:
 1. It doesn't fix a bug
 2. It makes the resolution of the table that's going to be altered a child. We had avoided this on purpose as having an arbitrary rule change the child of AlterTable seemed risky. This change alone is a big -1 for me for this change.
 3. While the code may look cleaner, I think this approach makes certain things harder, e.g. differentiating between the Hive based Alter table CHANGE COLUMN and ALTER COLUMN syntax. Resolving and normalizing columns for ALTER COLUMN also becomes a bit harder, as we now have to check every single AlterTable command instead of just a single ALTER TABLE ALTER COLUMN statement

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing unit tests

This closes #27315

Closes #27327 from brkyvz/revAlter.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-22 22:43:46 -08:00
Kent Yao f2d71f5838 [SPARK-30591][SQL] Remove the nonstandard SET OWNER syntax for namespaces
### What changes were proposed in this pull request?

This pr removes the nonstandard `SET OWNER` syntax for namespaces and changes the owner reserved properties from `ownerName` and `ownerType` to `owner`.

### Why are the changes needed?

The `SET OWNER` syntax for namespaces is Hive-specific and not SQL standard; we need a more future-proof design before we implement user-facing changes for SQL security features.

### Does this PR introduce any user-facing change?

No, this just reverts a syntax that was never made public.

### How was this patch tested?

Modified unit tests.

Closes #27300 from yaooqinn/SPARK-30591.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-22 16:00:05 +08:00
fuwhu cfb1706eaa [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
### What changes were proposed in this pull request?
Add the optimizer rule PruneHiveTablePartitions, which prunes Hive table partitions based on filters on the partition columns.
With pruning, the total size of the remaining partitions may be small enough for a broadcast join in the JoinSelection strategy.

### Why are the changes needed?
In the JoinSelection strategy, Spark uses "plan.stats.sizeInBytes" to decide whether a plan is suitable for a broadcast join.
Currently, "plan.stats.sizeInBytes" does not take pruned partitions into account, so it may miss some broadcast joins and use sort-merge joins instead, which definitely impacts join performance.
This PR aims at taking pruned partitions into account for Hive tables in "plan.stats.sizeInBytes" and thereby improving performance by using broadcast joins where possible.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Added unit tests.

This is based on #25919, credits should go to lianhuiwang and advancedxy.

Closes #26805 from fuwhu/SPARK-15616.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 21:26:30 +08:00
Kent Yao 24efa43826 [SPARK-30019][SQL] Add the owner property to v2 table
### What changes were proposed in this pull request?

Add the `owner` property to v2 tables; it is reserved by `TableCatalog` and indicates the table's owner.

### Why are the changes needed?

Enhance ownership management in the catalog API.

### Does this PR introduce any user-facing change?

Yes, adds one reserved property, `owner`, which is no longer allowed in OPTIONS/TBLPROPERTIES unless the legacy mode is on.

### How was this patch tested?

Added unit tests.

Closes #27249 from yaooqinn/SPARK-30019.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 10:37:49 +08:00
Terry Kim b5cb9abdd5 [SPARK-30535][SQL] Migrate ALTER TABLE commands to the new framework
### What changes were proposed in this pull request?

Use the new framework to resolve the ALTER TABLE commands.

This PR also refactors ALTER TABLE logical plans such that they extend a base class `AlterTable`. Each plan now implements `def changes: Seq[TableChange]` for any table change operations.

Additionally, `UnresolvedV2Relation` and its usage is completely removed.

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Updated existing tests

Closes #27243 from imback82/v2commands_newframework.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-20 21:33:44 +08:00
Kevin Yu 96a344511e [SPARK-25993][SQL][TESTS] Add test cases for CREATE EXTERNAL TABLE with subdirectories
### What changes were proposed in this pull request?

This PR aims to add test cases for the resolution of ORC table locations reported by [SPARK-25993](https://issues.apache.org/jira/browse/SPARK-25993),
and also adds corresponding test cases for Parquet tables.

### Why are the changes needed?

The current behavior is complex; these test suites are designed to prevent accidental behavior changes. This PR is rebased on master; the original PR is [23108](https://github.com/apache/spark/pull/23108).

### Does this PR introduce any user-facing change?

No. This adds test cases only.

### How was this patch tested?

This is a new test case.

Closes #27130 from kevinyu98/spark-25993-2.

Authored-by: Kevin Yu <qyu@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-17 17:17:29 -08:00
Terry Kim 64fe192fef [SPARK-30282][SQL] Migrate SHOW TBLPROPERTIES to new framework
### What changes were proposed in this pull request?

Use the new framework to resolve the SHOW TBLPROPERTIES command. This PR along with #27243 should update all the existing V2 commands with `UnresolvedV2Relation`.

### Why are the changes needed?

This is a part of the effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce any user-facing change?

Yes, `SHOW TBLPROPERTIES temp_view` now fails: an `AnalysisException` is thrown with the message `temp_view is a temp view not table`. Previously, it returned an empty row.

### How was this patch tested?

Existing tests

Closes #26921 from imback82/consistnet_v2command.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-17 16:51:44 +08:00
Wenchen Fan 0bd7a3dfab [SPARK-29572][SQL] add v1 read fallback API in DS v2
### What changes were proposed in this pull request?

Add a `V1Scan` interface, so that data source v1 implementations can migrate to DS v2 much easier.

### Why are the changes needed?

It's a lot of work to migrate v1 sources to DS v2. The new API added here can allow v1 sources to go through v2 code paths without implementing all the Batch, Stream, PartitionReaderFactory, ... stuff.

We already have a v1 write fallback API after https://github.com/apache/spark/pull/25348

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test suite

Closes #26231 from cloud-fan/v1-read-fallback.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-17 12:40:51 +08:00
Wenchen Fan 883ae331c3 [SPARK-30497][SQL] migrate DESCRIBE TABLE to the new framework
### What changes were proposed in this pull request?

Use the new framework to resolve the DESCRIBE TABLE command.

The v1 DESCRIBE TABLE command supports both tables and views. Checked against Hive and Presto: they don't have a DESCRIBE TABLE syntax, only DESCRIBE, which supports both tables and views:
1. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DescribeTable/View/MaterializedView/Column
2. https://prestodb.io/docs/current/sql/describe.html

We should make it clear that DESCRIBE supports both tables and views, by renaming the command to `DescribeRelation`.

This PR also tunes the framework a little bit to support the case that a command accepts both table and view.

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.

Note that I make a separate PR here instead of #26921, as I need to update the framework to support a new use case: accept both table and view.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27187 from cloud-fan/describe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-15 17:38:52 -08:00
Takeshi Yamamuro 8a926e448f [SPARK-26736][SQL] Partition pruning through nondeterministic expressions in Hive tables
### What changes were proposed in this pull request?

This PR intends to improve partition pruning for nondeterministic expressions in Hive tables:

Before this PR:
```
scala> sql("""create table test(id int) partitioned by (dt string)""")
scala> sql("""select * from test where dt='20190101' and rand() < 0.5""").explain()

== Physical Plan ==
*(1) Filter ((isnotnull(dt#19) AND (dt#19 = 20190101)) AND (rand(6515336563966543616) < 0.5))
+- Scan hive default.test [id#18, dt#19], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#18], [dt#19], Statistics(sizeInBytes=8.0 EiB)
```
After this PR:
```
== Physical Plan ==
*(1) Filter (rand(-9163956883277176328) < 0.5)
+- Scan hive default.test [id#0, dt#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#0], [dt#1], Statistics(sizeInBytes=8.0 EiB), [isnotnull(dt#1), (dt#1 = 20190101)]
```
This PR is the rework of #24118.

### Why are the changes needed?

For better performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit tests added.

Closes #27219 from maropu/SPARK-26736.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-01-16 08:36:22 +09:00
jiake b389b8c5f0 [SPARK-30188][SQL] Resolve the failed unit tests when enable AQE
### What changes were proposed in this pull request?
Fix all the tests that fail when AQE is enabled.

### Why are the changes needed?
Run more tests with AQE to catch bugs, and make it easier to enable AQE by default in the future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests

Closes #26813 from JkSelf/enableAQEDefault.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-13 22:55:19 +08:00
Maxim Gekk f5118f81e3 [SPARK-30409][SPARK-29173][SQL][TESTS] Use NoOp datasource in SQL benchmarks
### What changes were proposed in this pull request?
In the PR, I propose to replace `.collect()`, `.count()` and `.foreach(_ => ())` in SQL benchmarks and use the `NoOp` datasource. I added an implicit class to `SqlBasedBenchmark` with the `.noop()` method. It can be used in benchmark like: `ds.noop()`. The last one is unfolded to `ds.write.format("noop").mode(Overwrite).save()`.
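
A hedged sketch of that implicit helper; the wrapper object and class names are assumed for illustration.

```scala
import org.apache.spark.sql.{Dataset, SaveMode}

object NoopOps {
  implicit class DatasetToNoop[T](val ds: Dataset[T]) extends AnyVal {
    // Executes the whole plan through the built-in "noop" sink without converting rows
    // to external types or writing anything out.
    def noop(): Unit = ds.write.format("noop").mode(SaveMode.Overwrite).save()
  }
}
```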

### Why are the changes needed?
To avoid additional overhead that `collect()` (and other actions) has. For example, `.collect()` has to convert values according to external types and pull data to the driver. This can hide actual performance regressions or improvements of benchmarked operations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Re-run all modified benchmarks using Amazon EC2.

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/10 |

- Run `TPCDSQueryBenchmark` using instructions from the PR #26049
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests

# Generate data. (JDK8)
$ git clone git@github.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1  // This need `Spark 2.4`
```
- Other benchmarks ran by the script:
```
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
    ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ExtractBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.InExpressionBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.IntervalBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.JoinBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MakeDateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.MiscBenchmark'],
    ['hive/test', 'org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.OrcV2NestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.ParquetNestedSchemaPruningBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.RangeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.UDFBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.WideTableBenchmark'],
    ['hive/test', 'org.apache.spark.sql.hive.orc.OrcReadBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27078 from MaxGekk/noop-in-benchmarks.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-12 13:18:19 -08:00
Erik Erlandson 1f50a5875b [SPARK-27296][SQL] Allows Aggregator to be registered as a UDF
## What changes were proposed in this pull request?
Defines a new subclass of UDF: `UserDefinedAggregator`. Also allows `Aggregator` to be registered as a udf.  Under the hood, the implementation is based on the internal `TypedImperativeAggregate` class that spark's predefined aggregators make use of. The effect is that custom user defined aggregators are now serialized only on partition boundaries instead of being serialized and deserialized at each input row.

The two new modes of using `Aggregator` are as follows:
```scala
val agg: Aggregator[IN, BUF, OUT] = // typed aggregator
val udaf1 = UserDefinedAggregator(agg)
val udaf2 = spark.udf.register("agg", agg)
```

## How was this patch tested?
Unit testing has been added that corresponds to the testing suites for `UserDefinedAggregateFunction`. Additionally, unit tests explicitly count the number of aggregator ser/de cycles to ensure that it is governed only by the number of data partitions.

To evaluate the performance impact, I did two comparisons.
The code and REPL results are recorded on [this gist](https://gist.github.com/erikerlandson/b0e106a4dbaf7f80b4f4f3a21f05f892)
To characterize its behavior I benchmarked both a relatively simple aggregator and then an aggregator with a complex structure (a t-digest).

### performance
The following compares the new `Aggregator` based aggregation against UDAF. In this scenario, the new aggregation is about 100x faster. The difference in performance impact depends on the complexity of the aggregator. For very simple aggregators (e.g. implementing 'sum', etc), the performance impact is more like 25-30%.

```scala
scala> import scala.util.Random._, org.apache.spark.sql.Row, org.apache.spark.tdigest._
import scala.util.Random._
import org.apache.spark.sql.Row
import org.apache.spark.tdigest._

scala> val data = sc.parallelize(Vector.fill(50000){(nextInt(2), nextGaussian, nextGaussian.toFloat)}, 5).toDF("cat", "x1", "x2")
data: org.apache.spark.sql.DataFrame = [cat: int, x1: double ... 1 more field]

scala> val udaf = TDigestUDAF(0.5, 0)
udaf: org.apache.spark.tdigest.TDigestUDAF = TDigestUDAF(0.5,0)

scala> val bs = Benchmark.sample(10) { data.agg(udaf($"x1"), udaf($"x2")).first }
bs: Array[(Double, org.apache.spark.sql.Row)] = Array((16.523,[TDigestSQL(TDigest(0.5,0,130,TDigestMap(-4.9171836327285225 -> (1.0, 1.0), -3.9615949140987685 -> (1.0, 2.0), -3.792874086327091 -> (0.7500781537109753, 2.7500781537109753), -3.720534874164185 -> (1.796754196108008, 4.546832349818983), -3.702105588052377 -> (0.4531676501810167, 5.0), -3.665883591332569 -> (2.3434687534153142, 7.343468753415314), -3.649982231368131 -> (0.6565312465846858, 8.0), -3.5914188829817744 -> (4.0, 12.0), -3.530472305581248 -> (4.0, 16.0), -3.4060489584449467 -> (2.9372251939818383, 18.93722519398184), -3.3000694035428486 -> (8.12412890252889, 27.061354096510726), -3.2250016655261877 -> (8.30564453211017, 35.3669986286209), -3.180537395623448 -> (6.001782561137285, 41.3687811...

scala> bs.map(_._1)
res0: Array[Double] = Array(16.523, 17.138, 17.863, 17.801, 17.769, 17.786, 17.744, 17.8, 17.939, 17.854)

scala> val agg = TDigestAggregator(0.5, 0)
agg: org.apache.spark.tdigest.TDigestAggregator = TDigestAggregator(0.5,0)

scala> val udaa = spark.udf.register("tdigest", agg)
udaa: org.apache.spark.sql.expressions.UserDefinedAggregator[Double,org.apache.spark.tdigest.TDigestSQL,org.apache.spark.tdigest.TDigestSQL] = UserDefinedAggregator(TDigestAggregator(0.5,0),None,true,true)

scala> val bs = Benchmark.sample(10) { data.agg(udaa($"x1"), udaa($"x2")).first }
bs: Array[(Double, org.apache.spark.sql.Row)] = Array((0.313,[TDigestSQL(TDigest(0.5,0,130,TDigestMap(-4.9171836327285225 -> (1.0, 1.0), -3.9615949140987685 -> (1.0, 2.0), -3.792874086327091 -> (0.7500781537109753, 2.7500781537109753), -3.720534874164185 -> (1.796754196108008, 4.546832349818983), -3.702105588052377 -> (0.4531676501810167, 5.0), -3.665883591332569 -> (2.3434687534153142, 7.343468753415314), -3.649982231368131 -> (0.6565312465846858, 8.0), -3.5914188829817744 -> (4.0, 12.0), -3.530472305581248 -> (4.0, 16.0), -3.4060489584449467 -> (2.9372251939818383, 18.93722519398184), -3.3000694035428486 -> (8.12412890252889, 27.061354096510726), -3.2250016655261877 -> (8.30564453211017, 35.3669986286209), -3.180537395623448 -> (6.001782561137285, 41.36878118...

scala> bs.map(_._1)
res1: Array[Double] = Array(0.313, 0.193, 0.175, 0.185, 0.174, 0.176, 0.16, 0.186, 0.171, 0.179)

scala>
```

Closes #25024 from erikerlandson/spark-27296.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-12 15:18:30 +08:00
Zhenhua Wang 2bd8731813 [SPARK-30468][SQL] Use multiple lines to display data columns for show create table command
### What changes were proposed in this pull request?
Currently, data columns are displayed on a single line for the show create table command. When the table has many columns (and, to make things worse, the columns may have long names or comments), the displayed result is really hard to read.

To improve readability, we print each column on a separate line. Note that other systems like Hive/MySQL also display columns this way.

Also, for data columns, table properties and options, we put the closing parenthesis at the end of the last column/property/option instead of on a separate line.

### Why are the changes needed?
for better readability

### Does this PR introduce any user-facing change?
before the change:
```
spark-sql> show create table test_table;
CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
OPTIONS (
  `bar` '2',
  `foo` '1'
)
TBLPROPERTIES (
  'a' = 'x',
  'b' = 'y'
)
```
after the change:
```
spark-sql> show create table test_table;
CREATE TABLE `test_table` (
  `col1` INT COMMENT 'This is comment for column 1',
  `col2` STRING COMMENT 'This is comment for column 2',
  `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
OPTIONS (
  `bar` '2',
  `foo` '1')
TBLPROPERTIES (
  'a' = 'x',
  'b' = 'y')
```

### How was this patch tested?
modified existing tests

Closes #27147 from wzhfy/multi_line_columns.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-10 10:55:53 -06:00
Kent Yao bcf07cbf5f [SPARK-30018][SQL] Support ALTER DATABASE SET OWNER syntax
### What changes were proposed in this pull request?
In this pull request, we are going to support the `SET OWNER` syntax for databases and namespaces:

```sql
ALTER (DATABASE|SCHEMA|NAMESPACE) database_name SET OWNER [USER|ROLE|GROUP] user_or_role_group;
```
Before commit 332e252a14, we didn't care much about ownership of the catalog objects. In 332e252a14, we decided to use properties to store ownership information, and temporarily used `alter database ... set dbproperties ...` to support switching the ownership of a database. This PR aims to replace that with the formal syntax.

In Hive, `ownerName/Type` are fields of the database object, but they can also appear as normal properties:
```
create schema test1 with dbproperties('ownerName'='yaooqinn')
```
The create/alter database syntax will not change the owner to `yaooqinn` but will store it in the parameters instead, e.g.
```
+----------+----------+---------------------------------------------------------------+-------------+-------------+-----------------------+--+
| db_name  | comment  |                           location                            | owner_name  | owner_type  |      parameters       |
+----------+----------+---------------------------------------------------------------+-------------+-------------+-----------------------+--+
| test1    |          | hdfs://quickstart.cloudera:8020/user/hive/warehouse/test1.db  | anonymous   | USER        | {ownerName=yaooqinn}  |
+----------+----------+---------------------------------------------------------------+-------------+-------------+-----------------------+--+
```
In this pull request, because `ownerName` becomes a reserved property, it will neither change the owner nor be stored in dbproperties; it is simply omitted silently.

### Why are the changes needed?

Formal syntax support for changing database ownership

### Does this PR introduce any user-facing change?

yes, add a new syntax

### How was this patch tested?

add unit tests

Closes #26775 from yaooqinn/SPARK-30018.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-10 16:47:08 +08:00
Yuming Wang 17881a467a [SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty
## What changes were proposed in this pull request?

We invalidate the table relation once table data is changed by [SPARK-21237](https://issues.apache.org/jira/browse/SPARK-21237). But there is a situation where we have not invalidated it (`spark.sql.statistics.size.autoUpdate.enabled=false` and `table.stats.isEmpty`):
07c4b9bd1f/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (L44-L54)

This will introduce some issues, e.g. [SPARK-19784](https://issues.apache.org/jira/browse/SPARK-19784), [SPARK-19845](https://issues.apache.org/jira/browse/SPARK-19845), [SPARK-25403](https://issues.apache.org/jira/browse/SPARK-25403), [SPARK-25332](https://issues.apache.org/jira/browse/SPARK-25332) and [SPARK-28413](https://issues.apache.org/jira/browse/SPARK-28413).

This is an example to reproduce [SPARK-19784](https://issues.apache.org/jira/browse/SPARK-19784):
```scala
val path = "/tmp/spark/parquet"
spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.sql("INSERT INTO TABLE t VALUES (1)")
spark.range(5).toDF("a").write.parquet(path)
spark.sql(s"ALTER TABLE t SET LOCATION '${path}'")
spark.table("t").count() // return 1
spark.sql("refresh table t")
spark.table("t").count() // return 5
```

This PR invalidates the table relation in this case (`spark.sql.statistics.size.autoUpdate.enabled=false` and `table.stats.isEmpty`) to fix the issue.

## How was this patch tested?

unit tests

Closes #22721 from wangyum/SPARK-25403.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-07 11:41:34 +08:00
Wenchen Fan 1743d5be7f [SPARK-30284][SQL] CREATE VIEW should keep the current catalog and namespace
### What changes were proposed in this pull request?

Update the CREATE VIEW command to store the current catalog and namespace, instead of only the current database, in the view metadata. Also update the analyzer to leverage the catalog and namespace in the view metadata to resolve relations inside views.

Note that this PR still keeps the current way of resolving views, by recursively calling the Analyzer. This is necessary because the view text may contain CTEs, window specs, etc., which need rules outside of the main resolution batch (e.g. `CTESubstitution`).
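
A minimal sketch of the behavior this changes, using made-up database, table, and view names (it assumes a local `SparkSession` with the default catalog):

```scala
import org.apache.spark.sql.SparkSession

object ViewNamespaceSketch extends App {
  val spark = SparkSession.builder().master("local").appName("SPARK-30284").getOrCreate()

  spark.sql("CREATE DATABASE IF NOT EXISTS db1")
  spark.sql("CREATE TABLE IF NOT EXISTS db1.t (a INT) USING parquet")
  spark.sql("USE db1")
  // The view text refers to `t` with no qualifier; the view metadata now
  // records the catalog and namespace that were current at creation time.
  spark.sql("CREATE VIEW v AS SELECT a FROM t")

  // Switch the current namespace away from db1.
  spark.sql("USE default")
  // With this change, `t` inside the view is still resolved against db1,
  // not against the current namespace.
  spark.sql("SELECT * FROM db1.v").show()
}
```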

### Why are the changes needed?

To resolve relations inside view correctly.

### Does this PR introduce any user-facing change?

Yes, fix a bug. Now tables referred by a view can be resolved correctly even if the current catalog/namespace has been updated.

### How was this patch tested?

a new test

Closes #26923 from cloud-fan/view.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-03 01:41:32 +08:00
Jungtaek Lim (HeartSaVioR) 5d870ef0bc [SPARK-26560][SQL] Spark should be able to run Hive UDF using jar regardless of current thread context classloader
### What changes were proposed in this pull request?

This patch is based on #23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.

This works as long as further queries run in the same thread and the thread's context classloader is not changed, but once the current thread's context classloader is switched back for whatever reason, Spark fails to create an instance of the class for the function.

This bug mostly affects spark-shell, as spark-shell rolls back the current thread's context classloader at every prompt. But it may also affect job-server-style deployments, where queries may run in multiple threads.

This patch fixes the issue by switching the context classloader to the classloader that loaded the class. Fortunately, the FunctionBuilder created by `makeFunctionBuilder` has the Class as part of its closure, hence the Class itself can be provided regardless of the current thread's context classloader.
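
A minimal sketch (not the actual patch) of the general technique described above: run the instantiation with the thread's context classloader temporarily switched to the classloader that loaded the class, then restore the previous one.

```scala
object ContextClassLoaderSketch {
  // Run `body` with the current thread's context classloader set to the
  // loader that loaded `cls`, restoring the previous loader afterwards.
  def withClassLoaderOf[T](cls: Class[_])(body: => T): T = {
    val thread = Thread.currentThread()
    val previous = thread.getContextClassLoader
    thread.setContextClassLoader(cls.getClassLoader)
    try body finally thread.setContextClassLoader(previous)
  }

  // Hypothetical usage: instantiate a Hive UDF class that was loaded by the
  // ADD JAR classloader, regardless of the current context classloader.
  def newInstanceOf(cls: Class[_]): AnyRef =
    withClassLoaderOf(cls)(cls.getConstructor().newInstance().asInstanceOf[AnyRef])
}
```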

### Why are the changes needed?

Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27025 from HeartSaVioR/SPARK-26560-revised.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: nivo091 <nivedeeta.singh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-02 15:44:45 +08:00
Zhenhua Wang a8bf5d823b [SPARK-30339][SQL] Avoid to fail twice in function lookup
### What changes were proposed in this pull request?

Currently, if a function lookup fails, Spark gives it a second chance by casting decimal-typed arguments to double. But for cases where no decimal type is involved, it's meaningless to look up again, and it incurs extra cost such as unnecessary metastore access. We should throw the exception directly in these cases.
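
A simplified sketch of the idea (the helper and names below are illustrative, not the actual resolution code): only retry the lookup with decimals widened to double when the argument types actually contain a decimal.

```scala
import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType}

object FunctionLookupSketch {
  def lookupWithFallback[T](argTypes: Seq[DataType])(lookup: Seq[DataType] => T): T =
    try {
      lookup(argTypes)
    } catch {
      // Give the lookup a second chance only when a decimal argument exists;
      // otherwise rethrow immediately and skip the extra metastore access.
      case _: Exception if argTypes.exists(_.isInstanceOf[DecimalType]) =>
        lookup(argTypes.map { case _: DecimalType => DoubleType; case t => t })
    }
}
```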

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Covered by existing tests.

Closes #26994 from wzhfy/avoid_udf_fail_twice.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-31 01:09:51 +09:00
Wing Yew Poon c72f88b0ba [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
### What changes were proposed in this pull request?

When querying a partitioned table with format `org.apache.hive.hcatalog.data.JsonSerDe` and more than one task runs in each executor concurrently, the following exception is encountered:

`java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord`

The exception occurs in `HadoopTableReader.fillObject`.

`org.apache.hive.hcatalog.data.JsonSerDe#initialize` populates a `cachedObjectInspector` field by calling `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is not thread-safe; this `cachedObjectInspector` is returned by `JsonSerDe#getObjectInspector`.

We protect against this Hive bug by synchronizing on an object when we need to call `initialize` on `org.apache.hadoop.hive.serde2.Deserializer` instances (which may be `JsonSerDe` instances). By doing so, the `ObjectInspector` for the `Deserializer` of the partitions of the JSON table and that of the table `SerDe` are the same cached `ObjectInspector` and `HadoopTableReader.fillObject` then works correctly. (If the `ObjectInspector`s are different, then a bug in `HCatRecordObjectInspector` causes an `ArrayList` to be created instead of an `HCatRecord`, resulting in the `ClassCastException` that is seen.)
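
A sketch of the protection described above (the helper name and lock are illustrative, not the exact Spark code): every `Deserializer.initialize` call goes through a single lock, so the non-thread-safe HCatalog object-inspector cache is never populated concurrently.

```scala
import java.util.Properties

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.Deserializer

object DeserializerInitSketch {
  // A single lock shared by every caller in the executor JVM.
  private val initLock = new Object

  def initialize(deserializer: Deserializer, conf: Configuration, props: Properties): Unit =
    initLock.synchronized {
      deserializer.initialize(conf, props)
    }
}
```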

### Why are the changes needed?

To avoid HIVE-15773 / HIVE-21752.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Tested manually on a cluster with a partitioned JSON table and running a query using more than one core per executor. Before this change, the ClassCastException happens consistently. With this change it does not happen.

Closes #26895 from wypoon/SPARK-17398.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-12-20 10:39:26 -08:00
chenliang abfc267f0c [SPARK-30262][SQL] Avoid NumberFormatException when totalSize is empty
### What changes were proposed in this pull request?

We can get the partition statistics info, but in some special cases the info like totalSize, rawDataSize, and rowCount may be empty. When we run DDLs like
`desc formatted partition`, a NumberFormatException is thrown as below:
```
spark-sql> desc formatted table1 partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted table1 partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.(BigInteger.java:411)
at java.math.BigInteger.(BigInteger.java:597)
at scala.math.BigInt$.apply(BigInt.scala:77)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
```
Although we can use `ANALYZE TABLE ... PARTITION` to update the totalSize, rawDataSize or rowCount, it's unreasonable for normal SQL to throw a NumberFormatException for an empty totalSize. We should handle the empty case in readHiveStats.
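
A sketch of the defensive parsing described above (the helper name is made up): a missing, empty, or malformed statistic value is treated as "no statistic" instead of letting `BigInt("")` throw "Zero length BigInteger".

```scala
object HiveStatsSketch {
  // Read a numeric statistic such as totalSize, rawDataSize, or rowCount from
  // the Hive partition/table parameters, tolerating empty or bad values.
  def readStat(params: Map[String, String], key: String): Option[BigInt] =
    params.get(key).map(_.trim).filter(_.nonEmpty).flatMap { v =>
      try Some(BigInt(v)) catch { case _: NumberFormatException => None }
    }
}

// Example: readStat(Map("totalSize" -> ""), "totalSize") returns None
// instead of throwing java.lang.NumberFormatException: Zero length BigInteger.
```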

### Why are the changes needed?

This is related to the robustness of the code and may lead to unexpected exceptions in unpredictable situations. Here is the case:
<img width="981" alt="image" src="https://user-images.githubusercontent.com/20614350/70845771-7b88b400-1e8d-11ea-95b0-df5c58097d7d.png">

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

manual

Closes #26892 from southernriver/SPARK-30262.

Authored-by: chenliang <southernriver@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-18 15:12:32 -08:00
Zhenhua Wang 18431c7baa [SPARK-30269][SQL] Should use old partition stats to decide whether to update stats when analyzing partition
### What changes were proposed in this pull request?
It's an obvious bug: currently, when analyzing partition stats, we compare the newly computed stats with the old table stats to decide whether to update them, while we should compare with the old partition stats instead.

### Why are the changes needed?
bug fix

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
add new tests

Closes #26908 from wzhfy/failto_update_part_stats.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-17 22:21:26 +09:00
ulysses 1da7e8295c [SPARK-30201][SQL] HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
### What changes were proposed in this pull request?

Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy option, which converts any string to a UTF-8 string. When writing non-UTF-8 data, `EFBFBD` (the UTF-8 replacement character) appears.
We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through unchanged.

### Why are the changes needed?

Here is the way to reproduce (a Scala sketch of these steps follows the list):
1. Make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
2. create table test1 (c string) location '$file_path';
3. select hex(c) from test1; // AABBCC
4. create table test2 (c string) as select c from test1;
5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD
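
A Scala sketch of the repro steps above (paths and table names are illustrative; it assumes a Spark build with Hive support on the classpath):

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object NonUtf8ReproSketch extends App {
  val spark = SparkSession.builder()
    .master("local")
    .appName("SPARK-30201")
    .enableHiveSupport()
    .getOrCreate()

  // Write the raw bytes 0xAA 0xBB 0xCC, which are not valid UTF-8.
  val dir = Files.createTempDirectory("spark-30201")
  Files.write(dir.resolve("data"), Array(0xAA, 0xBB, 0xCC).map(_.toByte))

  spark.sql(s"CREATE TABLE test1 (c STRING) STORED AS TEXTFILE LOCATION '${dir.toUri}'")
  spark.sql("SELECT hex(c) FROM test1").show()   // AABBCC
  spark.sql("CREATE TABLE test2 STORED AS TEXTFILE AS SELECT c FROM test1")
  spark.sql("SELECT hex(c) FROM test2").show()   // EFBFBD... before the fix
}
```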

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Closes #26831 from ulysses-you/SPARK-30201.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-17 12:15:53 +08:00
Yuming Wang 696288f623 [INFRA] Reverts commit 56dcd79 and c216ef1
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79

2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1

### Why are the changes needed?
Shouldn't change master.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master

Closes #26915 from wangyum/revert-master.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-16 19:57:44 -07:00
Yuming Wang 56dcd79992 Preparing development version 3.0.1-SNAPSHOT 2019-12-17 01:57:27 +00:00
Yuming Wang c216ef1d03 Preparing Spark release v3.0.0-preview2-rc2 2019-12-17 01:57:21 +00:00
Wenchen Fan 982f72f4c3 [SPARK-30238][SQL] hive partition pruning can only support string and integral types
### What changes were proposed in this pull request?

Check the partition column data type and only allow string and integral types in hive partition pruning.

### Why are the changes needed?

Currently we only support string and integral types in Hive partition pruning, but the check is done on literals. If the predicate is `InSet`, there is no literal, so we may pass an unsupported partition predicate to Hive and cause problems.
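
A simplified sketch of the kind of check this adds (not the actual Spark code): the decision is made from the partition column's data type rather than from the shape of the literals in the predicate.

```scala
import org.apache.spark.sql.types._

object HivePruningSketch {
  // Only string and integral partition columns are safe to push down to the
  // Hive metastore filter; anything else (date, decimal, double, ...) is not.
  def isPushableType(dt: DataType): Boolean = dt match {
    case StringType                                    => true
    case ByteType | ShortType | IntegerType | LongType => true
    case _                                             => false
  }
}
```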

### Does this PR introduce any user-facing change?

yes. fix a bug. A query fails before and can run now.

### How was this patch tested?

a new test

Closes #26871 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-12 13:07:20 -08:00
Pablo Langa 9cf9304e17 [SPARK-30038][SQL] DESCRIBE FUNCTION should do multi-catalog resolution
### What changes were proposed in this pull request?

Add DescribeFunctionsStatement and make DESCRIBE FUNCTIONS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusion with
`DESCRIBE FUNCTIONS namespace.function`.

### Does this PR introduce any user-facing change?

Yes. When running DESCRIBE FUNCTIONS namespace.function Spark fails the command if the current catalog is set to a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26840 from planga82/feature/SPARK-30038_DescribeFunction_V2Catalog.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-11 14:02:58 -08:00
Jungtaek Lim (HeartSaVioR) 538b8d101c [SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary imports
### What changes were proposed in this pull request?

This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (the GitHub Action caught it and I can reproduce it locally). It looks like the Jenkins build may have a different policy on the Java style check, or may be less accurate.

### Why are the changes needed?

Java linter starts complaining.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

lint-java passed locally

This closes #26819

Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-09 08:57:20 -08:00
Gengliang Wang a717d219a6 [SPARK-30159][SQL][TESTS] Fix the method calls of QueryTest.checkAnswer
### What changes were proposed in this pull request?

Before this PR, the method `checkAnswer` in Object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen.
The actual exceptions are thrown in the trait `QueryTest`.

However, there are some test suites (`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers.

After this PR:
1. the method `checkAnswer` in Object `QueryTest` will fail tests on errors or unexpected results.
2. add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. There are some test suites that use this one to customize the test failure message.
3. for the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from the object `QueryTest` (see the sketch after this list).
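
A sketch of how a suite is expected to call these APIs after the change (the suite name is made up, the helper's return type is assumed, and it requires the spark-sql test jar on the classpath):

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class MyQuerySuite extends QueryTest with SharedSparkSession {
  test("simple aggregation") {
    val df = spark.range(3).selectExpr("sum(id) AS s")
    // Inherited from the QueryTest trait: now fails the test on a mismatch.
    checkAnswer(df, Seq(Row(3L)))
    // For a customized failure message, the object-level helper can be used
    // instead (assumed to return the error as an Option[String]):
    // QueryTest.getErrorMessageInCheckAnswer(df, Seq(Row(3L))).foreach(fail(_))
  }
}
```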

### Why are the changes needed?

We should fix these method calls to perform actual validations in test suites.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #26788 from gengliangwang/fixCheckAnswer.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-09 22:19:08 +09:00
Pablo Langa bca9de6684 [SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowFunctionsStatement and make SHOW FUNCTIONS go through the same catalog/table resolution framework of v2 commands.

We don’t have these methods in the catalog to implement a V2 command:
* catalog.listFunctions

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusion with
`SHOW FUNCTIONS LIKE namespace.function`.

### Does this PR introduce any user-facing change?

Yes. When running SHOW FUNCTIONS LIKE namespace.function Spark fails the command if the current catalog is set to a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-12-08 20:15:09 -08:00
wuyi 58be82ad4b [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for the `CREATE TABLE` syntax in Spark 3.0, instead of `hive`.

And to help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.

### Why are the changes needed?

1. Currently, the `CREATE TABLE` syntax uses the hive provider to create tables, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent.

2. Users may get confused in some cases. For example:

```
CREATE TABLE t1 (c1 INT) USING PARQUET;
CREATE TABLE t2 (c1 INT);
```

In these two DDLs, users may think that `t2` should also use parquet as the default provider, since Spark always advertises parquet as the default format. However, it's hive in this case.

On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`:

```
CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE;
CREATE TABLE t4 AS SELECT 1 AS VALUE;
```
And these two cases together can be really confusing.

3. Now, Spark SQL is very independent and popular. We do not need to be fully consistent with Hive's behavior.

### Does this PR introduce any user-facing change?

Yes. Before this PR, the `CREATE TABLE` syntax used the hive provider. Now, it uses the value of `spark.sql.sources.default` as its provider.

### How was this patch tested?

Added tests in `DDLParserSuite` and `HiveDDLSuite`.

Closes #26736 from Ngone51/dev-create-table-using-parquet-by-default.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-07 02:15:25 +08:00
Sean Owen 7782b61a31 [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings
TL;DR - this is more of the same change in https://github.com/apache/spark/pull/26748

I told you it'd be iterative!

Closes #26765 from srowen/SPARK-29392.3.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 13:48:29 -08:00
Kent Yao 332e252a14 [SPARK-29425][SQL] The ownership of a database should be respected
### What changes were proposed in this pull request?

Keep the owner of a database when executing alter database commands

### Why are the changes needed?

Spark will inadvertently delete the owner of a database for executing databases ddls

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

add and modify uts

Closes #26080 from yaooqinn/SPARK-29425.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 16:14:27 +08:00
Sean Owen 2ceed6f32c [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings
### What changes were proposed in this pull request?

Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent.

### Why are the changes needed?

Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13.

The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings.

While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't) with `$"foo"` or just `"foo"`.
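
A small illustration of the replacement (the DataFrame is made up): the Symbol shorthand, deprecated in Scala 2.13, is swapped for the equivalent `$"foo"` interpolator, or for a plain string where a String overload is accepted.

```scala
import org.apache.spark.sql.SparkSession

object SymbolSyntaxSketch extends App {
  val spark = SparkSession.builder().master("local").appName("SPARK-29392").getOrCreate()
  import spark.implicits._

  val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

  df.select('name)    // before: Symbol shorthand, deprecated in Scala 2.13
  df.select($"name")  // after: Column via the string interpolator
  df.select("name")   // after: plain column name where a String is accepted
  df.as('d)           // before
  df.as("d")          // after
}
```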

### Does this PR introduce any user-facing change?

Should not change behavior.

### How was this patch tested?

Existing tests.

Closes #26748 from srowen/SPARK-29392.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-04 15:03:26 -08:00
sychen 332e593093 [SPARK-29943][SQL] Improve error messages for unsupported data type
### What changes were proposed in this pull request?
Improve error messages for unsupported data type.

### Why are the changes needed?
When Spark reads a Hive table and encounters an unsupported field type, the exception message contains only the unsupported type, so the user cannot tell which field of which table it comes from.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
```create view t AS SELECT STRUCT('a' AS `$a`, 1 AS b) as q;```
current:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>
change:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q

```select * from t,t_normal_1,t_normal_2```
current:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>
change:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q, db: default, table: t

Closes #26577 from cxzl25/unsupport_data_type_msg.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-03 10:07:09 +09:00
LantaoJin 04a5b8f5f8 [SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE
### What changes were proposed in this pull request?
In SPARK-29421 (#26097), we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`.
Hive supports specifying a new file format with the `STORED AS` syntax:
```sql
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
```
For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`.

### Why are the changes needed?
See https://github.com/apache/spark/pull/26097#issue-327424759

### Does this PR introduce any user-facing change?
Add a new syntax based on current CTL:
CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat];

### How was this patch tested?
Add UTs.

Closes #26466 from LantaoJin/SPARK-29839.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 16:11:58 +08:00
Liang-Chi Hsieh 85cb388ae3 [SPARK-30050][SQL] analyze table and rename table should not erase hive table bucketing info
### What changes were proposed in this pull request?

This patch adds the Hive provider into the table metadata in `HiveExternalCatalog.alterTableStats`. When we call `HiveClient.alterTable`, `alterTable` will erase the bucketing info if it cannot find the hive provider in the given table metadata.

Rename table also has this issue.
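
A minimal sketch of the idea (not the exact patch): make sure the metadata handed to `HiveClient.alterTable` carries the hive provider so the existing bucketing info survives.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

object AlterTableStatsSketch {
  // Fill in the provider before calling HiveClient.alterTable, so alterTable
  // does not treat the table as provider-less and drop its bucketing info.
  def withHiveProvider(table: CatalogTable): CatalogTable =
    if (table.provider.isEmpty) table.copy(provider = Some("hive")) else table
}
```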

### Why are the changes needed?

Because running `ANALYZE TABLE` on a Hive table that has bucketing info will erase the existing bucketing info.

### Does this PR introduce any user-facing change?

Yes. After this PR, running `ANALYZE TABLE` on a Hive table won't erase the existing bucketing info.

### How was this patch tested?

Unit test.

Closes #26685 from viirya/fix-hive-bucket.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-02 13:40:11 +08:00
Yuming Wang 708ab57f37 [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column
## What changes were proposed in this pull request?

[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved padding decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063.

> HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
 However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.

**Spark SQL**:
```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>

// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+

// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+

// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+
```

**PostgreSQL**:
```sql
postgres=# select cast(1 as decimal(38, 18));
       numeric
----------------------
 1.000000000000000000
(1 row)
```
**Presto**:
```sql
presto> select cast(1 as decimal(38, 18));
        _col0
----------------------
 1.000000000000000000
(1 row)
```

## How was this patch tested?

unit tests and manual test:
```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```
Spark SQL Upgrading Guide:
![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png)

Closes #26697 from wangyum/SPARK-28461.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-02 09:02:39 +09:00
Dongjoon Hyun 9cd174a7c9 Revert "[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column"
This reverts commit 19af1fe3a2.
2019-11-27 11:07:08 -08:00
Yuming Wang 19af1fe3a2 [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column
## What changes were proposed in this pull request?

[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved padding decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063.

> HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
 However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.

**Spark SQL**:
```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>

// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+

// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+

// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+
```

**PostgreSQL**:
```sql
postgres=# select cast(1 as decimal(38, 18));
       numeric
----------------------
 1.000000000000000000
(1 row)
```
**Presto**:
```sql
presto> select cast(1 as decimal(38, 18));
        _col0
----------------------
 1.000000000000000000
(1 row)
```

## How was this patch tested?

unit tests and manual test:
```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```
Spark SQL Upgrading Guide:
![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png)

Closes #25214 from wangyum/SPARK-28461.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-27 18:13:33 +09:00
Dongjoon Hyun 2a28c73d81 [SPARK-30031][BUILD][SQL] Remove hive-2.3 profile from sql/hive module
### What changes were proposed in this pull request?

This PR aims to remove `hive-2.3` profile from `sql/hive` module.

### Why are the changes needed?

Currently, we additionally need `-Phive-1.2` or `-Phive-2.3` to build the `hive` or `hive-thriftserver` module. Without specifying one of them, the build fails like the following. This PR restores the ability to build without specifying a Hive profile.
```
$ build/mvn -DskipTests compile --pl sql/hive
...
[ERROR] [Error] /Users/dongjoon/APACHE/spark-merge/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:32: object serde is not a member of package org.apache.hadoop.hive
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

1. Pass GitHub Action dependency check with no manifest change.
2. Pass GitHub Action build for all combinations.
3. Pass the Jenkins UT.

Closes #26668 from dongjoon-hyun/SPARK-30031.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-25 15:17:27 -08:00
Dongjoon Hyun c98e5eb339 [SPARK-29981][BUILD] Add hive-1.2/2.3 profiles
### What changes were proposed in this pull request?

This PR aims at the following:
- Add two profiles, `hive-1.2` and `hive-2.3` (default)
- Validate if we keep the existing combination at least. (Hadoop-2.7 + Hive 1.2 / Hadoop-3.2 + Hive 2.3).

For now, we assume that `hive-1.2` is explicitly used with `hadoop-2.7` and `hive-2.3` with `hadoop-3.2`. The following items are beyond the scope of this PR:

- SPARK-29988 Adjust Jenkins jobs for `hive-1.2/2.3` combination
- SPARK-29989 Update release-script for `hive-1.2/2.3` combination
- SPARK-29991 Support `hive-1.2/2.3` in PR Builder

### Why are the changes needed?

This will help to switch our dependencies to update the exposed dependencies.

### Does this PR introduce any user-facing change?

This is a dev-only change: the build profile combinations change as follows.
- `-Phadoop-2.7` => `-Phadoop-2.7 -Phive-1.2`
- `-Phadoop-3.2` => `-Phadoop-3.2 -Phive-2.3`

### How was this patch tested?

Pass Jenkins with the dependency check and tests to make sure we don't change anything for now.

- [Jenkins (-Phadoop-2.7 -Phive-1.2)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)
- [Jenkins (-Phadoop-3.2 -Phive-2.3)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)

Also, from now, GitHub Action validates the following combinations.
![gha](https://user-images.githubusercontent.com/9700541/69355365-822d5e00-0c36-11ea-93f7-e00e5459e1d0.png)

Closes #26619 from dongjoon-hyun/SPARK-29981.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-23 10:02:22 -08:00
Sean Owen 1febd373ea [MINOR][TESTS] Replace JVM assert with JUnit Assert in tests
### What changes were proposed in this pull request?

Use JUnit assertions in tests uniformly, not JVM assert() statements.

### Why are the changes needed?

assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.
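
A tiny illustration of the replacement (names are illustrative): a JUnit assertion always runs and reports the mismatching values, unlike a plain assert() that can be elided and gives little detail.

```scala
import org.junit.Assert.{assertEquals, assertTrue}

object AssertStyleSketch {
  def check(result: Long): Unit = {
    // Before: assert(result == 3)  // little detail on failure, can be disabled
    assertEquals(3L, result)
    assertTrue("result must be positive", result > 0)
  }
}
```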

### Does this PR introduce any user-facing change?

No. The assertion logic should be identical.

### How was this patch tested?

Existing tests.

Closes #26581 from srowen/assertToJUnit.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 14:04:15 -06:00
Takeshi Yamamuro 0032d85153 [SPARK-29968][SQL] Remove the Predicate code from SparkPlan
### What changes were proposed in this pull request?

This refactors the Predicate code; it mainly removes `newPredicate` from `SparkPlan`.
Modifications are listed below:
 - Move `Predicate` from `o.a.s.sql.catalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sql.catalyst.expressions.predicates.scala`
 - To resolve the name conflict, rename `o.a.s.sql.catalyst.expressions.codegen.Predicate` to `o.a.s.sql.catalyst.expressions.BasePredicate`
 - Extend `CodeGeneratorWithInterpretedFallback` for `BasePredicate`

This comes from the cloud-fan suggestion: https://github.com/apache/spark/pull/26420#discussion_r348005497

### Why are the changes needed?

For better code/test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26604 from maropu/RefactorPredicate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-20 21:13:51 +08:00
LantaoJin 5ac37a8265 [SPARK-29869][SQL] improve error message in HiveMetastoreCatalog#convertToLogicalRelation
### What changes were proposed in this pull request?
In our production environment, HiveMetastoreCatalog#convertToLogicalRelation occasionally throws an AssertionError:
```sql
scala> spark.table("hive_table").show
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:208)
  at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
  at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
  at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
  at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
  at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
  at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:55)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
  at org.apache.spark.sql.SparkSession.table(SparkSession.scala:585)
  at org.apache.spark.sql.SparkSession.table(SparkSession.scala:581)
  ... 47 elided
````
Most of the cases occurred when reading a table that was created by an old Spark version.
After recreating the table, the issue goes away.

After a deep dive, the root cause is that this external table is a non-partitioned table, but its `LOCATION` is set to a partitioned path (`/tablename/dt=yyyymmdd`), so a partitionSpec is inferred.

### Why are the changes needed?
The above error message is very confusing. We need more details in the assertion failure message.

This issue is caused by `PartitioningAwareFileIndex#inferPartitioning()`. For non-Hive-metastore Spark it's useful, but for a Hive table it shouldn't infer partitions if Hive tells us it's a non-partitioned table (this check is newly added).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add UT.

Closes #26499 from LantaoJin/SPARK-29869.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 15:22:08 +08:00
Dongjoon Hyun f77c10de38 [SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+
### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](https://github.com/apache/arrow/pull/5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After https://github.com/apache/spark/pull/26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```scala
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available&#010;
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)&#010;
io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)&#010;
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)&#010;
io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)&#010;
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)&#010;
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes #26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-15 23:58:15 -08:00
lajin 4de7131cff [SPARK-29421][SQL] Supporting Create Table Like Using Provider
### What changes were proposed in this pull request?
Hive supports specifying a new file format with the `STORED AS` syntax:
```sql
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
```
We add a similar syntax for Spark. Here we separate it into two features:

1. specify a different table provider in CREATE TABLE LIKE
2. Hive compatibility

In this PR, we address the first one:
- [ ] Using `USING provider` to specify a different table provider in CREATE TABLE LIKE.
- [  ] Using `STORED AS file_format` in CREATE TABLE LIKE to address Hive compatibility.

### Why are the changes needed?
The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on the definition of table tb2. The most common use case is to create tb1 with the same schema as tb2. But an inconvenience here is that this command also copies the FileFormat from tb2; it cannot change the input/output format and serde. Adding the ability to change the file format is useful for scenarios like upgrading a table from a low-performance file format to a high-performance one (parquet, orc).

### Does this PR introduce any user-facing change?
Add a new syntax based on current CTL:
```sql
CREATE TABLE tbl2 LIKE tbl [USING parquet];
```

### How was this patch tested?
Modify some exist UTs.

Closes #26097 from LantaoJin/SPARK-29421.

Authored-by: lajin <lajin@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-11 15:25:56 +08:00
Dongjoon Hyun da848b1897 [SPARK-29796][SQL][TESTS] HiveExternalCatalogVersionsSuite should ignore preview release
### What changes were proposed in this pull request?

This aims to exclude `preview` releases to recover `HiveExternalCatalogVersionsSuite`. The new preview release has been breaking the `branch-2.4` PRBuilder since yesterday. A new release (especially a `preview` one) should not affect `branch-2.4`.
- https://github.com/apache/spark/pull/26417 (Failed 4 times)

### Why are the changes needed?

**BEFORE**
```scala
scala> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1))
java.util.NoSuchElementException: None.get
```

**AFTER**
```scala
scala> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).filterNot(_.contains("preview")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1))
res5: Array[String] = Array(2.3.4, 2.4.4)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This should pass the PRBuilder.

Closes #26428 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-07 10:28:32 -08:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
Jungtaek Lim (HeartSaVioR) fb80dfee70 [SPARK-28158][SQL][FOLLOWUP] HiveUserDefinedTypeSuite: don't use RandomDataGenerator to create row for UDT backed by ArrayType
### What changes were proposed in this pull request?

There're some issues observed in `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"`:

1) Neither the function (TestUDF) nor the test takes the "nullable" point column into account.
2) `ExamplePointUDT.sqlType` is ArrayType, which doesn't provide information on how many elements are expected. RandomDataGenerator may provide fewer elements than needed.

This patch fixes `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"` to change the type of "point" column to be non-nullable, as well as not use RandomDataGenerator to create row for UDT backed by ArrayType.

### Why are the changes needed?

CI builds are failing in high occurrences.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested by running tests locally multiple times.

Closes #26287 from HeartSaVioR/SPARK-28158-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-29 11:57:25 +08:00
uncleGen 0182817ea3 [SPARK-28158][SQL] Hive UDFs supports UDT type
## What changes were proposed in this pull request?

After this PR, we can create and register Hive UDFs to accept UDT type, like `VectorUDT` and `MatrixUDT`. These UDTs are widely used in Spark machine learning.

## How was this patch tested?

add new ut

Closes #24961 from uncleGen/SPARK-28158.

Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-28 20:50:34 +09:00
Wenchen Fan cdea520ff8 [SPARK-29532][SQL] Simplify interval string parsing
### What changes were proposed in this pull request?

Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from `CalendarInterval`.

### Why are the changes needed?

Simplify the code and fix inconsistent behaviors.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins with the updated test cases.

Closes #26190 from cloud-fan/parser.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 09:15:59 -07:00
Liang-Chi Hsieh 177bf672e4 [SPARK-29522][SQL] CACHE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add CacheTableStatement and make CACHE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
CACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

yes. When running CACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26179 from viirya/SPARK-29522.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 15:00:21 +08:00
Yuming Wang e99a9f78ea [SPARK-29498][SQL] CatalogTable to HiveTable should not change the table's ownership
### What changes were proposed in this pull request?

Converting a `CatalogTable` to a `HiveTable` will change the table's ownership. How to reproduce:
```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}

val identifier = TableIdentifier("spark_29498", None)
val owner = "SPARK-29498"
val newTable = CatalogTable(
  identifier,
  tableType = CatalogTableType.EXTERNAL,
  storage = CatalogStorageFormat(
    locationUri = None,
    inputFormat = None,
    outputFormat = None,
    serde = None,
    compressed = false,
    properties = Map.empty),
  owner = owner,
  schema = new StructType().add("i", LongType, false),
  provider = Some("hive"))

spark.sessionState.catalog.createTable(newTable, false)
// The owner is not SPARK-29498
println(spark.sessionState.catalog.getTableMetadata(identifier).owner)
```

This PR sets the `HiveTable`'s owner to the `CatalogTable`'s owner, if that owner is not empty, when converting `CatalogTable` to `HiveTable`.

### Why are the changes needed?
We should not change the ownership of the table when converting `CatalogTable` to `HiveTable`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
unit test

Closes #26160 from wangyum/SPARK-29498.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-21 15:53:36 +08:00
angerszhu 9a3dccae72 [SPARK-29379][SQL] SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'
### What changes were proposed in this pull request?
Currently, Spark SQL's `SHOW FUNCTIONS` does not show `!=`, `<>`, `between`, or `case`, but these expressions are in fact functions. We should list them in `SHOW FUNCTIONS`.
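
For illustration (not the PR's test code), the operators can now be found via `SHOW FUNCTIONS`:

```scala
// After this change, the operators below appear in the SHOW FUNCTIONS output.
spark.sql("SHOW FUNCTIONS LIKE '!='").show()
spark.sql("SHOW FUNCTIONS LIKE '<>'").show()
spark.sql("SHOW FUNCTIONS LIKE 'between'").show()
spark.sql("SHOW FUNCTIONS LIKE 'case'").show()
```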

### Why are the changes needed?

`SHOW FUNCTIONS` should list `!=`, `<>`, `between`, and `case`.

### Does this PR introduce any user-facing change?
Yes. `SHOW FUNCTIONS` now lists `!=`, `<>`, `between`, and `case`.

### How was this patch tested?
Unit tests.

Closes #26053 from AngersZhuuuu/SPARK-29379.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-19 00:19:56 +08:00
Liang-Chi Hsieh 5692680e37 [SPARK-29295][SQL] Insert overwrite to Hive external table partition should delete old data
### What changes were proposed in this pull request?

This patch proposes to delete the old Hive external partition directory, even when the partition does not exist in Hive, when doing an insert overwrite into a Hive external table partition.

### Why are the changes needed?

When doing an insert overwrite into a Hive external table partition, if the partition does not exist, Hive will not check whether the external partition directory exists before copying files. So if users drop the partition and then insert overwrite into the same partition, the partition ends up with both old and new data.

For example:
```scala
withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
  // test is an external Hive table.
  sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 1")
  sql("ALTER TABLE test DROP PARTITION(name='n1')")
  sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 2")
  sql("SELECT id FROM test WHERE name = 'n1' ORDER BY id") // Got both 1 and 2.
}
```

### Does this PR introduce any user-facing change?

Yes. This fixes a correctness issue that occurs when users drop a partition of a Hive external table and then insert overwrite into it.

### How was this patch tested?

Added test.

Closes #25979 from viirya/SPARK-29295.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 16:35:44 +08:00
Kent Yao ef4c298cc9 [SPARK-29405][SQL] Alter table / Insert statements should not change a table's ownership
### What changes were proposed in this pull request?

In this change, we give preference to the original table's owner if it is not empty.

### Why are the changes needed?

When executing `INSERT INTO/OVERWRITE ...` DML or `ALTER TABLE ... SET TBLPROPERTIES ...` DDL, Spark would change the table's owner to the user running the Spark application.
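
A minimal sketch of the invariant this change restores; the table name is illustrative, and the issue is only observable when the table's original owner differs from the user running the Spark application:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

spark.sql("CREATE TABLE t_spark_29405 (i INT) USING hive")
val ownerBefore =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier("t_spark_29405")).owner

// Previously, a DDL like this could silently switch the owner to the current user.
spark.sql("ALTER TABLE t_spark_29405 SET TBLPROPERTIES ('k' = 'v')")
val ownerAfter =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier("t_spark_29405")).owner

assert(ownerBefore == ownerAfter)  // the original owner is preserved after this change
```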

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Compared the behavior with that of Apache Hive.

Closes #26068 from yaooqinn/SPARK-29405.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 16:21:31 +08:00
Jiajia Li dc0bc7a6eb [MINOR][DOCS] Fix some typos
### What changes were proposed in this pull request?

This PR fixes a few typos:
1. Sparks => Spark's
2. parallize => parallelize
3. doesnt => doesn't

Closes #26140 from plusplusjiajia/fix-typos.

Authored-by: Jiajia Li <jiajia.li@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-17 07:22:01 -07:00
Jose Torres 5a482e7209 [SPARK-29468][SQL] Change Literal.sql to be correct for floats
### What changes were proposed in this pull request?
Change Literal.sql to output CAST('fpValue' AS FLOAT) instead of CAST(fpValue AS FLOAT) as the SQL for a floating point literal.

### Why are the changes needed?
The old version doesn't work for very small floating point numbers; the value will fail to parse if it doesn't fit in a DECIMAL(38).

This doesn't apply to doubles because they have special literal syntax.
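
A small sketch of the behavior; the exact rendered string is an assumption based on the description above:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// Float.MinPositiveValue (about 1.4E-45) cannot be represented as a DECIMAL(38).
val tiny = Literal(Float.MinPositiveValue)

// After this change, Literal.sql quotes the value, e.g. CAST('1.4E-45' AS FLOAT),
// so the generated SQL re-parses correctly instead of failing in the decimal path.
println(tiny.sql)
```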

### Does this PR introduce any user-facing change?
Not really.

### How was this patch tested?
New unit tests.

Closes #26114 from jose-torres/fpliteral.

Authored-by: Jose Torres <joseph.torres@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-16 21:06:13 +08:00
Gengliang Wang 322ec0ba9b [SPARK-28885][SQL] Follow ANSI store assignment rules in table insertion by default
### What changes were proposed in this pull request?

When inserting a value into a column with a different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out of range (overflow).
2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1.
3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment, e.g., converting either `double` to `int` or `decimal` to `double` is not allowed. The rules were originally for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by default.

Currently, the V1 data source uses "Legacy" policy by default, while V2 uses "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 in Spark 3.0.
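
A minimal sketch of the difference, assuming a scratch parquet table (the table name is illustrative):

```scala
// The new default in Spark 3.0:
spark.sql("SET spark.sql.storeAssignmentPolicy=ANSI")
spark.sql("CREATE TABLE t_bytes (b BYTE) USING parquet")

// Under ANSI store assignment, an out-of-range value fails at runtime instead of
// being silently truncated to its low-order bits (257 would become 1 under LEGACY).
// spark.sql("INSERT INTO t_bytes VALUES (257)")   // throws under ANSI

// In-range assignments from wider numeric types still work.
spark.sql("INSERT INTO t_bytes VALUES (1)")
spark.sql("SELECT * FROM t_bytes").show()
```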

### Why are the changes needed?

Following the ANSI SQL standard is the most reasonable of the 3 policies.

### Does this PR introduce any user-facing change?

Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.

### How was this patch tested?

Unit test

Closes #26107 from gengliangwang/ansiPolicyAsDefault.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-15 10:41:37 -07:00
Wenchen Fan 8915966bf4 [SPARK-29473][SQL] move statement logical plans to a new file
### What changes were proposed in this pull request?

Move the statement logical plans that were created for v2 commands to a new file, `statements.scala`, under the same package as `v2Commands.scala`.

This PR also includes some minor cleanups:
1. remove `private[sql]` from `ParsedStatement` as it's in the private package.
2. remove unnecessary override of `output` and `children`.
3. add missing classdoc.

### Why are the changes needed?

Similar to https://github.com/apache/spark/pull/26111 , this is to better organize the logical plans of data source v2.

It's a bit weird to put the statements in the package `org.apache.spark.sql.catalyst.plans.logical.sql` as `sql` is not a good sub-package name in Spark SQL.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26125 from cloud-fan/statement.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-15 15:05:49 +02:00
Dongjoon Hyun abba53e78b [SPARK-27831][FOLLOWUP][SQL][TEST] ADDITIONAL_REMOTE_REPOSITORIES is a comma-delimited string
### What changes were proposed in this pull request?

This PR is a very minor follow-up to make the code robust, because `spark.sql.additionalRemoteRepositories` is a configuration whose value is a comma-separated list of repositories.

### Why are the changes needed?

This makes sure that `getHiveContribJar` will not fail when the configuration value changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual. Change the default value with multiple repositories and run the following.
```
build/sbt -Phive "project hive" "test-only org.apache.spark.sql.hive.HiveSparkSubmitSuite"
```

Closes #26096 from dongjoon-hyun/SPARK-27831.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-12 18:47:28 -07:00
Sean Owen 2d871ad0e7 [SPARK-29392][CORE][SQL][STREAMING] Remove symbol literal syntax 'foo, deprecated in Scala 2.13, in favor of Symbol("foo")
### What changes were proposed in this pull request?

Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")`
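
A before/after illustration of the mechanical replacement (the DataFrame here is just an example, not from the PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("symbols").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Deprecated symbol-literal syntax under Scala 2.13:
// df.select('id, 'value)

// Equivalent spelling used after this change:
df.select(Symbol("id"), Symbol("value")).show()
```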

### Why are the changes needed?

Avoids ~50 deprecation warnings when attempting to build with 2.13.

### Does this PR introduce any user-facing change?

None, should be no functional change at all.

### How was this patch tested?

Existing tests.

Closes #26061 from srowen/SPARK-29392.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-08 20:15:37 -07:00
Wenchen Fan 275e044ba8 [SPARK-29039][SQL] centralize the catalog and table lookup logic
### What changes were proposed in this pull request?

Currently we deal with different `ParsedStatement`s in many places and write duplicated catalog/table lookup logic. In general the lookup logic is:
1. Try to look up the catalog by name. If there is no such catalog and the default catalog is not set, convert the `ParsedStatement` to a v1 command like `ShowDatabasesCommand`. Otherwise, convert the `ParsedStatement` to a v2 command like `ShowNamespaces`.
2. Try to look up the table by name. If there is no such table, fail. If the table is a `V1Table`, convert the `ParsedStatement` to a v1 command like `CreateTable`. Otherwise, convert the `ParsedStatement` to a v2 command like `CreateV2Table`.

However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places.

This PR centralizes the catalog/table lookup logic with 3 rules (a schematic sketch follows this list).
1. `ResolveCatalogs` (in catalyst). This rule resolves the v2 catalog from the multipart identifier in SQL statements, and converts the statement to a v2 command if the resolved catalog is not the session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), it puts an `UnresolvedV2Table` in the command.
2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`.
3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is the session catalog. For commands that don't need to resolve the table, this rule converts the statement to a v1 command directly. Otherwise, it converts the statement to a v1 command if the resolved table is a v1 table, and to a v2 command if the resolved table is a v2 table. Hopefully we can remove this rule eventually when the v1 fallback is no longer needed.
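
A schematic, self-contained sketch of the lookup order described above; this is not the real analyzer rule, and the catalog names are illustrative:

```scala
// Toy model of the decision flow: v2 catalog in the name -> v2 command,
// otherwise default catalog (if set and v2) -> v2 command, otherwise v1 fallback.
object LookupSketch {
  private val v2Catalogs = Set("my_catalog", "testcat")    // assumed registered v2 catalogs
  private val defaultCatalog: Option[String] = None        // assume no default catalog is set

  def resolve(nameParts: Seq[String]): String = nameParts match {
    case cat +: _ if v2Catalogs.contains(cat) =>
      s"v2 command against catalog '$cat'"
    case _ if defaultCatalog.exists(v2Catalogs.contains) =>
      s"v2 command against the default catalog '${defaultCatalog.get}'"
    case _ =>
      "v1 command against the session catalog (ResolveSessionCatalog handles v1 fallback)"
  }

  def main(args: Array[String]): Unit = {
    println(resolve(Seq("my_catalog", "db", "t")))   // v2 command
    println(resolve(Seq("db", "t")))                 // v1 fallback
  }
}
```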

### Why are the changes needed?

Reduce duplicated code and make the catalog/table lookup logic consistent.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #25747 from cloud-fan/lookup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-04 16:21:13 +08:00