sandeep.katta b0322bf05a [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock
### What changes were proposed in this pull request?

There is no need to use the database name in the `loadPartition` API of `Shim_v3_0` to get the Hive table; Hive provides an overloaded method that returns the table by table name alone. By using that API, the dependency on `SessionCatalog` can be removed from the shim layer.
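As an illustration, a simplified sketch of the table lookup inside the shim (assuming Hive's `org.apache.hadoop.hive.ql.metadata.Hive.getTable(String)` overload, which resolves the database from the Hive session state; the real `Shim_v3_0.loadPartition` takes more parameters than shown here):

```scala
import org.apache.hadoop.hive.ql.metadata.{Hive, Table}

// Sketch only: shows just the table-lookup step of Shim_v3_0.loadPartition.
def lookupTable(hive: Hive, tableName: String): Table = {
  // Before the fix (simplified): the database name came from the
  // synchronized SessionCatalog.getCurrentDatabase while the thread
  // already held the HiveExternalCatalog monitor:
  //   hive.getTable(catalog.getCurrentDatabase, tableName)
  // After the fix: the single-argument overload needs no SessionCatalog
  // call at all, so no second monitor is acquired.
  hive.getTable(tableName)
}
```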

### Why are the changes needed?
To avoid a deadlock when communicating with a Hive 3.1.x metastore:
```
Found one Java-level deadlock:
=============================
"worker3":
  waiting to lock monitor 0x00007faf0be602b8 (object 0x00000007858f85f0, a org.apache.spark.sql.hive.HiveSessionCatalog),
  which is held by "worker0"
"worker0":
  waiting to lock monitor 0x00007faf0be5fc88 (object 0x0000000785c15c80, a org.apache.spark.sql.hive.HiveExternalCatalog),
  which is held by "worker3"

Java stack information for the threads listed above:
===================================================
"worker3":
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCurrentDatabase(SessionCatalog.scala:256)
  - waiting to lock <0x00000007858f85f0> (a org.apache.spark.sql.hive.HiveSessionCatalog)
  at org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1332)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:870)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$4459/1387095575.apply$mcV$sp(Unknown Source)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$2227/313239499.apply(Unknown Source)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
  - locked <0x0000000785ef9d78> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:860)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:911)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$4457/2037578495.apply$mcV$sp(Unknown Source)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  - locked <0x0000000785c15c80> (a org.apache.spark.sql.hive.HiveExternalCatalog)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:512)
  at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:383)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  - locked <0x00000007b1690ff8> (a org.apache.spark.sql.execution.command.ExecutedCommandExec)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$$Lambda$2084/428667685.apply(Unknown Source)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset$$Lambda$2085/559530590.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2093/139449177.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2086/1088974677.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.Dataset$$$Lambda$1959/1977822284.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
  at org.apache.spark.sql.SparkSession$$Lambda$1899/424830920.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
  at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1.run(<console>:45)
  at java.lang.Thread.run(Thread.java:748)
"worker0":
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  - waiting to lock <0x0000000785c15c80> (a org.apache.spark.sql.hive.HiveExternalCatalog)
  at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:851)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:146)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:432)
  - locked <0x00000007858f85f0> (a org.apache.spark.sql.hive.HiveSessionCatalog)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:509)
  at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:383)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  - locked <0x00000007b529af58> (a org.apache.spark.sql.execution.command.ExecutedCommandExec)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$$Lambda$2084/428667685.apply(Unknown Source)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset$$Lambda$2085/559530590.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2093/139449177.apply(Unknown Source)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2086/1088974677.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.Dataset$$$Lambda$1959/1977822284.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
  at org.apache.spark.sql.SparkSession$$Lambda$1899/424830920.apply(Unknown Source)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
  at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1.run(<console>:45)
  at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
```
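The trace shows a lock-order inversion: worker3 takes the `HiveExternalCatalog` monitor in `withClient` and then, inside `Shim_v3_0.loadPartition`, calls the synchronized `SessionCatalog.getCurrentDatabase`, while worker0 takes the `HiveSessionCatalog` monitor in the synchronized `SessionCatalog.tableExists` and then calls into the synchronized `HiveExternalCatalog.withClient`. A minimal Scala sketch of the inverted acquisition order (the locks and method bodies are stand-ins, not the actual Spark code):

```scala
object LockInversionSketch {
  // Stand-ins for the two monitors involved in the deadlock.
  private val sessionCatalog = new Object  // HiveSessionCatalog monitor
  private val externalCatalog = new Object // HiveExternalCatalog monitor

  // worker0's path: SessionCatalog.tableExists (synchronized)
  //   -> HiveExternalCatalog.withClient (synchronized)
  def tableExists(): Unit = sessionCatalog.synchronized {
    externalCatalog.synchronized { /* ask the metastore */ }
  }

  // worker3's path: HiveExternalCatalog.withClient (synchronized)
  //   -> Shim_v3_0.loadPartition -> SessionCatalog.getCurrentDatabase (synchronized)
  def loadPartition(): Unit = externalCatalog.synchronized {
    sessionCatalog.synchronized { /* resolve the current database */ }
  }
}
```

Dropping the `getCurrentDatabase` call from the shim removes the inner acquisition on worker3's path, so the two monitors are no longer taken in opposite orders and the cycle cannot form.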

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by running the script below in spark-shell; no deadlock was observed.

Launch spark-shell using:

```
./bin/spark-shell \
  --conf spark.sql.hive.metastore.jars=maven \
  --conf spark.sql.hive.metastore.version=3.1 \
  --conf spark.hadoop.datanucleus.schema.autoCreateAll=true
```

**code**
```scala
def testHiveDeadLock(): Unit = {
  import scala.collection.mutable.ArrayBuffer
  import scala.util.Random
  println("test hive DeadLock")
  spark.sql("drop database if exists testDeadLock cascade")
  spark.sql("create database testDeadLock")
  spark.sql("use testDeadLock")
  val tableCount = 100
  val tableNamePrefix = "testdeadlock"
  for (i <- 0 until tableCount) {
    val tableName = s"$tableNamePrefix${i + 1}"
    spark.sql(s"drop table if exists $tableName")
    spark.sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored as orc")
  }

  // One LOAD DATA per table, all running concurrently, so that
  // loadPartition and tableExists race on the catalog monitors.
  val threads = new ArrayBuffer[Thread]
  for (i <- 0 until tableCount) {
    threads.append(new Thread(new Runnable {
      override def run(): Unit = {
        val tableName = s"$tableNamePrefix${i + 1}"
        val df = spark.range(0, 20000).toDF("a")
        val location = s"/tmp/${Random.nextLong.abs}"
        df.write.mode("overwrite").orc(location)
        spark.sql(
          s"LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition (b=$i)")
      }
    }, s"worker$i"))
    threads(i).start()
  }

  for (i <- 0 until tableCount) {
    println(s"Joining with thread $i")
    threads(i).join()
  }
  for (i <- 0 until tableCount) {
    val tableName = s"$tableNamePrefix${i + 1}"
    spark.sql(s"select count(*) from $tableName").show(false)
  }
  println("All done")
}

for (i <- 0 until 100) {
  testHiveDeadLock()
  println(s"completed ${i}th iteration")
}
```

Closes #29649 from sandeep-katta/metastore3.1DeadLock.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-07 15:10:33 +09:00
catalyst [SPARK-32677][SQL] Load function resource before create 2020-09-07 06:00:23 +00:00
core [SPARK-32677][SQL] Load function resource before create 2020-09-07 06:00:23 +00:00
hive [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock 2020-09-07 15:10:33 +09:00
hive-thriftserver [SPARK-32772][SQL][FOLLOWUP] Remove legacy silent support mode for spark-sql CLI 2020-09-04 08:38:35 +08:00
create-docs.sh [SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc 2020-04-27 17:08:52 +09:00
gen-sql-api-docs.py [SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with functionname in the note field of expression info 2020-04-23 13:33:04 +09:00
gen-sql-config-docs.py [SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc 2020-04-27 17:08:52 +09:00
gen-sql-functions-docs.py [SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp 2020-04-26 11:46:52 -07:00
mkdocs.yml [SPARK-30731] Update deprecated Mkdocs option 2020-02-19 17:28:58 +09:00
README.md [SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options 2020-02-09 19:20:47 +09:00

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
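For example, a minimal, self-contained sketch of running a query through this module (using `SparkSession`, the modern entry point that supersedes `SQLContext`):

```scala
import org.apache.spark.sql.SparkSession

// Start a local session backed by sql/core's planner and execution engine.
val spark = SparkSession.builder()
  .appName("sql-module-example")
  .master("local[*]")
  .getOrCreate()

// The same query can be expressed in SQL or with the Dataset API.
spark.range(10).createOrReplaceTempView("t")
spark.sql("SELECT count(*) FROM t").show()
spark.range(10).groupBy().count().show()

spark.stop()
```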

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Running ./sql/create-docs.sh generates SQL documentation for built-in functions under sql/site, and SQL configuration documentation that gets included as part of configuration.md in the main docs directory.