ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dongjoon Hyun	75afd88904	Revert "[SPARK-31926][SQL][TEST-HIVE1.2][TEST-MAVEN] Fix concurrency issue for ThriftCLIService to getPortNumber" This reverts commit `a0187cd6b5`.	2020-06-15 19:04:23 -07:00
Kent Yao	a0187cd6b5	[SPARK-31926][SQL][TEST-HIVE1.2][TEST-MAVEN] Fix concurrency issue for ThriftCLIService to getPortNumber ### What changes were proposed in this pull request? This PR brings `02f32cfae4` back which reverted by `4a25200cd7` because of maven test failure diffs newly made: 1. add a missing log4j file to test resources 2. Call `SessionState.detachSession()` to clean the thread local one in `afterAll`. 3. Not use dedicated JVMs for sbt test runner too ### Why are the changes needed? fix the maven test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? add new tests Closes #28797 from yaooqinn/SPARK-31926-NEW. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-15 06:10:24 +00:00
Dongjoon Hyun	4a25200cd7	Revert "[SPARK-31926][SQL][TEST-HIVE1.2] Fix concurrency issue for ThriftCLIService to getPortNumber" This reverts commit `02f32cfae4`.	2020-06-10 17:21:03 -07:00
Kent Yao	02f32cfae4	[SPARK-31926][SQL][TEST-HIVE1.2] Fix concurrency issue for ThriftCLIService to getPortNumber ### What changes were proposed in this pull request? When` org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext` called, it starts `ThriftCLIService` in the background with a new Thread, at the same time we call `ThriftCLIService.getPortNumber,` we might not get the bound port if it's configured with 0. This PR moves the TServer/HttpServer initialization code out of that new Thread. ### Why are the changes needed? Fix concurrency issue, improve test robustness. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? add new tests Closes #28751 from yaooqinn/SPARK-31926. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-09 16:49:40 +00:00
Juliusz Sompolski	af35691de4	[SPARK-31859][SPARK-31861][SPARK-31863] Fix Thriftserver session timezone issues ### What changes were proposed in this pull request? Timestamp literals in Spark are interpreted as timestamps in local timezone spark.sql.session.timeZone. If JDBC client is e.g. in TimeZone UTC-7, and sets spark.sql.session.timeZone to PST, and sends a query "SELECT timestamp '2020-05-20 12:00:00'", and the JVM timezone of the Spark cluster is e.g. UTC+2, then what currently happens is: * The timestamp literal in the query is interpreted as 12:00:00 UTC-7, i.e. 19:00:00 UTC. * When it's returned from the query, it is collected as a java.sql.Timestamp object with Dataset.collect(), and put into a Thriftserver RowSet. * Before sending it over the wire, the Timestamp is converted to String. This happens in explicitly in ColumnValue for RowBasedSet, and implicitly in ColumnBuffer for ColumnBasedSet (all non-primitive types are converted toString() there). The conversion toString uses JVM timezone, which results in a "21:00:00" (UTC+2) string representation. * The client JDBC application parses gets a "21:00:00" Timestamp back (in it's JVM timezone; if the JDBC application cares about the correct UTC internal value, it should set spark.sql.session.timeZone to be consistent with its JVM timezone) The problem is caused by the conversion happening in Thriftserver RowSet with the generic toString() function, instead of using HiveResults.toHiveString() that takes care of correct, timezone respecting conversions. This PR fixes it by converting the Timestamp values to String earlier, in SparkExecuteStatementOperation, using that function. This fixes SPARK-31861. Thriftserver also did not work spark.sql.datetime.java8API.enabled, because the conversions in RowSet expected an Timestamp object instead of Instant object. Using HiveResults.toHiveString() also fixes that. For this reason, we also convert Date values in SparkExecuteStatementOperation as well - so that HiveResults.toHiveString() handles LocalDate as well. This fixes SPARK-31859. Thriftserver also did not correctly set the active SparkSession. Because of that, configuration obtained using SQLConf.get was not the correct session configuration. This affected getting the correct spark.sql.session.timeZone. It is fixed by extending the use of SparkExecuteStatementOperation.withSchedulerPool to also set the correct active SparkSession. When the correct session is set, we also no longer need to maintain the pool mapping in a sessionToActivePool map. The scheduler pool can be just correctly retrieved from the session config. "withSchedulerPool" is renamed to "withLocalProperties" and moved into a mixin helper trait, because it should be applied with every operation. This fixes SPARK-31863. I used the opportunity to move some repetitive code from the operations to the mixin helper trait. Closes #28671 from juliuszsompolski/SPARK-31861. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-30 06:14:32 +00:00
Kent Yao	fe1da296da	[SPARK-31833][SQL][TEST-HIVE1.2] Set HiveThriftServer2 with actual port while configured 0 ### What changes were proposed in this pull request? When I was developing some stuff based on the `DeveloperAPI ` `org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext`, I need to use thrift port randomly to avoid race on ports. But the `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` do not respond to me with the actual bound port but always 0. And the server log is not right too, after starting the server, it's hard to form to the right JDBC connection string. ``` INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 0 with 5...500 worker threads ``` Indeed, the `53742` is the right port ```shell lsof -nP -p `cat ./pid/spark-kentyao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid` \| grep LISTEN java 18990 kentyao 288u IPv6 0x22858e3e60d6a0a7 0t0 TCP 10.242.189.214:53723 (LISTEN) java 18990 kentyao 290u IPv6 0x22858e3e60d68827 0t0 TCP :4040 (LISTEN) java 18990 kentyao 366u IPv6 0x22858e3e60d66987 0t0 TCP 10.242.189.214:53724 (LISTEN) java 18990 kentyao 438u IPv6 0x22858e3e60d65d47 0t0 TCP :53742 (LISTEN) ``` In the PR, when the port is configured 0, the `portNum` will be set to the real used port during the start process. Also use 0 in thrift related tests to avoid potential flakiness. ### Why are the changes needed? 1 fix API bug 2 reduce test flakiness ### Does this PR introduce _any_ user-facing change? yes, `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` will always give you the actual port when it is configured 0. ### How was this patch tested? modified unit tests Closes #28651 from yaooqinn/SPARK-31833. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-29 04:13:23 +00:00
Ali Smesseim	d40ecfa3f7	[SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener ### What changes were proposed in this pull request? This is a recreation of #28155, which was reverted due to causing test failures. The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log. To improve robustness, we also make the following changes in HiveSessionImpl.close(): - Catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed. - Handle not being able to access the scratch directory. When closing, all `.pipeout` files are removed from the scratch directory, which would have resulted in an NPE if the directory does not exist. ### Why are the changes needed? The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this changes the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests Closes #28544 from alismess-db/hive-thriftserver-listener-update-safer-2. Authored-by: Ali Smesseim <ali.smesseim@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-20 10:30:17 -07:00
Dongjoon Hyun	bbb62c5405	Revert "[SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener" This reverts commit `6994c64efd`.	2020-05-14 12:01:03 -07:00
Ali Smesseim	6994c64efd	[SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener ### What changes were proposed in this pull request? The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log. Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed. ### Why are the changes needed? The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this hampers with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer. Authored-by: Ali Smesseim <ali.smesseim@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-05-12 09:14:34 -07:00
Christian Stuart	bcce1b1040	[SPARK-30904][SQL] Thrift RowBasedSet serialization throws NullPointerException on NULL BigDecimal ### What changes were proposed in this pull request? This PR fixes SPARK-30904 by adding a null check. ### Why are the changes needed? For HIVE_CLI_SERVICE_PROTOCOL_V5 and below, serialization fails on NULL-containing decimal columns, caused by a call to `value.toPlainString()`, where `value` might be null. This null check fixes it. ### Does this PR introduce any user-facing change? No ### How was this patch tested? A test was added for serialization of NULL decimals for all HIVE_CLI_SERVICE_PROTOCOL versions. Closes #27654 from CJStuart/SPARK-30904. Authored-by: Christian Stuart <christian.stuart@databricks.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2020-02-21 21:39:35 -07:00
Sean Owen	789a4abfa9	[MINOR][HIVE] Pick up HIVE-22708 HTTP transport fix ### What changes were proposed in this pull request? Pick up the HTTP fix from https://issues.apache.org/jira/browse/HIVE-22708 ### Why are the changes needed? This is a small but important fix to digest handling we should pick up from Hive. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests Closes #27273 from srowen/Hive22708. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-18 11:50:59 -08:00
Ajith	2be5286828	[SPARK-30382][SQL] Remove Hive LogUtils usage to prevent ClassNotFoundException Avoid hive log initialisation as https://github.com/apache/hive/blob/rel/release-2.3.5/common/src/java/org/apache/hadoop/hive/common/LogUtils.java introduces dependency over `org.apache.logging.log4j.core.impl.Log4jContextFactory` which is missing in our spark installer classpath directly. I believe the `LogUtils.initHiveLog4j()` code is here as the HiveServer2 class is copied from Hive. To make `start-thriftserver.sh --help` command success. Currently, start-thriftserver.sh --help throws ``` ... Thrift server options: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/logging/log4j/spi/LoggerContextFactory at org.apache.hive.service.server.HiveServer2.main(HiveServer2.java:167) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:82) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.spi.LoggerContextFactory at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 3 more ``` No Checked Manually Closes #27042 from ajithme/thrifthelp. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-07 14:26:04 -08:00
Dongjoon Hyun	6625b69027	[SPARK-29981][BUILD][FOLLOWUP] Change hive.version.short ### What changes were proposed in this pull request? This is a follow-up according to liancheng 's advice. - https://github.com/apache/spark/pull/26619#discussion_r349326090 ### Why are the changes needed? Previously, we chose the full version to be carefully. As of today, it seems that `Apache Hive 2.3` branch seems to become stable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the compile combination on GitHub Action. 1. hadoop-2.7/hive-1.2/JDK8 2. hadoop-2.7/hive-2.3/JDK8 3. hadoop-3.2/hive-2.3/JDK8 4. hadoop-3.2/hive-2.3/JDK11 Also, pass the Jenkins with `hadoop-2.7` and `hadoop-3.2` for (1) and (4). (2) and (3) is not ready in Jenkins. Closes #26645 from dongjoon-hyun/SPARK-RENAME-HIVE-DIRECTORY. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-23 12:50:50 -08:00

13 commits