Commit graph

479 commits

Author SHA1 Message Date
HyukjinKwon 4a096131ee Revert "[SPARK-32772][SQL][FOLLOWUP] Remove legacy silent support mode for spark-sql CLI"
This reverts commit f1f7ae420e.
2020-09-10 14:23:10 +09:00
Kent Yao 9ab8a2c36d [SPARK-32826][SQL] Set the right column size for the null type in SparkGetColumnsOperation
### What changes were proposed in this pull request?

In Spark 3.0.0, the SparkGetColumnsOperation can not recognize NULL columns but now we can because the side effect of https://issues.apache.org/jira/browse/SPARK-32696 / f14f3742e0, but the test coverage for this change was not added.

In Spark, the column size for null fields should be 1, in this PR, we set the right column size for the null type.

### Why are the changes needed?

test coverage and fix the client-side information about the null type through jdbc

### Does this PR introduce _any_ user-facing change?

NO
### How was this patch tested?

added ut both for this pr and SPARK-32696

Closes #29687 from yaooqinn/SPARK-32826.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 04:53:38 +00:00
Yuming Wang f1f7ae420e [SPARK-32772][SQL][FOLLOWUP] Remove legacy silent support mode for spark-sql CLI
### What changes were proposed in this pull request?

Remove legacy silent support mode for spark-sql CLI.

### Why are the changes needed?

https://github.com/apache/spark/pull/29619 add new silent mode. We can remove legacy silent support mode.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test:
```
spark-sql> LM-SHC-16508156:spark yumwang$ bin/spark-sql -S
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
20/09/03 09:06:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/09/03 09:06:16 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/09/03 09:06:16 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/09/03 09:06:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
20/09/03 09:06:19 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore yumwang10.226.196.190
spark-sql> select * from test1;
1
spark-sql> select * from test1;
1

```

Closes #29631 from wangyum/SPARK-32772.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-09-04 08:38:35 +08:00
Kousuke Saruta ad6b887541 [SPARK-32772][SQL] Reduce log messages for spark-sql CLI
### What changes were proposed in this pull request?

This PR reduces log messages for spark-sql CLI like spark-shell and pyspark CLI.

### Why are the changes needed?

When we launch spark-sql CLI, too many log messages are shown and it's sometimes difficult to find the result of query.
```
spark-sql> SELECT now();
20/09/02 00:11:45 INFO CodeGenerator: Code generated in 10.121625 ms
20/09/02 00:11:45 INFO SparkContext: Starting job: main at NativeMethodAccessorImpl.java:0
20/09/02 00:11:45 INFO DAGScheduler: Got job 0 (main at NativeMethodAccessorImpl.java:0) with 1 output partitions
20/09/02 00:11:45 INFO DAGScheduler: Final stage: ResultStage 0 (main at NativeMethodAccessorImpl.java:0)
20/09/02 00:11:45 INFO DAGScheduler: Parents of final stage: List()
20/09/02 00:11:45 INFO DAGScheduler: Missing parents: List()
20/09/02 00:11:45 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at main at NativeMethodAccessorImpl.java:0), which has no missing parents
20/09/02 00:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 6.3 KiB, free 366.3 MiB)
20/09/02 00:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.2 KiB, free 366.3 MiB)
20/09/02 00:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.204:42615 (size: 3.2 KiB, free: 366.3 MiB)
20/09/02 00:11:45 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1348
20/09/02 00:11:45 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at main at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/09/02 00:11:45 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
20/09/02 00:11:45 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.1.204, executor driver, partition 0, PROCESS_LOCAL, 7561 bytes) taskResourceAssignments Map()
20/09/02 00:11:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/09/02 00:11:45 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1446 bytes result sent to driver
20/09/02 00:11:45 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 238 ms on 192.168.1.204 (executor driver) (1/1)
20/09/02 00:11:45 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/09/02 00:11:45 INFO DAGScheduler: ResultStage 0 (main at NativeMethodAccessorImpl.java:0) finished in 0.343 s
20/09/02 00:11:45 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
20/09/02 00:11:45 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
20/09/02 00:11:45 INFO DAGScheduler: Job 0 finished: main at NativeMethodAccessorImpl.java:0, took 0.377489 s
2020-09-02 00:11:45.07
Time taken: 0.704 seconds, Fetched 1 row(s)
20/09/02 00:11:45 INFO SparkSQLCLIDriver: Time taken: 0.704 seconds, Fetched 1 row(s)
```

### Does this PR introduce _any_ user-facing change?

Yes. Log messages are reduced for spark-sql CLI like as follows.
```
20/09/02 00:34:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/09/02 00:34:53 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/09/02 00:34:53 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/09/02 00:34:55 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
20/09/02 00:34:55 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore kou192.168.1.204
Spark master: local[*], Application Id: local-1598974492822
spark-sql> SELECT now();
2020-09-02 00:35:05.258
Time taken: 2.299 seconds, Fetched 1 row(s)
```

### How was this patch tested?

Launched spark-sql CLI and confirmed that log messages are reduced as I paste above.

Closes #29619 from sarutak/suppress-log-for-spark-sql.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-02 13:31:06 -07:00
Ali Smesseim 3cde392b69 [SPARK-31831][SQL][FOLLOWUP] Make the GetCatalogsOperationMock for HiveSessionImplSuite compile with the proper Hive version
### What changes were proposed in this pull request?
#29129 duplicated GetCatalogsOperationMock in the hive-version-specific subdirectories, otherwise profile hive-1.2 would not compile. We can prevent duplication of this class by shimming the required hive-version-specific types.

### Why are the changes needed?
This is a cleanup to avoid duplication of a mock class.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This patch only changes tests.

Closes #29549 from alismess-db/get-catalogs-operation-mock-use-shim.

Authored-by: Ali Smesseim <ali.smesseim@databricks.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-09-02 20:23:57 +08:00
angerszhu 55ce49ed28 [SPARK-32400][SQL][TEST][FOLLOWUP][TEST-MAVEN] Fix resource loading error in HiveScripTransformationSuite
### What changes were proposed in this pull request?
#29401 move `test_script.py` from sql/hive module to sql/core module, cause HiveScripTransformationSuite load resource issue.

### Why are the changes needed?
This issue cause jenkins test failed in mvn

spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/
spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/
spark-master-test-maven-hadoop-3.2-hive-2.3:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3/
![image](https://user-images.githubusercontent.com/46485123/91681585-71285a80-eb81-11ea-8519-99fc9783d6b9.png)

![image](https://user-images.githubusercontent.com/46485123/91681010-aaf86180-eb7f-11ea-8dbb-61365a3b0ab4.png)

Error as below:
```
 Exception thrown while executing Spark plan:
 HiveScriptTransformation [a#349299, b#349300, c#349301, d#349302, e#349303], python /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py, [a#349309, b#349310, c#349311, d#349312, e#349313], ScriptTransformationIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
+- Project [_1#349288 AS a#349299, _2#349289 AS b#349300, _3#349290 AS c#349301, _4#349291 AS d#349302, _5#349292 AS e#349303]
   +- LocalTableScan [_1#349288, _2#349289, _3#349290, _4#349291, _5#349292]

 == Exception ==
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18021.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18021.0 (TID 37324) (192.168.10.31 executor driver): org.apache.spark.SparkException: Subprocess exited with status 2. Error: python: can't open file '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py': [Errno 2] No such file or directory

 at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate(BaseScriptTransformationExec.scala:180)
 at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate$(BaseScriptTransformationExec.scala:157)
 at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec.checkFailureAndPropagate(HiveScriptTransformationExec.scala:49)
 at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec$$anon$1.hasNext(HiveScriptTransformationExec.scala:110)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
 at o
```
### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Existed UT

Closes #29588 from AngersZhuuuu/SPARK-32400-FOLLOWUP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-02 18:27:29 +09:00
Kent Yao 6dacba7fa0 [SPARK-32733][SQL] Add extended information - arguments/examples/since/notes of expressions to the remarks field of GetFunctionsOperation
### What changes were proposed in this pull request?

This PR adds extended information of a function including arguments, examples, notes and the since field to the SparkGetFunctionOperation

### Why are the changes needed?

better user experience, it will help JDBC users to have a better understanding of our builtin functions

### Does this PR introduce _any_ user-facing change?

Yes, BI tools and JDBC users will get full information on a spark function instead of only fragmentary usage info.

e.g. date_part

#### before

```
date_part(field, source) - Extracts a part of the date/timestamp or interval source.
```
#### after

```
    Usage:
      date_part(field, source) - Extracts a part of the date/timestamp or interval source.

    Arguments:
      * field - selects which part of the source should be extracted, and supported string values are as same as the fields of the equivalent function `EXTRACT`.
      * source - a date/timestamp or interval column from where `field` should be extracted

    Examples:
      > SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
       2019
      > SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456');
       33
      > SELECT date_part('doy', DATE'2019-08-12');
       224
      > SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
       1.000001
      > SELECT date_part('days', interval 1 year 10 months 5 days);
       5
      > SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
       30.001001

    Note:
      The date_part function is equivalent to the SQL-standard function `EXTRACT(field FROM source)`

    Since: 3.0.0

```

### How was this patch tested?

New tests

Closes #29577 from yaooqinn/SPARK-32733.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 11:03:01 +09:00
Kent Yao f14f3742e0 [SPARK-32696][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] Get columns operation should handle interval column properly
### What changes were proposed in this pull request?

This PR let JDBC clients identify spark interval columns properly.

### Why are the changes needed?

JDBC users can query interval values through thrift server, create views with interval columns, e.g.
```sql
CREATE global temp view view1 as select interval 1 day as i;
```
but when they want to get the details of the columns of view1, the will fail with `Unrecognized type name: INTERVAL`

```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: INTERVAL
	at org.apache.hadoop.hive.serde2.thrift.Type.getType(Type.java:170)
	at org.apache.spark.sql.hive.thriftserver.ThriftserverShimUtils$.toJavaSQLType(ThriftserverShimUtils.scala:53)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:157)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:149)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6(SparkGetColumnsOperation.scala:113)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6$adapted(SparkGetColumnsOperation.scala:112)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:112)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:111)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:111)
	... 34 more
```
### Does this PR introduce _any_ user-facing change?

YES,
#### before
![image](https://user-images.githubusercontent.com/8326978/91162239-6cd1ec80-e6fe-11ea-8c2c-914ddb325c4e.png)

#### after
![image](https://user-images.githubusercontent.com/8326978/91162025-1a90cb80-e6fe-11ea-94c4-03a6f2ec296b.png)

### How was this patch tested?

new tests

Closes #29539 from yaooqinn/SPARK-32696.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-27 06:52:34 +00:00
Kent Yao c26a97637f Revert "[SPARK-32412][SQL] Unify error handling for spark thrift serv…
…er operations"

### What changes were proposed in this pull request?

This reverts commit 510a1656e6.

### Why are the changes needed?

see https://github.com/apache/spark/pull/29204#discussion_r475716547

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

pass ci tools

Closes #29531 from yaooqinn/revert.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-25 05:57:14 +00:00
Kent Yao 1b6f482adb [SPARK-32492][SQL][FOLLOWUP][TEST-MAVEN] Fix jenkins maven jobs
### What changes were proposed in this pull request?

The newly added test fails Jenkins maven jobs, see https://github.com/apache/spark/pull/29303#discussion_r464729021

We move the test from `ThriftServerWithSparkContextSuite` to `SparkMetadataOperationSuite`, the former uses an embedded thrift server where the server and the client are in the same JVM process and the latter forks a new process to start the server where the server and client are isolated.
The sbt runner seems to be fine with the test in the `ThriftServerWithSparkContextSuite`, but the maven runner with `scalates`t plugin will face the classloader issue as we will switch classloader to the one in the `sharedState` which is not the one that hive uses to load some classes. This is more like an issue that belongs to the maven runner or the `scalatest`.
So in this PR, we simply move it to bypass the issue.

BTW, we should test against the way of using embedded thrift server to verify whether it is just a maven issue or not, there could be some use cases with this API.

### Why are the changes needed?

Jenkins recovery

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

modified uts

Closes #29347 from yaooqinn/SPARK-32492-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-05 13:56:11 +00:00
Kent Yao 7f5326c082 [SPARK-32492][SQL] Fulfill missing column meta information COLUMN_SIZE /DECIMAL_DIGITS/NUM_PREC_RADIX/ORDINAL_POSITION for thriftserver client tools
### What changes were proposed in this pull request?

This PR fulfills some missing fields for SparkGetColumnsOperation including COLUMN_SIZE /DECIMAL_DIGITS/NUM_PREC_RADIX/ORDINAL_POSITION

and improve the test coverage.

### Why are the changes needed?

make jdbc tools happier

### Does this PR introduce _any_ user-facing change?

yes,

#### before
![image](https://user-images.githubusercontent.com/8326978/88911764-e78b2180-d290-11ea-8abb-96f137f9c3c4.png)

#### after

![image](https://user-images.githubusercontent.com/8326978/88911709-d04c3400-d290-11ea-90ab-02bda3e628e9.png)

![image](https://user-images.githubusercontent.com/8326978/88912007-39cc4280-d291-11ea-96d6-1ef3abbbddec.png)

### How was this patch tested?

add unit tests

Closes #29303 from yaooqinn/SPARK-32492.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-03 12:49:34 +00:00
Kent Yao 510a1656e6 [SPARK-32412][SQL] Unify error handling for spark thrift server operations
### What changes were proposed in this pull request?

Log error/warn message only once at the server-side for both sync and async modes

### Why are the changes needed?

In b151194299 we make the error logging for  SparkExecuteStatementOperation with `runInBackground=true` not duplicated, but the operations with runInBackground=false and other metadata operation still will be log twice which happened in the operation's `runInternal` method and ThriftCLIService.

In this PR, I propose to reflect the logic to get a unified error handling approach.

### Does this PR introduce _any_ user-facing change?

Yes, when spark.sql.hive.thriftServer.async=false and people call sync APIs the error message will be logged only once at server-side.
### How was this patch tested?

locally verified the result in target/unit-test.log

add unit tests.

Closes #29204 from yaooqinn/SPARK-32412.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-30 06:44:13 +00:00
sychen be9f03dc71 [SPARK-32426][SQL] ui shows sql after variable substitution
### What changes were proposed in this pull request?
When submitting sql with variables, the sql displayed by ui is not replaced by variables.

### Why are the changes needed?
See the final executed sql in ui

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual test

Closes #29221 from cxzl25/SPARK-32426.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 03:30:01 -07:00
Sean Owen be2eca22e9 [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+
### What changes were proposed in this pull request?

Updates to scalatest 3.2.0. Though it looks large, it is 99% changes to the new location of scalatest classes.

### Why are the changes needed?

3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.

### Does this PR introduce _any_ user-facing change?

No, only affects tests.

### How was this patch tested?

Existing tests.

Closes #29196 from srowen/SPARK-32398.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 16:20:17 -07:00
Kent Yao b151194299 [SPARK-32392][SQL] Reduce duplicate error log for executing sql statement operation in thrift server
### What changes were proposed in this pull request?

This PR removes the duplicated error log which has been logged in `org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation#execute` but logged again in `runInternal`.

Besides, the log4j configuration for SparkExecuteStatementOperation is turned off because it's not very friendly for Jenkins

### Why are the changes needed?

remove the duplicated error log for better  user experience

### Does this PR introduce _any_ user-facing change?

Yes, less log in thrift server's driver log

### How was this patch tested?

locally verified the result in target/unit-test.log

Closes #29189 from yaooqinn/SPARK-32392.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-23 10:12:13 +09:00
Kent Yao 29b7eaa438 [MINOR][SQL] Fix warning message for ThriftCLIService.GetCrossReference and GetPrimaryKeys
### What changes were proposed in this pull request?

This PR fixes the warning message for ThriftCLIService.GetCrossReference and GetPrimaryKeys

### Why are the changes needed?

Although we haven't had our own implementation for these thrift APIs, but it still worth logging the right message when people call them wrongly.

### Does this PR introduce _any_ user-facing change?

yes, the driver log for the thrift server will log the right message for the ThriftCLIService.GetCrossReference and GetPrimaryKeys APIs

### How was this patch tested?

passing Jenkins.

Closes #29184 from yaooqinn/minor.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-22 08:15:04 +00:00
Sean Owen ee624821a9 [SPARK-29292][YARN][K8S][MESOS] Fix Scala 2.13 compilation for remaining modules
### What changes were proposed in this pull request?

See again the related PRs like https://github.com/apache/spark/pull/28971
This completes fixing compilation for 2.13 for all but `repl`, which is a separate task.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29147 from srowen/SPARK-29292.4.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-18 15:08:00 -07:00
Kent Yao efa70b8755 [SPARK-32145][SQL][FOLLOWUP] Fix type in the error log of SparkOperation
### What changes were proposed in this pull request?

Fix typo error in the error log of SparkOperation trait, reported by https://github.com/apache/spark/pull/28963#discussion_r454954542

### Why are the changes needed?

fix error in thrift server driver log

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Passing GitHub actions

Closes #29140 from yaooqinn/SPARK-32145-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-17 04:50:26 +00:00
Frank Yin 9747e8fc9d [SPARK-31831][SQL][TESTS][FOLLOWUP] Put mocks for HiveSessionImplSuite in hive version related subdirectories
### What changes were proposed in this pull request?

This patch fixes the build issue on Hive 1.2 profile brought by #29069, via putting mocks for HiveSessionImplSuite in hive version related subdirectories, so that maven build will pick up the proper source code according to the profile.

### Why are the changes needed?

#29069 fixed the flakiness of HiveSessionImplSuite, but given the patch relied on the default profile (Hive 2.3) it broke the build with Hive 1.2 profile. This patch addresses both Hive versions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually confirmed the test suite via below command:

> Hive 1.2
```
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite test -Phive-1.2 -Phadoop-2.7 -Phive-thriftserver
```

> Hive 2.3

```
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite test -Phive-2.3 -Phadoop-3.2 -Phive-thriftserver
```

Closes #29129 from frankyin-factual/hive-tests.

Authored-by: Frank Yin <frank@factual.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-17 11:14:25 +09:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
Holden Karau 90ac9f975b [SPARK-32004][ALL] Drop references to slave
### What changes were proposed in this pull request?

This change replaces the world slave with alternatives matching the context.

### Why are the changes needed?

There is no need to call things slave, we might as well use better clearer names.

### Does this PR introduce _any_ user-facing change?

Yes, the ouput JSON does change. To allow backwards compatibility this is an additive change.
The shell scripts for starting & stopping workers are renamed, and for backwards compatibility old scripts are added to call through to the new ones while printing a deprecation message to stderr.

### How was this patch tested?

Existing tests.

Closes #28864 from holdenk/SPARK-32004-drop-references-to-slave.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-07-13 14:05:33 -07:00
Frank Yin ad90cbff42 [SPARK-31831][SQL][TESTS] Use subclasses for mock in HiveSessionImplSuite
### What changes were proposed in this pull request?
Fix flaky test org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite by using subclasses to avoid classloader issue.

### Why are the changes needed?
It causes build instability.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
It is a fix for a flaky test, but need to run multiple times against Jenkins.

Closes #29069 from frankyin-factual/hive-tests.

Authored-by: Frank Yin <frank@factual.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-13 05:04:47 +09:00
Jungtaek Lim (HeartSaVioR) e6e43cb2f9 [SPARK-32242][SQL] CliSuite flakiness fix via differentiating cli driver bootup timeout and query execution timeout
### What changes were proposed in this pull request?

This patch tries to mitigate the flakiness of CliSuite, via below changes:

1. differentiate cli driver boot-up timeout (2 mins) and query execution timeout (parameter)

Cli driver boot-up is determined by master and app ID message. Given spark-sql doesn't print the message if `-e` option is specified, the patch simply add 2 mins on timeout for the case to cover the boot-up timeout.

2. don't fail the test even spark-sql doesn't gracefully shut down in 1 min.

3. extend timeout for `path command` test in CliSuite

### Why are the changes needed?

It took around 40 seconds for boot-up message (master: ... Application Id: ...) to be printed in stderr, while the overall timeout is 1 minute in many tests. This case the actual timeout for query execution is just 20 seconds, which may not be enough.

Some of the tests also failed with `org.scalatest.exceptions.TestFailedException: spark-sql did not exit gracefully`, which I don't feel the test has to be failed.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified with multiple triggers of Jenkins builds

Closes #29036 from HeartSaVioR/clisuite-flakiness-fix.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-10 13:12:25 +09:00
Jungtaek Lim (HeartSaVioR) ac6406e757 [SPARK-31831][SQL] HiveSessionImplSuite flakiness fix via mocking instances earlier than initializing HiveSessionImpl
### What changes were proposed in this pull request?

This patch changes the HiveSessionImplSuite to mock instances "before" initializing HiveSessionImpl, to avoid possible classloader issue.

### Why are the changes needed?

The failures of HiveSessionImplSuite always come from classloader issue. While I don't have clear idea what is happening, there's no part possibly dealing with classloader, except initializing HiveSessionImpl. We can move the mock initializations earlier than initialing HiveSessionImpl so that it can avoid possible classloader issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Verified with multiple triggers of Jenkins builds

Closes #29039 from HeartSaVioR/hive-session-impl-suite-flakiness-fix.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 14:32:20 -07:00
Ali Smesseim 8b0a54e6ff [SPARK-32057][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] ExecuteStatement: cancel and close should not transiently ERROR
### What changes were proposed in this pull request?
#28671 introduced a change where the order in which CANCELED state for SparkExecuteStatementOperation is set was changed. Before setting the state to CANCELED, `cleanup()` was called which kills the jobs, causing an exception to be thrown inside `execute()`. This causes the state to transiently become ERROR before being set to CANCELED. This PR fixes the order.

### Why are the changes needed?
Bug: wrong operation state is set.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test in SparkExecuteStatementOperationSuite.scala.

Closes #28912 from alismess-db/execute-statement-operation-cleanup-order.

Authored-by: Ali Smesseim <ali.smesseim@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-08 09:28:16 +09:00
Kent Yao 59a70879c0 [SPARK-32145][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message
### What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-29283, we only show the error message of root cause to end-users through JDBC client. In some cases, it erases the straightaway messages that we intentionally make to help them for better understanding.

The root cause is somehow obscure for JDBC end-users who only writing SQL queries.

e.g
```
Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.;
```

is better than just

```
Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2
```

We should do as Hive does in https://issues.apache.org/jira/browse/HIVE-14368

In general, this PR partially reverts SPARK-29283, ports HIVE-14368, and improves test coverage

### Why are the changes needed?

1. Do the same as Hive 2.3 and later for getting an error message in ThriftCLIService.GetOperationStatus
2. The root cause is somehow obscure for JDBC end-users who only writing SQL queries.
3. Consistency with `spark-sql` script

### Does this PR introduce _any_ user-facing change?

Yes, when running queries using thrift server and an error occurs, you will get the full stack traces instead of only the message of the root cause

### How was this patch tested?

add unit test

Closes #28963 from yaooqinn/SPARK-32145.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-06 10:34:31 +00:00
Kent Yao 9f8e15bb2e [SPARK-32034][SQL] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
### What changes were proposed in this pull request?

This PR port https://issues.apache.org/jira/browse/HIVE-14817 for spark thrift server.

### Why are the changes needed?

When stopping the HiveServer2, the non-daemon thread stops the server from terminating

```sql
"HiveServer2-Background-Pool: Thread-79" #79 prio=5 os_prio=31 tid=0x00007fde26138800 nid=0x13713 waiting on condition [0x0000700010c32000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hive.service.cli.session.SessionManager$1.sleepInterval(SessionManager.java:178)
	at org.apache.hive.service.cli.session.SessionManager$1.run(SessionManager.java:156)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Here is an example to reproduce:
https://github.com/yaooqinn/kyuubi/blob/master/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/spark/SparkSQLEngineApp.scala

Also, it causes issues as HIVE-14817 described which

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

Passing Jenkins

Closes #28870 from yaooqinn/SPARK-32034.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-21 16:28:00 -07:00
Kent Yao 93529a8536 [SPARK-31957][SQL] Cleanup hive scratch dir for the developer api startWithContext
### What changes were proposed in this pull request?

Comparing to the long-running ThriftServer via start-script, we are more likely to hit the issue https://issues.apache.org/jira/browse/HIVE-10415 / https://issues.apache.org/jira/browse/SPARK-31626 in the developer API `startWithContext`

This PR apply SPARK-31626 to the developer API `startWithContext`

### Why are the changes needed?

Fix the issue described in  SPARK-31626

### Does this PR introduce _any_ user-facing change?

Yes, the hive scratch dir will be deleted if cleanup is enabled for calling `startWithContext`

### How was this patch tested?

new test

Closes #28784 from yaooqinn/SPARK-31957.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-19 19:54:46 -07:00
Kent Yao abc8ccc37b [SPARK-31926][SQL][TESTS][FOLLOWUP][TEST-HIVE1.2][TEST-MAVEN] Fix concurrency issue for ThriftCLIService to getPortNumber
### What changes were proposed in this pull request?

This PR brings https://github.com/apache/spark/pull/28751 back

- It once reverted by 4a25200 because of inevitable maven test failure
    - See related updates in this followup a0187cd6b5

- And reverted again because of the flakiness of the added unit tests
   - In this PR, The flakiness reason found is caused by the hive metastore connection that the SparkSQLCLIService trying to create which turns out is unnecessary at all. This metastore client points to a dummy metastore server only.
   - Also, add some cleanups for SharedThriftServer trait in before and after to prevent its configurations being polluted or polluting others

### Why are the changes needed?

fix flaky test

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing sbt and maven tests

Closes #28835 from yaooqinn/SPARK-31926-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-19 05:58:54 +00:00
Max Gekk 350aa859fe [SPARK-32006][SQL] Create date/timestamp formatters once before collect in hiveResultString()
### What changes were proposed in this pull request?
1. Add method `getTimeFormatters` to `HiveResult` which creates timestamp and date formatters.
2. Move creation of `dateFormatter` and `timestampFormatter` from the constructor of the `HiveResult` object to `HiveResult. hiveResultString()` via `getTimeFormatters`. This allows to resolve time zone ID from Spark's session time zone `spark.sql.session.timeZone` and create date/timestamp formatters only once before collecting `java.sql.Timestamp`/`java.sql.Date` values.
3. Create date/timestamp formatters once in SparkExecuteStatementOperation.

### Why are the changes needed?
To fix perf regression comparing to Spark 2.4

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- By existing test suite `HiveResultSuite` and etc.
- Re-generate benchmarks results of `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28842 from MaxGekk/opt-toHiveString-oss-master.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-17 06:28:47 +00:00
yangjie01 d24d27f1bc [SPARK-31997][SQL][TESTS] Drop test_udtf table when SingleSessionSuite test completed
### What changes were proposed in this pull request?
`SingleSessionSuite` not do `DROP TABLE IF EXISTS test_udtf` when test completed, then if we do mvn test `HiveThriftBinaryServerSuite`, the test case named `SPARK-11595 ADD JAR with input path having URL scheme` will FAILED because it want to re-create an exists table test_udtf.

### Why are the changes needed?
test suite shouldn't rely on their execution order

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manual test,mvn test SingleSessionSuite and HiveThriftBinaryServerSuite in order

Closes #28838 from LuciferYang/drop-test-table.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-16 19:20:44 +09:00
Dongjoon Hyun 75afd88904 Revert "[SPARK-31926][SQL][TEST-HIVE1.2][TEST-MAVEN] Fix concurrency issue for ThriftCLIService to getPortNumber"
This reverts commit a0187cd6b5.
2020-06-15 19:04:23 -07:00
Kent Yao a0187cd6b5 [SPARK-31926][SQL][TEST-HIVE1.2][TEST-MAVEN] Fix concurrency issue for ThriftCLIService to getPortNumber
### What changes were proposed in this pull request?

This PR brings 02f32cfae4 back which reverted by 4a25200cd7 because of maven test failure

diffs newly made:
1. add a missing log4j file to test resources
2. Call `SessionState.detachSession()` to clean the thread local one in `afterAll`.
3. Not use dedicated JVMs for sbt test runner too

### Why are the changes needed?

fix the maven test

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add new tests

Closes #28797 from yaooqinn/SPARK-31926-NEW.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-15 06:10:24 +00:00
Kousuke Saruta 88a4e55fae [SPARK-31765][WEBUI][TEST-MAVEN] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty also upgraded because of dependency.
### Why are the changes needed?

Recently, a security issue which affects HtmlUnit is reported.
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is used for test so the impact might not be large but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-11 18:27:53 -05:00
Dongjoon Hyun 4a25200cd7 Revert "[SPARK-31926][SQL][TEST-HIVE1.2] Fix concurrency issue for ThriftCLIService to getPortNumber"
This reverts commit 02f32cfae4.
2020-06-10 17:21:03 -07:00
Kent Yao 02f32cfae4 [SPARK-31926][SQL][TEST-HIVE1.2] Fix concurrency issue for ThriftCLIService to getPortNumber
### What changes were proposed in this pull request?

When` org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext` called,
it starts `ThriftCLIService` in the background with a new Thread, at the same time we call `ThriftCLIService.getPortNumber,` we might not get the bound port if it's configured with 0.

This PR moves the  TServer/HttpServer initialization code out of that new Thread.

### Why are the changes needed?

Fix concurrency issue, improve test robustness.

### Does this PR introduce _any_ user-facing change?

NO
### How was this patch tested?

add new tests

Closes #28751 from yaooqinn/SPARK-31926.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 16:49:40 +00:00
Max Gekk 2c9988eaf3 [SPARK-31910][SQL] Enable Java 8 time API in Thrift server
### What changes were proposed in this pull request?
Set `spark.sql.datetime.java8API.enabled` to `true` in:
1. `SparkSQLEnv.init()` of Thrift server, and
2. `SparkSQLSessionManager.openSession()`

### Why are the changes needed?
1. Date and timestamp string literals are parsed by using Java 8 time API and Spark's session time zone. Before the changes, date/timestamp values were collected as legacy types `java.sql.Date`/`java.sql.Timestamp`, and the value of such types didn't respect the config `spark.sql.session.timeZone`. To have consistent view, users had to keep JVM time zone and Spark's session time zone in sync.
2. After the changes, formatting of date values doesn't depend on JVM time zone.
3. While returning dates/timestamps of Java 8 type, we can avoid dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian), and the issues related to calendar switching.
4. Properly handle negative years (BCE).
5. Consistent conversion of date/timestamp strings to/from internal Catalyst types in both direction to and from Spark.

### Does this PR introduce any user-facing change?
Yes. Before:
```sql
spark-sql> select make_date(-44, 3, 15);
0045-03-15
```
After:
```sql
spark-sql> select make_date(-44, 3, 15);
-0044-03-15
```

### How was this patch tested?
Manually via `bin/spark-sql`.

Closes #28729 from MaxGekk/enable-java8-time-api-in-thrift-server.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 14:18:16 +00:00
Juliusz Sompolski ea010138e9 [SPARK-31859][SPARK-31861][FOLLOWUP] Fix typo in tests
### What changes were proposed in this pull request?

It appears I have unintentionally used nested JDBC statements in the two tests I added.

### Why are the changes needed?

Cleanup a typo. Please merge to master/branch-3.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #28735 from juliuszsompolski/SPARK-31859-fixup.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 10:01:00 +00:00
HyukjinKwon a8266e44d4 Revert "[SPARK-28624][SQL][TESTS] Run date.sql via Thrift Server"
This reverts commit a4195d28ae.
2020-06-05 15:43:11 +09:00
Max Gekk a4195d28ae [SPARK-28624][SQL][TESTS] Run date.sql via Thrift Server
### What changes were proposed in this pull request?
Enable `date.sql` and run it via Thrift Server in `ThriftServerQueryTestSuite`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the enabled tests via:
```
$ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *ThriftServerQueryTestSuite -- -z date.sql"
```

Closes #28721 from MaxGekk/enable-date.sql-for-thrift.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-04 10:45:36 +09:00
HyukjinKwon baafd4386c Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0"
This reverts commit e5c3463910.
2020-06-03 14:15:30 +09:00
Kousuke Saruta e5c3463910 [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty also upgraded because of dependency.
### Why are the changes needed?

Recently, a security issue which affects HtmlUnit is reported.
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is used for test so the impact might not be large but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-02 08:29:07 -05:00
Juliusz Sompolski af35691de4 [SPARK-31859][SPARK-31861][SPARK-31863] Fix Thriftserver session timezone issues
### What changes were proposed in this pull request?

Timestamp literals in Spark are interpreted as timestamps in local timezone spark.sql.session.timeZone.

If JDBC client is e.g. in TimeZone UTC-7, and sets spark.sql.session.timeZone to PST, and sends a query "SELECT timestamp '2020-05-20 12:00:00'", and the JVM timezone of the Spark cluster is e.g. UTC+2, then what currently happens is:
* The timestamp literal in the query is interpreted as 12:00:00 UTC-7, i.e. 19:00:00 UTC.
* When it's returned from the query, it is collected as a java.sql.Timestamp object with Dataset.collect(), and put into a Thriftserver RowSet.
* Before sending it over the wire, the Timestamp is converted to String. This happens in explicitly in ColumnValue for RowBasedSet, and implicitly in ColumnBuffer for ColumnBasedSet (all non-primitive types are converted toString() there). The conversion toString uses JVM timezone, which results in a "21:00:00" (UTC+2) string representation.
* The client JDBC application parses gets a "21:00:00" Timestamp back (in it's JVM timezone; if the JDBC application cares about the correct UTC internal value, it should set spark.sql.session.timeZone to be consistent with its JVM timezone)

The problem is caused by the conversion happening in Thriftserver RowSet with the generic toString() function, instead of using HiveResults.toHiveString() that takes care of correct, timezone respecting conversions. This PR fixes it by converting the Timestamp values to String earlier, in SparkExecuteStatementOperation, using that function. This fixes SPARK-31861.

Thriftserver also did not work spark.sql.datetime.java8API.enabled, because the conversions in RowSet expected an Timestamp object instead of Instant object. Using HiveResults.toHiveString() also fixes that. For this reason, we also convert Date values in SparkExecuteStatementOperation as well - so that HiveResults.toHiveString() handles LocalDate as well. This fixes SPARK-31859.

Thriftserver also did not correctly set the active SparkSession. Because of that, configuration obtained using SQLConf.get was not the correct session configuration. This affected getting the correct spark.sql.session.timeZone. It is fixed by extending the use of SparkExecuteStatementOperation.withSchedulerPool to also set the correct active SparkSession. When the correct session is set, we also no longer need to maintain the pool mapping in a sessionToActivePool map. The scheduler pool can be just correctly retrieved from the session config. "withSchedulerPool" is renamed to "withLocalProperties" and moved into a mixin helper trait, because it should be applied with every operation. This fixes SPARK-31863.

I used the opportunity to move some repetitive code from the operations to the mixin helper trait.

Closes #28671 from juliuszsompolski/SPARK-31861.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-30 06:14:32 +00:00
Kent Yao fe1da296da [SPARK-31833][SQL][TEST-HIVE1.2] Set HiveThriftServer2 with actual port while configured 0
### What changes were proposed in this pull request?

When I was developing some stuff based on the `DeveloperAPI ` `org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext`, I need to use thrift port randomly to avoid race on ports. But the `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` do not respond to me with the actual bound port but always 0.

And the server log is not right too, after starting the server, it's hard to form to the right JDBC connection string.

```
INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 0 with 5...500 worker threads
```

Indeed, the `53742` is the right port
```shell
lsof -nP -p `cat ./pid/spark-kentyao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid` | grep LISTEN
java    18990 kentyao  288u    IPv6 0x22858e3e60d6a0a7      0t0                 TCP 10.242.189.214:53723 (LISTEN)
java    18990 kentyao  290u    IPv6 0x22858e3e60d68827      0t0                 TCP *:4040 (LISTEN)
java    18990 kentyao  366u    IPv6 0x22858e3e60d66987      0t0                 TCP 10.242.189.214:53724 (LISTEN)
java    18990 kentyao  438u    IPv6 0x22858e3e60d65d47      0t0                 TCP *:53742 (LISTEN)
```

In the PR, when the port is configured 0, the `portNum` will be set to the real used port during the start process.

Also use 0 in thrift related tests to avoid potential flakiness.

### Why are the changes needed?

1 fix API bug
2 reduce test flakiness

### Does this PR introduce _any_ user-facing change?

yes, `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` will always give you the actual port when it is configured 0.

### How was this patch tested?

modified unit tests

Closes #28651 from yaooqinn/SPARK-31833.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-29 04:13:23 +00:00
iRakson 765105b6f1 [SPARK-31638][WEBUI] Clean Pagination code for all webUI pages
### What changes were proposed in this pull request?

Pagination code across pages needs to be cleaned.
I have tried to clear out these things :
* Unused methods
* Unused method arguments
* remove redundant `if` expressions
* fix indentation

### Why are the changes needed?
This fix will make code more readable and remove unnecessary methods and variables.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually

Closes #28448 from iRakson/refactorPagination.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-27 08:59:08 -05:00
Gengliang Wang db5e5fce68 Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0"
This reverts commit 92877c4ef2.

Closes #28602 from gengliangwang/revertSPARK-31765.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-21 16:00:58 -07:00
Kousuke Saruta 92877c4ef2 [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty also upgraded because of dependency.
### Why are the changes needed?

Recently, a security issue which affects HtmlUnit is reported.
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is used for test so the impact might not be large but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-05-21 11:43:25 -07:00
iRakson f1495c5bc0 [SPARK-31688][WEBUI] Refactor Pagination framework
### What changes were proposed in this pull request?
Currently while implementing pagination using the existing pagination framework, a lot of code is being copied as pointed out [here](https://github.com/apache/spark/pull/28485#pullrequestreview-408881656).

I introduced some changes in `PagedTable` which is the main trait for implementing the pagination.
* Added function for getting table parameters.
* Added a function for table header row. This will help in maintaining consistency across the tables. All the header rows across tables will be consistent now.

### Why are the changes needed?

* A lot of code is copied every time pagination is implemented for any table.
* Code readability is not great as lot of HTML is embedded.
* Paginating other tables will be a lot easier now.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually. This is mainly refactoring work, no new functionality introduced. Existing test cases should pass.

Closes #28512 from iRakson/refactorPaginationFramework.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-21 13:00:00 -05:00
Ali Smesseim d40ecfa3f7
[SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener
### What changes were proposed in this pull request?

This is a recreation of #28155, which was reverted due to causing test failures.

The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log.

To improve robustness, we also make the following changes in HiveSessionImpl.close():

- Catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed.
- Handle not being able to access the scratch directory. When closing, all `.pipeout` files are removed from the scratch directory, which would have resulted in an NPE if the directory does not exist.

### Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this changes the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes #28544 from alismess-db/hive-thriftserver-listener-update-safer-2.

Authored-by: Ali Smesseim <ali.smesseim@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-20 10:30:17 -07:00
Kent Yao 1d66085a93 [SPARK-31289][TEST][TEST-HIVE1.2] Eliminate org.apache.spark.sql.hive.thriftserver.CliSuite flakiness
### What changes were proposed in this pull request?

CliSuite seems to be flaky while using metastoreDir per test.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/org.apache.spark.sql.hive.thriftserver/CliSuite/

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/junit/org.apache.spark.sql.hive.thriftserver/CliSuite/history/

According to the error stack trace in the failed test, the test failed to instantiate a hive metastore client because of derby requirements.
```scala
Caused by: ERROR XBM0A:
 The database directory '/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-9249ce52-0a06-42b6-a3df-e6295e880df0' exists.
However, it does not contain the expected 'service.properties' file.
Perhaps Derby was brought down in the middle of creating this database.
You may want to delete this directory and try creating the database again.
```

The derby requires the metastore dir does not exist, but it does exist probably due to the test case before it failed to clear the metastore dir

In this PR, the metastore is shared across the tests of CliSuite except those explicitly asked a separated metastore env itself

### Why are the changes needed?

CliSuite seems to be flaky while using metastoreDir per test.
 To eliminate test flakiness
### Does this PR introduce any user-facing change?

no
### How was this patch tested?

modified test

Closes #28055 from yaooqinn/clisuite.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-16 09:11:21 +00:00