Commit graph

25792 commits

Author SHA1 Message Date
Kent Yao 2dd6807e42 [SPARK-28023][SQL] Add trim logic in UTF8String's toInt/toLong to make it consistent with other string-numeric casting
### What changes were proposed in this pull request?

Modify `UTF8String.toInt/toLong` to trim spaces on both sides before converting the string to byte/short/int/long.

This kind of "cheap" trim helps improve the performance of casting strings to integrals. The idea comes from https://github.com/apache/spark/pull/24872#issuecomment-556917834
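
Conceptually, the change skips leading and trailing ASCII spaces before parsing instead of allocating a trimmed copy. A simplified sketch of the idea only (on `java.lang.String`; the real change works on `UTF8String`'s UTF-8 bytes and parses digits in place):

```scala
// Simplified illustration, not the actual UTF8String implementation.
def toIntWithTrim(s: String): Option[Int] = {
  var start = 0
  var end = s.length - 1
  while (start <= end && s.charAt(start) == ' ') start += 1  // skip leading spaces
  while (end >= start && s.charAt(end) == ' ') end -= 1      // skip trailing spaces
  if (start > end) None
  else scala.util.Try(s.substring(start, end + 1).toInt).toOption
}

toIntWithTrim("  453  ")  // Some(453)
toIntWithTrim("   ")      // None
```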

### Why are the changes needed?

Make the behavior consistent.

### Does this PR introduce any user-facing change?
Yes. Casting a string to an integral type, and binary comparisons between strings and integrals, will trim spaces first, so their behavior becomes consistent with float and double.
### How was this patch tested?
1. Added unit tests.
2. Benchmark tests.
 The benchmark is modified based on https://github.com/apache/spark/pull/24872#issuecomment-503827016

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

/**
 * Benchmark trim the string when casting string type to Boolean/Numeric types.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/CastBenchmark-results.txt".
 * }}}
 */
object CastBenchmark extends SqlBasedBenchmark {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val title = "Cast String to Integral"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 500L << 14
        val df = spark.range(N)
        val types = Seq("int", "long")
        (1 to 5).by(2).foreach { i =>
          df.selectExpr(s"concat(id, '${" " * i}') as str")
            .write.mode("overwrite").parquet(dir + i.toString)
        }

        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(true, false).foreach { trim =>
          types.foreach { t =>
            val str = if (trim) "trim(str)" else "str"
            val expr = s"cast($str as $t) as c_$t"
            (1 to 5).by(2).foreach { i =>
              benchmark.addCase(expr + s" - with $i spaces") { _ =>
                spark.read.parquet(dir + i.toString).selectExpr(expr).collect()
              }
            }
          }
        }
        benchmark.run()
      }
    }
  }
}
```
#### Benchmark results
Normal trim vs. trim in `toInt`/`toLong`:
```java
================================================================================================
Cast String to Integral
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
Cast String to Integral:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast(trim(str) as int) as c_int - with 1 spaces          10220          12994        1337          0.8        1247.5       1.0X
cast(trim(str) as int) as c_int - with 3 spaces           4763           8356         357          1.7         581.4       2.1X
cast(trim(str) as int) as c_int - with 5 spaces           4791           8042         NaN          1.7         584.9       2.1X
cast(trim(str) as long) as c_long - with 1 spaces           4014           6755         NaN          2.0         490.0       2.5X
cast(trim(str) as long) as c_long - with 3 spaces           4737           6938         NaN          1.7         578.2       2.2X
cast(trim(str) as long) as c_long - with 5 spaces           4478           6919        1404          1.8         546.6       2.3X
cast(str as int) as c_int - with 1 spaces           4443           6222         NaN          1.8         542.3       2.3X
cast(str as int) as c_int - with 3 spaces           3659           3842         170          2.2         446.7       2.8X
cast(str as int) as c_int - with 5 spaces           4372           7996         NaN          1.9         533.7       2.3X
cast(str as long) as c_long - with 1 spaces           3866           5838         NaN          2.1         471.9       2.6X
cast(str as long) as c_long - with 3 spaces           3793           5449         NaN          2.2         463.0       2.7X
cast(str as long) as c_long - with 5 spaces           4947           5961        1198          1.7         603.9       2.1X
```

Closes #26622 from yaooqinn/cheapstringtrim.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-22 19:32:27 +08:00
LantaoJin 9ec2a4e58c [SPARK-29911][SQL][FOLLOWUP] Move related unit test to ThriftServerWithSparkContextSuite
### What changes were proposed in this pull request?
This is follow up of #26543

See https://github.com/apache/spark/pull/26543#discussion_r348934276

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #26628 from LantaoJin/SPARK-29911_FOLLOWUP.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-22 18:36:50 +09:00
Wenchen Fan e2f056f4a8 [SPARK-29975][SQL] introduce --CONFIG_DIM directive
### What changes were proposed in this pull request?

Allow the SQL test files to specify different dimensions of config sets during testing. For example:
```
--CONFIG_DIM1 a=1
--CONFIG_DIM1 b=2,c=3

--CONFIG_DIM2 x=1
--CONFIG_DIM2 y=1,z=2
```

This example defines 2 config dimensions, and each dimension defines 2 config sets. We will run the queries 4 times:
1. a=1, x=1
2. a=1, y=1, z=2
3. b=2, c=3, x=1
4. b=2, c=3, y=1, z=2
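
A hedged sketch (not the `SQLQueryTestSuite` code) of how the dimensions expand into the four runs above: the config sets are the cross product of the per-dimension lists.

```scala
// Hypothetical representation of the two dimensions declared above.
val dim1 = Seq(Seq("a" -> "1"), Seq("b" -> "2", "c" -> "3"))
val dim2 = Seq(Seq("x" -> "1"), Seq("y" -> "1", "z" -> "2"))

// Cross product: one combined config set per run.
val configSets = for (d1 <- dim1; d2 <- dim2) yield d1 ++ d2
configSets.foreach(println)
// List((a,1), (x,1))
// List((a,1), (y,1), (z,2))
// List((b,2), (c,3), (x,1))
// List((b,2), (c,3), (y,1), (z,2))
```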

### Why are the changes needed?

Currently `SQLQueryTestSuite` takes a long time. This is because we run each test at least 3 times, to check with different codegen modes. This is not necessary for most of the tests, e.g. DESC TABLE. We should only check these codegen modes for certain tests.

With the --CONFIG_DIM directive, we can do things like: test different join operators (broadcast or shuffle join) × different codegen modes.

After reducing the testing time, we should be able to run the thrift server SQL tests with config settings.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

test only

Closes #26612 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-22 10:56:28 +09:00
Wenchen Fan 6b4b6a87cd [SPARK-29558][SQL] ResolveTables and ResolveRelations should be order-insensitive
### What changes were proposed in this pull request?

Make `ResolveRelations` call `ResolveTables` at the beginning, and make `ResolveTables` call `ResolveTempViews`(newly added) at the beginning, to ensure the relation resolution priority.

### Why are the changes needed?

To resolve an `UnresolvedRelation`, the general process is:
1. Try to resolve to a (global) temp view first. If it's not a temp view, move on.
2. If the table name specifies a catalog, look up the table from the specified catalog. Otherwise, look up the table from the current catalog.
3. When looking up a table from the session catalog, return a v1 relation if the table provider is v1.

Currently, this process is done by 2 rules: `ResolveTables` and `ResolveRelations`. To avoid rule conflicts, we add a lot of checks:
1. `ResolveTables` only resolves `UnresolvedRelation` if it's not a temp view and the resolved table is not v1.
2. `ResolveRelations` only resolves `UnresolvedRelation` if the table name has fewer than 2 parts.

This requires running `ResolveTables` before `ResolveRelations`; otherwise we may resolve a v2 table to a v1 relation.

To clearly guarantee the resolution priority, and avoid massive changes, this PR proposes to call one rule in another rule to ensure the rule execution order. Now the process is simple:
1. first run `ResolveTempViews`, see if we can resolve relation to temp view
2. then run `ResolveTables`, see if we can resolve relation to v2 tables.
3. finally run `ResolveRelations`, see if we can resolve relation to v1 tables.
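
A minimal, hedged sketch of the call-ordering pattern described above; `Plan` and the rule bodies are placeholders, not the actual Analyzer rules:

```scala
case class Plan(description: String)

object ResolveTempViews {
  def apply(plan: Plan): Plan = plan // step 1: try (global) temp views first
}

object ResolveTables {
  // step 2: always give temp views the first chance, then try v2 tables
  def apply(plan: Plan): Plan = resolveV2Tables(ResolveTempViews(plan))
  private def resolveV2Tables(plan: Plan): Plan = plan
}

object ResolveRelations {
  // step 3: temp views and v2 tables get priority, then fall back to v1 tables
  def apply(plan: Plan): Plan = resolveV1Tables(ResolveTables(plan))
  private def resolveV1Tables(plan: Plan): Plan = plan
}
```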

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26214 from cloud-fan/resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-11-21 09:47:42 -08:00
Ximo Guanter 54c5087a3a [SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory
### What changes were proposed in this pull request?
When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions.

However, when someone is implementing WriteBuilder we only pass them the schema, but not the number of partitions. This is an asymmetrical developer experience.

This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory and carries the number of partitions of the data that is going to be written.
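
A hedged sketch of the shape described above, with the names taken from this description; the actual DSv2 interfaces are Java interfaces in the connector write API, so treat these Scala signatures as an approximation:

```scala
trait PhysicalWriteInfo {
  // Number of partitions of the data that will be written.
  def numPartitions(): Int
}

trait DataWriterFactory extends Serializable

trait BatchWrite {
  // The factory method now receives the physical info, so a sink can
  // pre-provision its target (e.g. create a Kafka topic with
  // info.numPartitions() partitions) before any writer task starts.
  def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory
}
```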

### Why are the changes needed?
Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example:

- it could be used to provision a Kafka topic with a specific number of partitions
- it could be used to scale a microservice prior to sending the data to it
- it could be used to create a DSv2 source that sends the data to another Spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tests passed

Closes #26591 from edrevo/temp.

Authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-22 00:19:25 +08:00
Takeshi Yamamuro cdcd43cbf2 [SPARK-29977][SQL] Remove newMutableProjection/newOrdering/newNaturalAscendingOrdering from SparkPlan
### What changes were proposed in this pull request?

This is to refactor `SparkPlan` code; it mainly removed `newMutableProjection`/`newOrdering`/`newNaturalAscendingOrdering` from `SparkPlan`.
The other modifications are listed below;
 - Move `BaseOrdering` from `o.a.s.sql.catalyst.expressions.codegen.GenerateOrdering.scala` to `o.a.s.sql.catalyst.expressions.ordering.scala`
 - Make `RowOrdering` extend `CodeGeneratorWithInterpretedFallback` for `BaseOrdering`
 - Remove the unused variables (`subexpressionEliminationEnabled` and `codeGenFallBack`) from `SparkPlan`

### Why are the changes needed?

For better code/test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing.

Closes #26615 from maropu/RefactorOrdering.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-21 23:51:12 +08:00
angerszhu 6146dc4562 [SPARK-29874][SQL] Optimize Dataset.isEmpty()
### What changes were proposed in this pull request?
Originally, we judged whether a Dataset is empty with:
```
 def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }
```
This adds two shuffles (from `limit()` and `groupBy().count()`) and then collects the result to the driver. It avoids OOM when collecting data to the driver, but it triggers computation of all partitions and adds extra shuffle stages.

We change it to
```
  def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
    plan.executeTake(1).isEmpty
  }
```
After this PR, we apply column pruning to the original LogicalPlan and use the `executeTake()` API, so we no longer add extra shuffles and only compute one partition's data in the last stage.
This reduces the cost of calling `Dataset.isEmpty()` without introducing memory issues on the driver side.

### Why are the changes needed?
Optimize Dataset.isEmpty()

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #26500 from AngersZhuuuu/SPARK-29874.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-21 18:43:21 +08:00
zhengruifeng 0f40d2a6ee [SPARK-29960][ML][PYSPARK] MulticlassClassificationEvaluator support hammingLoss
### What changes were proposed in this pull request?
MulticlassClassificationEvaluator support hammingLoss

### Why are the changes needed?
1. It is easy to compute hammingLoss from the confusion matrix (see the sketch below).
2. scikit-learn supports it.
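
A minimal sketch of that computation (not the MLlib implementation): for single-label multiclass classification, the Hamming loss is the fraction of misclassified samples, i.e. the off-diagonal mass of the confusion matrix divided by the total count.

```scala
def hammingLoss(confusion: Array[Array[Long]]): Double = {
  val total   = confusion.map(_.sum).sum.toDouble
  val correct = confusion.indices.map(i => confusion(i)(i)).sum.toDouble
  (total - correct) / total  // misclassified fraction
}

// 2 misclassified samples out of 10 => 0.2
hammingLoss(Array(Array(4L, 1L), Array(1L, 4L)))
```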

### Does this PR introduce any user-facing change?
yes

### How was this patch tested?
added testsuites

Closes #26597 from zhengruifeng/multi_class_hamming_loss.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-11-21 18:32:28 +08:00
zhengruifeng 297cbab98e [SPARK-29942][ML] Impl Complement Naive Bayes Classifier
### What changes were proposed in this pull request?
Impl Complement Naive Bayes Classifier as a `modelType` option in `NaiveBayes`

### Why are the changes needed?
1. It is a better choice for text classification: the [scikit-learn docs](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes) state that 'CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.'
2. CNB is highly similar to the existing MNB; only a small part of the MNB code needs to change, so supporting CNB is an easy win.

### Does this PR introduce any user-facing change?
yes, a new `modelType` is supported

### How was this patch tested?
added testsuites

Closes #26575 from zhengruifeng/cnb.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-11-21 18:22:05 +08:00
gengjiaan 85c004d5b0 [SPARK-29885][PYTHON][CORE] Improve the exception message when reading the daemon port
### What changes were proposed in this pull request?
In a production environment, my PySpark application raised an exception whose message is below:
```
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```
At first, I thought many ports on the physical node were occupied by a large number of processes, but I found that the total number of ports in use was only 671.
```
[yarnr1115 ~]$ netstat -a | wc -l
671
```
I checked the code of PythonWorkerFactory at line 204 and found:
```
daemon = pb.start()
val in = new DataInputStream(daemon.getInputStream)
try {
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    throw new SparkException(s"No port number in $daemonModule's stdout")
}
```
I added some code here:
```
logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}")
logError("Exit value: ${daemon.exitValue()}")
```
Then I reproduced the exception, and its message is below:
```
19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is alive: false
19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206)
 at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
```
The exception message caused me a lot of confusion.
This PR adds a meaningful log message with the exception information.

### Why are the changes needed?
To clarify the exception, and to retry three times by default.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #26510 from beliefer/improve-except-message.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 16:13:42 +09:00
Dongjoon Hyun affaefe1f3 [MINOR][INFRA] Add io and net to GitHub Action Cache
### What changes were proposed in this pull request?

This PR aims to cache `~/.m2/repository/net` and `~/.m2/repository/io` to reduce the flakiness.

### Why are the changes needed?

This will stabilize GitHub Action more before adding `hive-1.2` and `hive-2.3` combination.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

After the GitHub Action on this PR passes, check the log.

Closes #26621 from dongjoon-hyun/SPARK-GHA-CACHE.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 15:43:57 +09:00
Kent Yao d555f8fcc9 [SPARK-29961][SQL][FOLLOWUP] Remove useless test for VectorUDT
### What changes were proposed in this pull request?

A follow-up to remove a useless test in VectorUDTSuite.

### Why are the changes needed?

Remove a useless test, which is already covered elsewhere.
### Does this PR introduce any user-facing change?

no

### How was this patch tested?

no

Closes #26620 from yaooqinn/SPARK-29961-f.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-20 21:02:22 -08:00
HyukjinKwon 74cb1ffd68 [SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode
### What changes were proposed in this pull request?

This PR proposes to show different warning message when the pinned thread mode is enabled:

When enabled:

> PYSPARK_PIN_THREAD feature is enabled. However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
> To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.

When disabled:

> Currently, 'setLocalProperty' (set to local properties) with multiple threads does not properly work.
> Internally threads on PVM and JVM are not synced, and JVM thread can be reused for multiple threads on PVM, which fails to isolate local properties for each thread on PVM.
> To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
> To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.

### Why are the changes needed?

Currently, it shows the same warning message regardless of PYSPARK_PIN_THREAD being set. In the warning message it says "you can set PYSPARK_PIN_THREAD to true ..." which is confusing.

### Does this PR introduce any user-facing change?

Documentation and warning message as shown above.

### How was this patch tested?

Manually tested.

```bash
$ PYSPARK_PIN_THREAD=true ./bin/pyspark
```

```python
sc.setJobGroup("a", "b")
```

```
.../pyspark/util.py:141: UserWarning: PYSPARK_PIN_THREAD feature is enabled. However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.
  warnings.warn(msg, UserWarning)
```

```bash
$ ./bin/pyspark
```

```python
sc.setJobGroup("a", "b")
```

```
.../pyspark/util.py:141: UserWarning: Currently, 'setJobGroup' (set to local properties) with multiple threads does not properly work.
Internally threads on PVM and JVM are not synced, and JVM thread can be reused for multiple threads on PVM, which fails to isolate local properties for each thread on PVM.
To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.
  warnings.warn(msg, UserWarning)
```

Closes #26588 from HyukjinKwon/SPARK-22340.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 10:54:01 +09:00
Kent Yao 7a70670345 [SPARK-29961][SQL] Implement builtin function - typeof
### What changes were proposed in this pull request?
Add a `typeof` function for Spark to get the underlying data type of a value.
```sql
-- !query 0
select typeof(1)
-- !query 0 schema
struct<typeof(1):string>
-- !query 0 output
int

-- !query 1
select typeof(1.2)
-- !query 1 schema
struct<typeof(1.2):string>
-- !query 1 output
decimal(2,1)

-- !query 2
select typeof(array(1, 2))
-- !query 2 schema
struct<typeof(array(1, 2)):string>
-- !query 2 output
array<int>

-- !query 3
select typeof(a) from (values (1), (2), (3.1)) t(a)
-- !query 3 schema
struct<typeof(a):string>
-- !query 3 output
decimal(11,1)
decimal(11,1)
decimal(11,1)

```

##### presto

```sql
presto> select typeof(array[1]);
     _col0
----------------
 array(integer)
(1 row)
```
##### PostgreSQL

```sql
postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a);
 pg_typeof
-----------
 numeric
 numeric
 numeric
(3 rows)
```
##### impala
https://issues.apache.org/jira/browse/IMPALA-1597

### Why are the changes needed?
It is a useful function to have, helping us debug, test, and develop.

### Does this PR introduce any user-facing change?

add a new function
### How was this patch tested?

add ut and example

Closes #26599 from yaooqinn/SPARK-29961.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 10:28:32 +09:00
Maxim Gekk e6b157cf70 [SPARK-29978][SQL][TESTS] Check json_tuple does not truncate results
### What changes were proposed in this pull request?
I propose to add a test from the commit a936522113 for 2.4. I extended the test by a few more lengths of requested field to cover more code branches in Jackson Core. In particular, [the optimization](5eb8973f87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (L473-L476)) calls Jackson's method 42b8b56684/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java (L742-L746) where the internal buffer size is **8000**. In this way:
- 2000 to check 2000+2000+2000 < 8000
- 2800 from the 2.4 commit. It covers the specific case: 42b8b56684/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java (L746)
- 8000-1, 8000, 8000+1 are sizes around the size of the internal buffer
- 65535 to test an outstandingly large field.
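
A hedged sketch of the kind of check the added test performs, simplified to a single requested field; the lengths and assertion here are illustrative, not the actual `JsonFunctionsSuite` code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Field lengths around Jackson's 8000-byte internal buffer, plus a large one.
Seq(2000, 2800, 7999, 8000, 8001, 65535).foreach { len =>
  val json = s"""{"test": "${"x" * len}"}"""
  val value = spark.sql(s"SELECT json_tuple('$json', 'test')").head().getString(0)
  assert(value.length == len, s"json_tuple truncated a field of length $len")
}
```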

### Why are the changes needed?
To be sure that the current implementation and future versions of Spark don't have the bug fixed in 2.4.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `JsonFunctionsSuite`.

Closes #26613 from MaxGekk/json_tuple-test.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 09:59:31 +09:00
LantaoJin 06e203b856 [SPARK-29911][SQL] Uncache cached tables when session closed
### What changes were proposed in this pull request?
The local temporary view is session-scoped: its lifetime is the lifetime of the session that created it. But cached data is cross-session, and its lifetime is the lifetime of the Spark application. That will cause a memory leak if a local temporary view is cached in memory and the session is closed without dropping it.
In this PR, we uncache the cached data of a local temporary view when its session is closed. This PR doesn't affect the cached data of global temp views and persisted views.

How to reproduce:
1. create a local temporary view v1
2. cache it in memory
3. close the session without dropping v1.

The application will hold the memory forever; in a long-running thrift server scenario, it's even worse.
```shell
0: jdbc:hive2://localhost:10000> CACHE TABLE testCacheTable AS SELECT 1;
CACHE TABLE testCacheTable AS SELECT 1;
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (1.498 seconds)
0: jdbc:hive2://localhost:10000> !close
!close
Closing: 0: jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000 (closed)> !connect 'jdbc:hive2://localhost:10000'
!connect 'jdbc:hive2://localhost:10000'
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
lajin
Enter password for jdbc:hive2://localhost:10000:
***
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://localhost:10000> select * from testCacheTable;
select * from testCacheTable;
Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [testCacheTable] (state=,code=0)
```
<img width="1047" alt="Screen Shot 2019-11-15 at 2 03 49 PM" src="https://user-images.githubusercontent.com/1853780/68923527-7ca8c180-07b9-11ea-9cc7-74f276c46840.png">

### Why are the changes needed?
Resolve memory leak for thrift server

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually tested in the UI Storage tab, and added a UT.

Closes #26543 from LantaoJin/SPARK-29911.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 18:19:30 -06:00
Sean Owen 1febd373ea [MINOR][TESTS] Replace JVM assert with JUnit Assert in tests
### What changes were proposed in this pull request?

Use JUnit assertions in tests uniformly, not JVM assert() statements.

### Why are the changes needed?

assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.

### Does this PR introduce any user-facing change?

No. The assertion logic should be identical.

### How was this patch tested?

Existing tests.

Closes #26581 from srowen/assertToJUnit.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 14:04:15 -06:00
Yuanjian Li 23b3c4fafd [SPARK-29951][SQL] Make the behavior of Postgre dialect independent of ansi mode config
### What changes were proposed in this pull request?
Fix the inconsistent behavior of the built-in SQL functions LEFT/RIGHT.

### Why are the changes needed?
As the comment in https://github.com/apache/spark/pull/26497#discussion_r345708065, Postgre dialect should not be affected by the ANSI mode config.
While rerunning the existing tests, only the LEFT/RIGHT built-in SQL functions broke the assumption. We fix this by following https://www.postgresql.org/docs/12/sql-keywords-appendix.html: `LEFT/RIGHT` are reserved (can be function or type).

### Does this PR introduce any user-facing change?
Yes, the Postgre dialect will not be affected by the ANSI mode config.

### How was this patch tested?
Existing UT.

Closes #26584 from xuanyuanking/SPARK-29951.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-21 00:56:48 +08:00
Takeshi Yamamuro 6eeb131941 [SPARK-28885][SQL][FOLLOW-UP] Re-enable the ported PgSQL regression tests of SQLQueryTestSuite
### What changes were proposed in this pull request?

SPARK-28885 (#26107) added support for the ANSI store assignment rules and stopped running some ported PgSQL regression tests that violate the rules. To re-activate these tests, this PR modifies them so that they pass under the rules.

### Why are the changes needed?

To make the test coverage better.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26492 from maropu/SPARK-28885-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-20 08:32:13 -08:00
Luca Canali b5df40bd87 [SPARK-29894][SQL][WEBUI] Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
### What changes were proposed in this pull request?
The Web UI SQL Tab provides information on the executed SQL using plan graphs and by reporting SQL execution plans. Both sources provide useful information. Physical execution plans report Codegen Stage Ids. This PR adds Codegen Stage Ids to the plan graphs.

### Why are the changes needed?
It is useful to have Codegen Stage Id information also reported in plan graphs; this makes it easier to match physical plans and graphs with metrics when troubleshooting SQL execution.
Example snippet to show the proposed change:

![](https://issues.apache.org/jira/secure/attachment/12985837/snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png)

Example of the current state:
![](https://issues.apache.org/jira/secure/attachment/12985838/snippet_plan_graph_before_patch.png)

Physical plan:
![](https://issues.apache.org/jira/secure/attachment/12985932/Physical_plan_Annotated.png)

### Does this PR introduce any user-facing change?
This PR adds Codegen Stage Id information to SQL plan graphs in the Web UI/SQL Tab.

### How was this patch tested?
Added a test + manually tested

Closes #26519 from LucaCanali/addCodegenStageIdtoWEBUIGraphs.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-20 23:20:33 +08:00
Huaxin Gao 56a65b971d [SPARK-18409][ML] LSH approxNearestNeighbors should use approxQuantile instead of sort
### What changes were proposed in this pull request?
```LSHModel.approxNearestNeighbors``` sorts the full dataset on the hashDistance in order to find a threshold. This PR uses approxQuantile instead.
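
A hedged sketch of the idea (the column name, fraction, and helper are illustrative, not the actual `LSHModel` code): instead of sorting the whole dataset on the hash distance, ask for an approximate quantile at the fraction corresponding to the requested number of neighbors.

```scala
import org.apache.spark.sql.DataFrame

// Find a hash-distance threshold that keeps roughly `numNearestNeighbors` rows.
def approxThreshold(
    hashDistances: DataFrame,      // assumed to have a "hashDist" column
    numNearestNeighbors: Int,
    relativeError: Double): Double = {
  val count = hashDistances.count()
  val quantile = math.min(1.0, numNearestNeighbors.toDouble / count)
  hashDistances.stat.approxQuantile("hashDist", Array(quantile), relativeError).head
}
```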

### Why are the changes needed?
To improve performance.

### Does this PR introduce any user-facing change?
Yes.
Changed ```LSH``` to make it extend ```HasRelativeError```
```LSH``` and ```LSHModel``` have new APIs ```setRelativeError/getRelativeError```

### How was this patch tested?
Existing tests. Also added a couple of doctests in Python to test the newly added ```getRelativeError```.

Closes #26415 from huaxingao/spark-18409.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 08:20:16 -06:00
Takeshi Yamamuro 0032d85153 [SPARK-29968][SQL] Remove the Predicate code from SparkPlan
### What changes were proposed in this pull request?

This is to refactor Predicate code; it mainly removed `newPredicate` from `SparkPlan`.
Modifications are listed below;
 - Move `Predicate` from `o.a.s.sql.catalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sql.catalyst.expressions.predicates.scala`
 - To resolve the name conflict, rename `o.a.s.sql.catalyst.expressions.codegen.Predicate` to `o.a.s.sql.catalyst.expressions.BasePredicate`
 - Extend `CodeGeneratorWithInterpretedFallback` for `BasePredicate`

This comes from the cloud-fan suggestion: https://github.com/apache/spark/pull/26420#discussion_r348005497

### Why are the changes needed?

For better code/test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26604 from maropu/RefactorPredicate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-20 21:13:51 +08:00
Nikita Konda 5a70af7a6c [SPARK-29029][SQL] Use AttributeMap in PhysicalOperation.collectProjectsAndFilters
### What changes were proposed in this pull request?

This PR fixes the issue of substituting aliases while collecting filters in `PhysicalOperation.collectProjectsAndFilters`. When the `AttributeReference` in the alias map differs from the `AttributeReference` in the filter condition only in its qualifier, the alias is not substituted and an exception saying `key videoid#47L not found` is thrown in the following scenario.

```
[1] Project [userid#0]
+- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30))
   +- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0]
      +- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10))
         +- [5] Relation[userid#0,videoid#1,avebitrate#2]
```

### Why are the changes needed?

We need to use `AttributeMap` where the key is `AttributeReference`'s `ExprId` instead of `Map[Attribute, Expression]` while collecting and substituting aliases in `PhysicalOperation.collectProjectsAndFilters`.
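
A hedged illustration of the difference using Catalyst internals as of this era (treat the exact API as an approximation): a plain `Map` keyed on `Attribute` misses an entry when only the qualifier differs, while `AttributeMap` looks up by `ExprId`.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference, Literal}
import org.apache.spark.sql.types.LongType

val videoid   = AttributeReference("videoid", LongType)()
val qualified = videoid.withQualifier(Seq("t"))   // same ExprId, different qualifier

val plainMap = Map(videoid -> Literal(1L))
val attrMap  = AttributeMap(Seq(videoid -> Literal(1L)))

plainMap.contains(qualified)  // false: Attribute equality also compares the qualifier
attrMap.contains(qualified)   // true: AttributeMap looks up by the ExprId only
```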

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New unit tests were added in `TestPhysicalOperation` which reproduces the bug

Closes #25761 from nikitagkonda/SPARK-29029-use-attributemap-for-aliasmap-in-physicaloperation.

Authored-by: Nikita Konda <nikita.konda@workday.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-19 20:01:42 -08:00
Wenchen Fan 9e58b10c8e [SPARK-29945][SQL] do not handle negative sign specially in the parser
### What changes were proposed in this pull request?

Remove the special handling of the negative sign in the parser (interval literal and type constructor)

### Why are the changes needed?

The negative sign is an operator (UnaryMinus). We don't need to handle it specially; doing so is effectively constant folding on the parser side.

### Does this PR introduce any user-facing change?

The error message becomes a little different. Now it reports type mismatch for the `-` operator.

### How was this patch tested?

existing tests

Closes #26578 from cloud-fan/interval.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-11-20 11:08:04 +09:00
Maxim Gekk 40b8a08b8b [SPARK-29963][SQL][TESTS] Check formatting timestamps up to microsecond precision by JSON/CSV datasource
### What changes were proposed in this pull request?
In the PR, I propose to add tests from the commit 47cb1f359a for Spark 2.4 that check formatting of timestamp strings for various seconds fractions.

### Why are the changes needed?
To make sure that current behavior is the same as in Spark 2.4

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`.

Closes #26601 from MaxGekk/format-timestamp-micros-tests.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-20 10:34:25 +09:00
Liang-Chi Hsieh e753aa30e6 [SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG
### What changes were proposed in this pull request?

The Linter (R) GitHub workflow sometimes failed like:

https://github.com/apache/spark/pull/26509/checks?check_run_id=310718016

Failed message:
```
Executing: /tmp/apt-key-gpghome.8r74rQNEjj/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg: connecting dirmngr at '/tmp/apt-key-gpghome.8r74rQNEjj/S.dirmngr' failed: IPC connect call failed
gpg: keyserver receive failed: No dirmngr
##[error]Process completed with exit code 2.
```

It is due to a buggy GnuPG. Context:
https://github.com/sbt/website/pull/825
https://github.com/sbt/sbt/issues/4261
https://github.com/microsoft/WSL/issues/3286

### Why are the changes needed?

Make lint-r github workflows work.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass github workflows.

Closes #26602 from viirya/SPARK-29964.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-19 15:56:50 -08:00
John Bauer e804ed5e33 [SPARK-29691][ML][PYTHON] ensure Param objects are valid in fit, transform
modify Param._copyValues to check valid Param objects supplied as extra

### What changes were proposed in this pull request?

Estimator.fit() and Model.transform() accept a dictionary of extra parameters whose values are used to overwrite those supplied at initialization or by default.  Additionally, the ParamGridBuilder.addGrid accepts a parameter and list of values. The keys are presumed to be valid Param objects. This change adds a check that only Param objects are supplied as keys.

### Why are the changes needed?

Param objects are created by and bound to an instance of Params (Estimator, Model, or Transformer). They may be obtained from their parent as attributes, or by name through getParam.

The documentation does not state that keys must be valid Param objects, nor describe how one may be obtained. The current behavior is to silently ignore keys which are not valid Param objects.

### Does this PR introduce any user-facing change?

If the user does not pass in a Param object as required for keys in `extra` for Estimator.fit() and Model.transform(), and `param` for ParamGridBuilder.addGrid, an error will be raised indicating it is an invalid object.

### How was this patch tested?

Added method test_copy_param_extras_check to test_param.py.   Tested with Python 3.7

Closes #26527 from JohnHBauer/paramExtra.

Authored-by: John Bauer <john.h.bauer@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-11-19 14:15:00 -08:00
Wenchen Fan 3d2a6f464f [SPARK-29906][SQL] AQE should not introduce extra shuffle for outermost limit
### What changes were proposed in this pull request?

`AdaptiveSparkPlanExec` should forward `executeCollect` and `executeTake` to the underlying physical plan.

### Why are the changes needed?

Some physical plans have optimizations in `executeCollect` and `executeTake`. For example, `CollectLimitExec` won't do a shuffle for the outermost limit.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

a new test

This closes #26560

Closes #26576 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-11-19 10:39:38 -08:00
Jobit Mathew 6fb8b86065 [SPARK-29913][SQL] Improve Exception in postgreCastToBoolean
### What changes were proposed in this pull request?
Exception improvement.

### Why are the changes needed?
After selecting the pgSQL dialect, queries that fail because of wrong syntax give a long exception stack trace. For example,
`explain select cast ("abc" as boolean);`

Current output:

> ERROR SparkSQLDriver: Failed in [explain select cast ("abc" as boolean)]
> java.lang.IllegalArgumentException: invalid input syntax for type boolean: abc
> 	at org.apache.spark.sql.catalyst.expressions.postgreSQL.PostgreCastToBoolean.$anonfun$castToBoolean$2(PostgreCastToBoolean.scala:51)
> 	at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:277)
> 	at org.apache.spark.sql.catalyst.expressions.postgreSQL.PostgreCastToBoolean.$anonfun$castToBoolean$1(PostgreCastToBoolean.scala:44)
> 	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:773)
> 	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:460)
> 	at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52)
> 	at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.
>       .
>       .
>       .

### Does this PR introduce any user-facing change?
Yes. After this PR, output for above query will be:

> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: invalid input syntax for type boolean: abc;
>
> Time taken: 0.044 seconds, Fetched 1 row(s)
> 19/11/15 15:38:57 INFO SparkSQLCLIDriver: Time taken: 0.044 seconds, Fetched 1 row(s)

### How was this patch tested?
Updated existing test cases.

Closes #26546 from jobitmathew/pgsqlexception.

Authored-by: Jobit Mathew <jobit.mathew@huawei.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 21:30:38 +08:00
Kent Yao 79ed4ae2db [SPARK-29926][SQL] Fix weird interval string whose value is only a dangling decimal point
### What changes were proposed in this pull request?

Currently, we parse '1. second' as 1s and even '. second' as 0s.

```sql
-- !query 118
select interval '1. seconds'
-- !query 118 schema
struct<1 seconds:interval>
-- !query 118 output
1 seconds

-- !query 119
select interval '. seconds'
-- !query 119 schema
struct<0 seconds:interval>
-- !query 119 output
0 seconds
```

```sql
postgres=# select interval '1. second';
ERROR:  invalid input syntax for type interval: "1. second"
LINE 1: select interval '1. second';

postgres=# select interval '. second';
ERROR:  invalid input syntax for type interval: ". second"
LINE 1: select interval '. second';
```
We fix this in the new interval parser's VALUE_FRACTIONAL_PART state.

With further digging, we found that `1.` is valid in Python, R, Scala, Presto and so on, so this PR ONLY forbids invalid interval values of the form '. seconds'.

### Why are the changes needed?

bug fix

### Does this PR introduce any user-facing change?

Yes, now we treat '. second' and similar values as invalid intervals.
### How was this patch tested?

add ut

Closes #26573 from yaooqinn/SPARK-29926.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 21:01:26 +08:00
jiake a8d98833b8 [SPARK-29893] improve the local shuffle reader performance by changing the reading task number from 1 to multi
### What changes were proposed in this pull request?
This PR updates the local shuffle reader task number from 1 to `partitionStartIndices.length`.

### Why are the changes needed?
Improve the performance of local shuffle reader.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #26516 from JkSelf/improveLocalShuffleReader.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 19:18:08 +08:00
wangguangxin.cn ffc9753037 [SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long
### What changes were proposed in this pull request?
This PR tries to make sure that the results of *comparing 8 bytes at a time* and *comparing byte by byte* in RecordBinaryComparator are *consistent*, by reversing the long's bytes when the platform is little-endian and using Long.compareUnsigned.

### Why are the changes needed?
If the architecture supports unaligned access or the offset is 8-byte aligned, `RecordBinaryComparator` compares 8 bytes at a time by reading them as a long. The related code is:
```
    if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
      while (i <= leftLen - 8) {
        final long v1 = Platform.getLong(leftObj, leftOff + i);
        final long v2 = Platform.getLong(rightObj, rightOff + i);
        if (v1 != v2) {
          return v1 > v2 ? 1 : -1;
        }
        i += 8;
      }
    }
```

Otherwise, it compares byte by byte. The related code is:
```
    while (i < leftLen) {
      final int v1 = Platform.getByte(leftObj, leftOff + i) & 0xff;
      final int v2 = Platform.getByte(rightObj, rightOff + i) & 0xff;
      if (v1 != v2) {
        return v1 > v2 ? 1 : -1;
      }
      i += 1;
    }
```

However, on a little-endian machine, the result of *comparing by a long value* and *comparing byte by byte* may differ.

For two identical records, the offsets may vary between the first run and the second run, which leads to comparing them with either long comparison or byte-by-byte comparison, and the results may differ.
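
A self-contained sketch (not Spark's `RecordBinaryComparator`) showing how reading 8 bytes as a little-endian long can disagree with unsigned byte-wise order, and how reversing the bytes plus an unsigned comparison makes the two consistent:

```scala
import java.nio.{ByteBuffer, ByteOrder}

val left  = Array[Byte](0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
val right = Array[Byte](0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02)

def asLittleEndianLong(bytes: Array[Byte]): Long =
  ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong

// Byte-wise unsigned comparison: left > right (0x01 > 0x00 at index 0).
val byteWise = Integer.compare(left(0) & 0xff, right(0) & 0xff)                           // > 0

// Naive little-endian long comparison: left (= 1) < right (= 0x0200000000000000).
val naive = java.lang.Long.compare(asLittleEndianLong(left), asLittleEndianLong(right))   // < 0

// Reverse the bytes and compare unsigned: consistent with the byte-wise result.
val fixed = java.lang.Long.compareUnsigned(
  java.lang.Long.reverseBytes(asLittleEndianLong(left)),
  java.lang.Long.reverseBytes(asLittleEndianLong(right)))                                 // > 0
```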

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add new test cases in RecordBinaryComparatorSuite

Closes #26548 from WangGuangxin/binary_comparator.

Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 16:10:22 +08:00
Wenchen Fan 16134d6d0f [SPARK-29948][SQL] make the default alias consistent between date, timestamp and interval
### What changes were proposed in this pull request?

Update `Literal.sql` to make date, timestamp and interval consistent. They should all use the `TYPE 'value'` format.

### Why are the changes needed?

Make the default alias consistent. For example, without this patch we will see
```
scala> sql("select interval '1 day', date '2000-10-10'").show
+------+-----------------+
|1 days|DATE '2000-10-10'|
+------+-----------------+
|1 days|       2000-10-10|
+------+-----------------+
```

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26579 from cloud-fan/sql.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 15:37:35 +08:00
LantaoJin 5ac37a8265 [SPARK-29869][SQL] improve error message in HiveMetastoreCatalog#convertToLogicalRelation
### What changes were proposed in this pull request?
In our production environment, HiveMetastoreCatalog#convertToLogicalRelation occasionally throws an AssertionError:
```sql
scala> spark.table("hive_table").show
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:208)
  at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
  at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
  at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
  at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
  at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
  at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:55)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
  at org.apache.spark.sql.SparkSession.table(SparkSession.scala:585)
  at org.apache.spark.sql.SparkSession.table(SparkSession.scala:581)
  ... 47 elided
````
Most cases occurred when reading a table created by an old Spark version; after recreating the table, the issue goes away.

After a deep dive, the root cause is that this external table is a non-partitioned table but its `LOCATION` is set to a partitioned path {{/tablename/dt=yyyymmdd}}, so the partitionSpec is inferred.

### Why are the changes needed?
The above error message is very confusing. We need more details about the assertion failure.

This issue is caused by `PartitioningAwareFileIndex#inferPartitioning()`. For non-Hive-metastore Spark it is useful, but for a Hive table it shouldn't infer partitions if Hive tells us it's a non-partitioned table. (newly added)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add UT.

Closes #26499 from LantaoJin/SPARK-29869.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 15:22:08 +08:00
yudovin 2e71a6e7ba [SPARK-27558][CORE] Gracefully cleanup task when it fails with OOM exception
### What changes were proposed in this pull request?

When a task fails with an OOM exception, `UnsafeInMemorySorter.array` could be `null`. Meanwhile, `cleanupResources()` on task completion would call `UnsafeInMemorySorter.getMemoryUsage`, which in turn throws another NPE.

### Why are the changes needed?

Checking whether `array` is null in `UnsafeInMemorySorter.getMemoryUsage` should avoid the NPE.
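
For illustration, a minimal sketch of the guard in Scala (the real sorter is Java); the class and field below are simplified stand-ins, not the actual `UnsafeInMemorySorter`:

```scala
// Hypothetical, simplified sketch: report a safe value instead of dereferencing
// a backing array that was never allocated (or already freed) after an OOM.
class InMemorySorterSketch(private val array: Array[Long]) {
  def getMemoryUsage: Long =
    if (array == null) 0L      // task died mid-allocation: nothing to report
    else array.length * 8L     // 8 bytes per slot in the pointer array
}

// Task-completion cleanup can now query memory usage without risking an NPE.
assert(new InMemorySorterSketch(null).getMemoryUsage == 0L)
```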

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
It was tested manually.

Closes #26349 from ayudovin/fix-npe-in-listener.

Authored-by: yudovin <artsiom.yudovin@profitero.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-11-18 22:05:34 -08:00
Terry Kim 3d45779b68 [SPARK-29728][SQL] Datasource V2: Support ALTER TABLE RENAME TO
### What changes were proposed in this pull request?

This PR adds `ALTER TABLE a.b.c RENAME TO x.y.z` support for V2 catalogs.

### Why are the changes needed?

The current implementation doesn't support this command for V2 catalogs.

### Does this PR introduce any user-facing change?

Yes, now renaming a table works for v2 catalogs:
```
scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show
+---------+---------+
|namespace|tableName|
+---------+---------+
|  ns1.ns2|      old|
+---------+---------+

scala> spark.sql("ALTER TABLE testcat.ns1.ns2.old RENAME TO testcat.ns1.ns2.new").show

scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show
+---------+---------+
|namespace|tableName|
+---------+---------+
|  ns1.ns2|      new|
+---------+---------+
```
### How was this patch tested?

Added unit tests.

Closes #26539 from imback82/rename_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 12:03:29 +08:00
shivsood a834dba120 Revert "[SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils"
This reverts commit f7e53865, i.e. PR #26301, from master.

Closes #26583 from shivsood/revert_29644_master.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-18 18:44:16 -08:00
Yuming Wang 28a502c6e9 [SPARK-28527][FOLLOW-UP][SQL][TEST] Add guides for ThriftServerQueryTestSuite
### What changes were proposed in this pull request?
This PR adds guides for `ThriftServerQueryTestSuite`.

### Why are the changes needed?
Add guides

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #26587 from wangyum/SPARK-28527-FOLLOW-UP.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-18 18:13:11 -08:00
HyukjinKwon 882f54b0a3 [SPARK-29870][SQL][FOLLOW-UP] Keep CalendarInterval's toString
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26418. That PR removed `CalendarInterval`'s `toString` together with some unfinished changes.

### Why are the changes needed?

1. Ideally we should keep each PR isolated and focused on one issue, without touching unrelated code.

2. There are some other places where the string formats were exposed to users. For example:

    ```scala
    scala> sql("select interval 1 days as a").selectExpr("to_csv(struct(a))").show()
    ```
    ```
    +--------------------------+
    |to_csv(named_struct(a, a))|
    +--------------------------+
    |      "CalendarInterval...|
    +--------------------------+
    ```

3.  Such fixes:

    ```diff
     private def writeMapData(
        map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = {
      val keyArray = map.keyArray()
    + val keyString = mapType.keyType match {
    +   case CalendarIntervalType =>
    +    (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
    +   case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
    + }
    ```

    can cause a performance regression due to the type dispatch for each map.

### Does this PR introduce any user-facing change?

Yes, see case 2 above.

### How was this patch tested?

Manually tested.

Closes #26572 from HyukjinKwon/SPARK-29783.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-19 09:11:41 +09:00
HyukjinKwon 8469614c05 [SPARK-25694][SQL][FOLLOW-UP] Move 'spark.sql.defaultUrlStreamHandlerFactory.enabled' into StaticSQLConf.scala
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/26530 and proposes to move the configuration `spark.sql.defaultUrlStreamHandlerFactory.enabled` to `StaticSQLConf.scala` for consistency.
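
For reference, a hedged sketch of what such a static conf entry typically looks like; `ConfigBuilder` is Spark-internal (`private[spark]`), and the doc text and default below are assumptions, not copied from the patch:

```scala
// Illustrative only: roughly how a static SQL conf entry is declared.
// ConfigBuilder is private[spark], so this compiles only inside the
// org.apache.spark package tree; doc string and default are assumptions.
import org.apache.spark.internal.config.ConfigBuilder

object StaticSQLConfSketch {
  val DEFAULT_URL_STREAM_HANDLER_FACTORY_ENABLED =
    ConfigBuilder("spark.sql.defaultUrlStreamHandlerFactory.enabled")
      .doc("When true, register Hadoop's FsUrlStreamHandlerFactory on startup.")
      .booleanConf
      .createWithDefault(true)
}
```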

### Why are the changes needed?

To keep similar configurations together and improve readability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested as described in https://github.com/apache/spark/pull/26530.

Closes #26570 from HyukjinKwon/SPARK-25694.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-19 09:08:20 +09:00
Hossein 9514b822a7 [SPARK-29777][SPARKR] SparkR::cleanClosure aggressively removes a function required by user function
### What changes were proposed in this pull request?
The implementation for walking through the user function AST and picking up referenced variables and functions had an optimization to skip a branch if it had already been seen. This runs into an interesting problem in the following example:

```
df <- createDataFrame(data.frame(x=1))
f1 <- function(x) x + 1
f2 <- function(x) f1(x) + 2
dapplyCollect(df, function(x) { f1(x); f2(x) })
```
This results in the error:
```
org.apache.spark.SparkException: R computation failed with
 Error in f1(x) : could not find function "f1"
Calls: compute -> computeFunc -> f2
```

### Why are the changes needed?
Bug fix

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit tests in `test_utils.R`

Closes #26429 from falaki/SPARK-29777.

Authored-by: Hossein <hossein@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-19 09:04:59 +09:00
Kent Yao ea010a2bc2 [SPARK-29873][SQL][TEST][FOLLOWUP] set operations should not escape when regen golden file with --SET --import both specified
### What changes were proposed in this pull request?

When regenerating golden files, the set operations given via `--SET` will not be performed, but test files that also use `--import` should be exceptions because the set commands are needed there.

### Why are the changes needed?

Fix the test tool.
### Does this PR introduce any user-facing change?

### How was this patch tested?

Add a UT, but I'm not sure we need these tests for the test framework itself.
cc maropu cloud-fan

Closes #26557 from yaooqinn/SPARK-29873.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-19 01:32:13 +08:00
Kent Yao ae6b711b26 [SPARK-29941][SQL] Add ansi type aliases for char and decimal
### What changes were proposed in this pull request?

Checked against the SQL standard and PostgreSQL:

> CHAR is equivalent to CHARACTER. DEC is equivalent to DECIMAL. INT is equivalent to INTEGER. VARCHAR is equivalent to CHARACTER VARYING. ...

```sql
postgres=# select dec '1.0';
numeric
---------
1.0
(1 row)

postgres=# select CHARACTER '. second';
  bpchar
----------
 . second
(1 row)

postgres=# select CHAR '. second';
  bpchar
----------
 . second
(1 row)
```

### Why are the changes needed?

For better ANSI support.
### Does this PR introduce any user-facing change?

Yes, we add `character` as an alias for `char` and `dec` as an alias for `decimal`.
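
A hedged usage sketch (spark-shell), assuming the aliases are wired into the data type parser so they work wherever a type name is accepted, e.g. in CAST:

```scala
// Assuming the new aliases: DEC behaves like DECIMAL, CHARACTER like CHAR.
spark.sql("SELECT CAST('1.0' AS DEC(2, 1)) AS d").show()
spark.sql("SELECT CAST('. second' AS CHARACTER(8)) AS c").show()
```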

### How was this patch tested?

Add UTs.

Closes #26574 from yaooqinn/SPARK-29941.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 23:30:31 +08:00
fuwhu c32e228689 [SPARK-29859][SQL] ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands
### What changes were proposed in this pull request?
Add AlterNamespaceSetLocationStatement, AlterNamespaceSetLocation, and AlterNamespaceSetLocationExec to make ALTER DATABASE (SET LOCATION) look up the catalog like v2 commands.
Also refine AlterNamespaceSetProperties, AlterNamespaceSetPropertiesExec, DescribeNamespace, and DescribeNamespaceExec to use SupportsNamespaces instead of CatalogPlugin for the catalog parameter.

### Why are the changes needed?
It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users.

### Does this PR introduce any user-facing change?
Yes, add "ALTER NAMESPACE ... SET LOCATION" whose function is same as "ALTER DATABASE ... SET LOCATION" and "ALTER SCHEMA ... SET LOCATION".

### How was this patch tested?
New unit tests

Closes #26562 from fuwhu/SPARK-29859.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 20:40:23 +08:00
Kent Yao 50f6d930da [SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval
### What changes were proposed in this pull request?

We now have two different implementations for converting multi-units interval strings to CalendarInterval values.

One is used to convert interval string literals to CalendarInterval. This approach re-delegates the interval string to the Spark parser, which handles the string as a `singleInterval` -> `multiUnitsInterval` and eventually calls `IntervalUtils.fromUnitStrings`.

The other is used in `Cast`, which eventually calls `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the other.

We should unify these two for better performance and simpler logic. This PR uses the second approach.
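
For reference, a hedged sketch of the faster path; the method name comes from the description above, while the `UTF8String`-based signature is an assumption:

```scala
import org.apache.spark.sql.catalyst.util.IntervalUtils
import org.apache.spark.unsafe.types.UTF8String

// Assumed signature: stringToInterval(UTF8String) returning a CalendarInterval.
val iv = IntervalUtils.stringToInterval(UTF8String.fromString("1 day 2 hours 3 minutes"))
println(iv)
```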

### Why are the changes needed?

We should unify these two for better performance and simpler logic.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Existing UTs should not fail.

Closes #26491 from yaooqinn/SPARK-29870.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 15:50:06 +08:00
Kent Yao 5cebe587c7 [SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type
### What changes were proposed in this pull request?

Add 3 interval output styles, named `SQL_STANDARD`, `ISO_8601`, and `MULTI_UNITS`, and a new conf `spark.sql.dialect.intervalOutputStyle` to select among them. The `MULTI_UNITS` style displays interval values with the former behavior and is the default. The newly added `SQL_STANDARD` and `ISO_8601` styles can be found in the following table.

| Style | conf | Year-Month Interval | Day-Time Interval | Mixed Interval |
| -- | -- | -- | -- | -- |
| Format With Time Unit Designators | MULTI_UNITS | 1 year 2 mons | 1 days 2 hours 3 minutes 4.123456 seconds | interval 1 days 2 hours 3 minutes 4.123456 seconds |
| SQL STANDARD | SQL_STANDARD | 1-2 | 3 4:05:06 | -1-2 3 -4:05:06 |
| ISO8601 Basic Format | ISO_8601 | P1Y2M | P3DT4H5M6S | P-1Y-2M3D-4H-5M-6S |
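
A hedged usage sketch (spark-shell); the conf key comes from the description above, and the assumption that it can be set per session is mine; if it turns out to be static, pass it on startup via `--conf` instead:

```scala
// Switch to the SQL standard style; per the table above, a day-time interval
// would then render along the lines of "3 4:05:06".
spark.conf.set("spark.sql.dialect.intervalOutputStyle", "SQL_STANDARD")
spark.sql("SELECT interval 3 days 4 hours 5 minutes 6 seconds AS i").show(false)
```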

### Why are the changes needed?

For ANSI SQL support.
### Does this PR introduce any user-facing change?

Yes, interval output now has 3 styles.
### How was this patch tested?

add new unit tests

cc cloud-fan maropu MaxGekk HyukjinKwon thanks.

Closes #26418 from yaooqinn/SPARK-29783.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-18 15:42:22 +08:00
gschiavon 73912379d0 [SPARK-29020][SQL] Improving array_sort behaviour
### What changes were proposed in this pull request?
I've noticed that there are two functions to sort arrays: sort_array and array_sort.

sort_array dates from 1.5.0 and can order both ascending and descending.

array_sort dates from 2.4.0 and can only order ascending.

Basically, I just added the possibility of ordering either ascending or descending using array_sort.

I think it would be good to have unified behaviours and not have to use sort_array when you want descending order.
If you are new to Spark, you should still be able to sort arrays using the newest Spark functions.

### Why are the changes needed?
Basically, to be able to sort an array in descending order using *array_sort* instead of *sort_array* from 1.5.0.

### Does this PR introduce any user-facing change?
Yes, now you are able to sort an array in descending order. Note that it has the same behaviour with nulls as sort_array.
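
A hedged usage sketch of the descending case; the comparator-lambda form shown here is an assumption about how the merged change is exposed, not a quote from the patch:

```scala
// Assumed user-facing form: array_sort with a comparator lambda returning
// negative/zero/positive, here inverted to get descending order.
spark.sql("""
  SELECT array_sort(
           array(3, 1, 2),
           (l, r) -> CASE WHEN l > r THEN -1 WHEN l < r THEN 1 ELSE 0 END
         ) AS sorted_desc
""").show(false)
```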

### How was this patch tested?
Tests added.

This is the link to the [jira](https://issues.apache.org/jira/browse/SPARK-29020)

Closes #25728 from Gschiavon/improving-array-sort.

Lead-authored-by: gschiavon <german.schiavon@lifullconnect.com>
Co-authored-by: Takuya UESHIN <ueshin@databricks.com>
Co-authored-by: gschiavon <Gschiavon@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-18 16:07:05 +09:00
Zhou Jiang ee3bd6d768 [SPARK-25694][SQL] Add a config for URL.setURLStreamHandlerFactory
### What changes were proposed in this pull request?

Add a property `spark.fsUrlStreamHandlerFactory.enabled` to allow users to turn off the default registration of `org.apache.hadoop.fs.FsUrlStreamHandlerFactory`.
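
For illustration, a minimal sketch of what the flag guards; the property lookup below is illustrative (Spark does this registration once, at SQL shared-state initialization), and the conf name is the one from this description:

```scala
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// Illustrative guard: only register the JVM-wide URL stream handler factory
// when the flag is enabled (defaulting to the previous behavior). The factory
// can be set only once per process, which is why an opt-out is needed.
val enabled = sys.props.getOrElse("spark.fsUrlStreamHandlerFactory.enabled", "true").toBoolean
if (enabled) {
  URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
}
```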

### Why are the changes needed?

This [SPARK-25694](https://issues.apache.org/jira/browse/SPARK-25694) is a long-standing issue. Originally, [[SPARK-12868][SQL] Allow adding jars from hdfs](https://github.com/apache/spark/pull/17342) added this for better Hive support. However, it has a side effect when users run Apache Spark without `-Phive`: it causes exceptions when users try to use other custom factories or 3rd-party libraries that also try to set the factory. This configuration will unblock those non-Hive users.

### Does this PR introduce any user-facing change?

Yes. This provides a new user-configurable property.
By default, the behavior is unchanged.

### How was this patch tested?

Manual testing.

**BEFORE**
```
$ build/sbt package
$ bin/spark-shell
scala> sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory())
java.lang.Error: factory already defined
  at java.net.URL.setURLStreamHandlerFactory(URL.java:1134)
  ... 47 elided
```

**AFTER**
```
$ build/sbt package
$ bin/spark-shell --conf spark.sql.defaultUrlStreamHandlerFactory.enabled=false
scala> sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory())
```

Closes #26530 from jiangzho/master.

Lead-authored-by: Zhou Jiang <zhou_jiang@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: zhou-jiang <zhou_jiang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-11-18 05:44:00 +00:00
Dongjoon Hyun 42f8f79ff0 [SPARK-29936][R] Fix SparkR lint errors and add lint-r GitHub Action
### What changes were proposed in this pull request?

This PR fixes SparkR lint errors and adds `lint-r` GitHub Action to protect the branch.

### Why are the changes needed?

It turns out that we currently don't run it; it was recovered yesterday. However, since then, our Jenkins linter jobs (`master`/`branch-2.4`) have been broken on the `lint-r` tasks.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action on this PR in addition to Jenkins R and AppVeyor R.

Closes #26564 from dongjoon-hyun/SPARK-29936.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-17 21:01:01 -08:00
HyukjinKwon f280c6aa54 [SPARK-29378][R][FOLLOW-UP] Remove manual installation of Arrow dependencies in AppVeyor build
### What changes were proposed in this pull request?

This PR removes the manual installation of Arrow dependencies in the AppVeyor build.

### Why are the changes needed?

It's unnecessary. See https://github.com/apache/spark/pull/26555#discussion_r347178368

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

AppVeyor will test.

Closes #26566 from HyukjinKwon/SPARK-29378.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-18 12:54:21 +09:00