Commit graph

24251 commits

Author SHA1 Message Date
Wenchen Fan ffb362a705 [SPARK-19712][SQL][FOLLOW-UP] reduce code duplication
## What changes were proposed in this pull request?

abstract some common code into a method.

## How was this patch tested?

existing tests

Closes #24281 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-04 00:37:57 +08:00
Liang-Chi Hsieh d04a7371da [MINOR][DOC][SQL] Remove out-of-date doc about ORC in DataFrameReader and Writer
## What changes were proposed in this pull request?

According to current status, `orc` is available even Hive support isn't enabled. This is a minor doc change to reflect it.

## How was this patch tested?

Doc only change.

Closes #24280 from viirya/fix-orc-doc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-03 09:11:09 -07:00
Maxim Gekk 1bc672366d [SPARK-27344][SQL][TEST] Support the LocalDate and Instant classes in Java Bean encoders
## What changes were proposed in this pull request?

- Added new test for Java Bean encoder of the classes: `java.time.LocalDate` and `java.time.Instant`.
- Updated comment for `Encoders.bean`
- New Row getters: `getLocalDate` and `getInstant`
- Extended `inferDataType` to infer types for `java.time.LocalDate` -> `DateType` and `java.time.Instant` -> `TimestampType`.

## How was this patch tested?

By `JavaBeanDeserializationSuite`

Closes #24273 from MaxGekk/bean-instant-localdate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 17:45:59 +08:00
Dilip Biswal 3286bff942 [SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator.
## What changes were proposed in this pull request?
In the PR, we raise an AnalysisError when we detect the presense of aggregate expressions in where clause. Here is the problem description from the JIRA.

Aggregate functions should not be allowed in WHERE clause. But Spark SQL throws an exception when generating codes. It is supposed to throw an exception during parsing or analyzing.

Here is an example:
```
val df = spark.sql("select * from t where sum(ta) > 0")
df.explain(true)
df.show()
```
Resulting exception:
```
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint))
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
	at scala.Option.getOrElse(Option.scala:138)
```
Checked the behaviour of other database and all of them return an exception:
**Postgress**
```
select * from foo where max(c1) > 0;
Error
ERROR: aggregate functions are not allowed in WHERE Position: 25
```
**DB2**
```
db2 => select * from foo where max(c1) > 0;
SQL0120N  Invalid use of an aggregate function or OLAP function.
```
**Oracle**
```
select * from foo where max(c1) > 0;
ORA-00934: group function is not allowed here
```
**MySql**
```
select * from foo where max(c1) > 0;
Invalid use of group function
```

**Update**
This PR has been enhanced to report error when expressions such as Aggregate, Window, Generate are hosted by operators where they are invalid.
## How was this patch tested?
Added tests in AnalysisErrorSuite and group-by.sql

Closes #24209 from dilipbiswal/SPARK-27255.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 13:05:06 +08:00
Maxim Gekk 1d20d13149 [SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp
## What changes were proposed in this pull request?

In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception.

One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`.

## How was this patch tested?

The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 10:55:56 +08:00
Dilip Biswal b8b5acdd41 [SPARK-19712][SQL][FOLLOW-UP] Don't do partial pushdown when pushing down LeftAnti joins below Aggregate or Window operators.
## What changes were proposed in this pull request?
After [23750](https://github.com/apache/spark/pull/23750), we may pushdown left anti joins below aggregate and window operators with a partial join condition. This is not correct and was pointed out by hvanhovell and cloud-fan [here](https://github.com/apache/spark/pull/23750#discussion_r270017097). This pr addresses their comments.
## How was this patch tested?
Added two new tests to verify the behaviour.

Closes #24253 from dilipbiswal/SPARK-19712-followup.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 09:56:27 +08:00
Gabor Somogyi 3628242bd0 [MINOR][DSTREAMS] Add DStreamCheckpointData.cleanup warning if delete returns false
## What changes were proposed in this pull request?

While I was reviewing #24235 I've found a minor addition possibility. Namely `FileSystem.delete` returns a boolean which is not yet checked. In this PR I've added a warning message when it returns false. I've added this as MINOR because no control flow change introduced.

## How was this patch tested?

Existing unit tests.

Closes #24263 from gaborgsomogyi/SPARK-27301-minor.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-02 18:34:40 -05:00
Hyukjin Kwon d7dd59a6b4 [SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well
## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/23285. This PR adds the notes into PySpark and SparkR documentation as well.

While I am here, I revised the doc a bit to make it sound a bit more neutral

## How was this patch tested?

Manually built the doc and verified.

Closes #24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-03 08:30:24 +09:00
Hyukjin Kwon 949d712839 [SPARK-27346][SQL] Loosen the newline assert condition on 'examples' field in ExpressionInfo
## What changes were proposed in this pull request?

I haven't tested by myself on Windows and I am not 100% sure if this is going to cause an actual problem.

However, this one line:

827383a97c/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java (L82)

made me to investigate a lot today.

Given my speculation, if Spark is built in Linux and it's executed on Windows, it looks possible for multiline strings, like,

5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (L146-L150)

to throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`.

I think this is not yet found because this particular codes are not released yet (see SPARK-26426).

Looks just better to loosen the condition and forget about this stuff.

This should be backported into branch-2.4 as well.

## How was this patch tested?

N/A

Closes #24274 from HyukjinKwon/SPARK-27346.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-03 08:27:41 +09:00
Yuming Wang 13c5c1fb4b [SPARK-27180][BUILD][YARN] Fix testing issues with yarn module in Hadoop-3
## What changes were proposed in this pull request?

Fix testing issues with `yarn` module in Hadoop-3:

1. Upgrade jersey-1 to `1.19` to fix ```Cause: java.lang.NoClassDefFoundError: com/sun/jersey/spi/container/servlet/ServletContainer```.
2. Copy `ServerSocketUtil` from hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/ServerSocketUtil.java to fix ```java.lang.NoClassDefFoundError: org/apache/hadoop/net/ServerSocketUtil```.
3. Adapte `SessionHandler` from jetty-9.3.25.v20180904/jetty-server/src/main/java/org/eclipse/jetty/server/session/SessionHandler.java  to fix ```java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.getSessionManager()Lorg/eclipse/jetty/server/SessionManager```.

## How was this patch tested?

manual tests:
```shell
build/sbt yarn/test -Pyarn
build/sbt yarn/test -Phadoop-3.2 -Pyarn

build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn -Phadoop-3.2
```

Closes #24115 from wangyum/hadoop3-yarn.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-02 15:38:26 -05:00
Gabor Somogyi 57aff93886 [SPARK-26998][CORE] Remove SSL configuration from executors
## What changes were proposed in this pull request?

Different SSL passwords shown up as command line argument on executor side in standalone mode:
* keyStorePassword
* keyPassword
* trustStorePassword

In this PR I've removed SSL configurations from executors.

## How was this patch tested?

Existing + additional unit tests.
Additionally tested with standalone mode and checked the command line arguments:
```
[gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps
94803 CoarseGrainedExecutorBackend
94818 Jps
90149 RemoteMavenServer
91925 Nailgun
94793 SparkSubmit
94680 Worker
94556 Master
398
[gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef | egrep "94556|94680|94793|94803"
  502 94556     1   0  2:02PM ttys007    0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf
  502 94680     1   0  2:02PM ttys007    0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077
  502 94793 94782   0  2:02PM ttys007    0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell
  502 94803 94680   0  2:03PM ttys007    0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899
  502 94910 57352   0  2:05PM ttys008    0:00.00 egrep 94556|94680|94793|94803
```

Closes #24170 from gaborgsomogyi/SPARK-26998.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-04-02 09:18:43 -07:00
Sean Owen d4420b455a [SPARK-27323][CORE][SQL][STREAMING] Use Single-Abstract-Method support in Scala 2.12 to simplify code
## What changes were proposed in this pull request?

Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here.

## How was this patch tested?

Existing tests.

Closes #24241 from srowen/SPARK-27323.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-02 07:37:05 -07:00
Dongjoon Hyun d575a453db Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp"
This reverts commit c5e83ab92c.
2019-04-02 01:05:54 -07:00
Dongjoon Hyun a0d807d5ab [SPARK-26856][PYSPARK][FOLLOWUP] Fix UT failure due to wrong patterns for Kinesis assembly
## What changes were proposed in this pull request?

After [SPARK-26856](https://github.com/apache/spark/pull/23797), `Kinesis` Python UT fails with `Found multiple JARs` exception due to a wrong pattern.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console
```
Exception: Found multiple JARs:
.../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar,
.../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar;
please remove all but one
```

It's because the pattern was changed in a wrong way.

**Original**
```python
kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
```
**After SPARK-26856**
```python
project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
project_full_path, "target/%s*.jar" % jar_name_prefix))
```

The actual kinesis assembly jar files look like the followings.

**SBT Build**
```
-rw-r--r--  1 dongjoon  staff  87459461 Apr  1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

**MAVEN Build**
```
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar
-rw-r--r--   1 dongjoon  staff   8.7K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--   1 dongjoon  staff    21M Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files which are identical for both `sbt` and `mvn`. To sum up, The current jar pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns.

## How was this patch tested?

Manual. Please note that this will remove only `Found multiple JARs` exception. Kinesis tests need more configurations to run locally.
```
$ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
$ export ENABLE_KINESIS_TESTS=1
$ python/run-tests.py --python-executables python2.7 --module pyspark-streaming
```

Closes #24268 from dongjoon-hyun/SPARK-26856.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-02 14:52:56 +09:00
Marco Gaido 0b150f833c [SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn
## What changes were proposed in this pull request?

We have seen many cases when users make several subsequent calls to `withColumn` on a Dataset. This leads now to the generation of a lot of `Project` nodes on the top of the plan, with serious problem which can lead also to `StackOverflowException`s.

The PR improves the doc of `withColumn`, in order to advise the user to avoid this pattern and do something different, ie. a single select with all the column he/she needs.

## How was this patch tested?

NA

Closes #23285 from mgaido91/SPARK-26224.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-02 14:12:47 +09:00
Maxim Gekk c5e83ab92c [SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp
## What changes were proposed in this pull request?

In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception.

One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`.

## How was this patch tested?

The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-02 10:20:06 +08:00
Liang-Chi Hsieh eaf008ad0e [SPARK-27329][SQL] Pruning nested field in map of map key and value from object serializers
## What changes were proposed in this pull request?

If object serializer has map of map key/value, pruning nested field should work.

Previously object serializer pruner don't recursively prunes nested fields if it is deeply located in map key or value. This patch proposed to address it by slightly factoring the pruning logic.

## How was this patch tested?

Added tests.

Closes #24260 from viirya/SPARK-27329.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-01 13:53:55 -07:00
Giovanni Lanzani 92530c7db1 [SPARK-9792] Make DenseMatrix equality semantical
Before, you could have this code

```
A = SparseMatrix(2, 2, [0, 2, 3], [0], [2])
B = DenseMatrix(2, 2, [2, 0, 0, 0])

B == A  # False
A == B  # True
```

The second would be `True` as `SparseMatrix` already checks for semantic
equality. This commit changes `DenseMatrix` so that equality is
semantical as well.

## What changes were proposed in this pull request?

Better semantic equality for DenseMatrix

## How was this patch tested?

Unit tests were added, plus manual testing. Note that the code falls back to the old behavior when `other` is not a SparseMatrix.

Closes #17968 from gglanzani/SPARK-9792.

Authored-by: Giovanni Lanzani <giovanni@lanzani.nl>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
2019-04-01 09:30:33 -07:00
Marco Gaido 5888b15d9c [SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not
## What changes were proposed in this pull request?

When `GetMapValue` contains a foldable Map and a non-foldable key, `SimplifyExtractValueOps` fails to optimize it transforming it into case when statements.
The PR adds a case for covering this situation too.

## How was this patch tested?

added UT

Closes #24223 from mgaido91/SPARK-27278.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-01 09:09:06 -07:00
Maxim Gekk d332958109 [SPARK-27325][SQL] Add implicit encoders for LocalDate and Instant
## What changes were proposed in this pull request?

Added implicit encoders for the `java.time.LocalDate` and `java.time.Instant` classes. This allows creation of datasets from instances of the types.

## How was this patch tested?

Added new tests to `JavaDatasetSuite` and `DatasetSuite`.

Closes #24249 from MaxGekk/instant-localdate-encoders.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-01 23:02:48 +08:00
Marco Gaido 8012f55a9b [SPARK-26812][SQL] Report correct nullability for complex datatypes in Union
## What changes were proposed in this pull request?

When there is a `Union`, the reported output datatypes are the ones of the first plan and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means that the nullability of the inner elements is the one of the first plan. If this is not compatible with the one of other plans, errors can happen (as reported in the JIRA).

The PR proposes to update the nullability of the inner elements of complex datatypes according to most permissive value of all the plans.

## How was this patch tested?

added UT

Closes #23726 from mgaido91/SPARK-26812.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-01 22:22:10 +08:00
Yuming Wang f799e34962 [MINOR][BUILD] Upgrade apache-rat to 0.13
## What changes were proposed in this pull request?

This PR upgrade `apache-rat` to 0.13. Issues fixed by 0.13:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20RAT%20AND%20fixVersion%20%3D%200.13

## How was this patch tested?

manual tests

Closes #24262 from wangyum/apache-rat.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-01 16:44:42 +09:00
“attilapiros” 9eb896cc3b [SPARK-27333][TEST] Update thread audit whitelist to skip broadcast-exchange-.*, process reaper and StatisticsDataReferenceCleaner threads
## What changes were proposed in this pull request?

Update thread audit whitelist to skip threads of the global broadcast exchange thread pool, process reaper and Hadoop FS statistics data reference cleaner thread.

## How was this patch tested?

Via existing UT using broadcast exchange via `sbt` i.e:

```
> project sql
> testOnly *.SessionStateSuite -- -z "fork new sessions and run query on inherited table"
```

Before (wrapped long line for manually to save horizontal scrolling for reviewers):

```
===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.SessionStateSuite,
thread names: broadcast-exchange-6, broadcast-exchange-0,
broadcast-exchange-2, broadcast-exchange-5, broadcast-exchange-7,
broadcast-exchange-4, broadcast-exchange-1, process reaper, broadcast-exchange-3,
 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner =====
```

After this change no possible thread leak detected.

Closes #24244 from attilapiros/thread-audit-minor.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-31 17:33:31 -07:00
chakravarthiT fc9aad0957 [SPARK-27253][SQL] Prioritizes parent session's SQLConf over SparkConf when cloning a session
## What changes were proposed in this pull request?

Cloned session should prioritize `SQLConf` from parent's over `SparkConf`. Currently, when cloning a session, the child session has configuration set in `SparkConf` even the same properties are set to its parent `SQLConf`.

Currently, when a Spark session is cloned, `mergeSparkConf` in `BaseSessionStateBuilder`'s `conf` overwrites  `SQLConf` values as set in `SparkConf`.

This PR proposes to call `mergeSparkConf` only when the parent session is empty.

See below codes to read.

1. Parent's `sessionState`

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L268)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L157-L161)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)

2. Child `sessionState`

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L269)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L155)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L102)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L74)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L305)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L283)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L292)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)

## How was this patch tested?
Added UT and with existing Unit Tests.

Closes #24189 from chakravarthiT/CloneDiscardsConf.

Authored-by: chakravarthiT <tcchakra@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-01 09:33:18 +09:00
Takeshi Yamamuro 885aab40a2 [SPARK-27266][SQL] Support ANALYZE TABLE to collect tables stats for cached catalog views
## What changes were proposed in this pull request?
The current master doesn't support ANALYZE TABLE to collect tables stats for catalog views even if they are cached as follows;

```scala
scala> sql(s"CREATE VIEW v AS SELECT 1 c")
scala> sql(s"CACHE LAZY TABLE v")
scala> sql(s"ANALYZE TABLE v COMPUTE STATISTICS")
org.apache.spark.sql.AnalysisException: ANALYZE TABLE is not supported on views.;
...
```

Since SPARK-25196 has supported to an ANALYZE command to collect column statistics for cached catalog view, we could support table stats, too.

## How was this patch tested?
Added tests in `StatisticsCollectionSuite` and `InMemoryColumnarQuerySuite`.

Closes #24200 from maropu/SPARK-27266.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-31 17:24:21 -07:00
Maxim Gekk 6115a5e1a0 [SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String]
## What changes were proposed in this pull request?

Added new benchmarks for:
1. JSON functions: `from_json`, `json_tuple` and `get_json_object`
2. Parsing `Dataset[String]` with JSON records
3. Comparing just splitting input text by lines with schema inferring, per-line parsing when encoding is set and not set.

Also existing benchmarks were refactored to use the `NoOp` datasource to eliminate overhead of triggers like `.filter((_: Row) => true).count()`.

## How was this patch tested?

By running `JSONBenchmark` locally.

Closes #24252 from MaxGekk/json-benchmark-func.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-01 08:33:16 +09:00
gatorsmile 92b6f86f6d [SPARK-27244][CORE][TEST][FOLLOWUP] toDebugString redacts sensitive information
## What changes were proposed in this pull request?
This PR is a FollowUp of https://github.com/apache/spark/pull/24196. It improves the test case by using the parameters that are being used in the actual scenarios.

## How was this patch tested?
N/A

Closes #24257 from gatorsmile/followupSPARK-27244.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 22:58:28 -07:00
Yuming Wang b670f39fc6 [SPARK-24793][FOLLOW-UP][K8S] Remove duplicate declaration of mockito-core
## What changes were proposed in this pull request?

```
[WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-kubernetes_2.12🫙3.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.mockito:mockito-core:jar -> duplicate declaration of version (?)  org.apache.spark:spark-kubernetes_2.12:[unknown-version], /Users/yumwang/spark/resource-managers/kubernetes/core/pom.xml, line 98, column 17
```
This pr remove duplicate declaration of `mockito-core`.

## How was this patch tested?

N/A

Closes #24256 from wangyum/SPARK-24793-FOLLOW-UP.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 21:29:32 -07:00
Felix Cheung fa0f791d4d [MINOR][R] fix R project description
## What changes were proposed in this pull request?

update as per this NOTE when running CRAN check

```
The Title field should be in title case, current version then in title case:
‘R Front end for 'Apache Spark'’
‘R Front End for 'Apache Spark'’

```

Closes #24255 from felixcheung/rdesc.

Authored-by: Felix Cheung <felixcheung_m@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 21:00:46 -07:00
Sean Owen 754f820035 [SPARK-26918][DOCS] All .md should have ASF license header
## What changes were proposed in this pull request?

Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.

## How was this patch tested?

Doc build

Closes #24243 from srowen/SPARK-26918.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 19:49:45 -05:00
Dongjoon Hyun 88ea319871 Revert "[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores"
This reverts commit f8fa564dec.
2019-03-30 16:35:34 -07:00
Gengliang Wang 5dab5f651f [SPARK-27326][SQL] Fall back all v2 file sources in InsertIntoTable to V1 FileFormat
## What changes were proposed in this pull request?

In the first PR for file source V2, there was a rule for falling back Orc V2 table to OrcFileFormat: https://github.com/apache/spark/pull/23383/files#diff-57e8244b6964e4f84345357a188421d5R34

As we are migrating more file sources to data source V2, we should make the rule more generic. This PR proposes to:
1. Rename the rule `FallbackOrcDataSourceV2 ` to `FallBackFileSourceV2`.The name is more generic. And we use "fall back" as verb, while "fallback" is noun.
2. Rename the method `fallBackFileFormat` in `FileDataSourceV2` to `fallbackFileFormat`. Here we should use "fallback" as noun.
3. Add new method `fallbackFileFormat` in `FileTable`. This is for falling back to V1 in rule `FallbackOrcDataSourceV2 `.

## How was this patch tested?

Existing Unit tests.

Closes #24251 from gengliangwang/fallbackV1Rule.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 14:38:26 -07:00
Yuming Wang 0cbef34ede [MINOR][BUILD] Add ASF license header to plugins.sbt
## What changes were proposed in this pull request?

This PR add ASF license header to plugins.sbt, otherwise:
![image](https://user-images.githubusercontent.com/5399861/55273959-670b8800-530d-11e9-9b6f-214a3cde802e.png)

## How was this patch tested?
Warning disappears after adding ASF license header:
![image](https://user-images.githubusercontent.com/5399861/55273961-6c68d280-530d-11e9-9d15-5fb73a1b991e.png)

Closes #24248 from wangyum/plugins.sbt.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 12:47:02 -05:00
Yuming Wang 44b0d328e5 [MINOR] Update the scala version of LICENSE-binary to 2.12
## What changes were proposed in this pull request?

Update the scala version of `LICENSE-binary` to 2.12.

## How was this patch tested?

N/A

Closes #24250 from wangyum/LICENSE-binary.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 12:46:08 -05:00
liulijia f8fa564dec [SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores
## What changes were proposed in this pull request?
spark.task.cpus should be less or equal than spark.executor.cores when use static executor allocation
## How was this patch tested?
manual

Closes #24131 from liutang123/SPARK-27192.

Authored-by: liulijia <liutang123@yeah.net>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 12:38:05 -05:00
Sean Owen 2ec650d843 [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data
## What changes were proposed in this pull request?

(See JIRA for problem statement)

Update snappy 1.1.7.1 -> 1.1.7.3 to pick up an empty-stream and Java 9 fix.

There appear to be no other changes of consequence:
https://github.com/xerial/snappy-java/blob/master/Milestone.md

## How was this patch tested?

Existing tests

Closes #24242 from srowen/SPARK-27267.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 02:41:24 -05:00
Kent Yao f4c73b7c68 [SPARK-27301][DSTREAM] Shorten the FileSystem cached life cycle to the cleanup method inner scope
## What changes were proposed in this pull request?

The cached FileSystem's token will expire if no tokens explicitly are add into it.

```scala
19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 83189
19/03/28 13:40:16 INFO rdd.MapPartitionsRDD: Removing RDD 82860 from persistence list
19/03/28 13:40:16 INFO spark.ContextCleaner: Cleaned shuffle 6005
19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 82860
19/03/28 13:40:16 INFO scheduler.ReceivedBlockTracker: Deleting batches:
19/03/28 13:40:16 INFO scheduler.InputInfoTracker: remove old batch metadata: 1553750250000 ms
19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800
19/03/28 13:40:17 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800
19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800
19/03/28 13:40:17 WARN hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-1396157959_1] for 53 seconds. Will retry shortly ...
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy11.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy12.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:878)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
at java.lang.Thread.run(Thread.java:748)
```

This PR shorten the FileSystem cached life cycle to the cleanup method inner scope in case of token expiry.

## How was this patch tested?

existing ut

Closes #24235 from yaooqinn/SPARK-27301.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 02:35:49 -05:00
Sean Owen e6d8d0f13f [SPARK-27121][REPL] Resolve Scala compiler failure for Java 9+ in REPL
## What changes were proposed in this pull request?

Avoid trying to extract the classpath of the environment from a URLClassLoader in Java 11, as the default classloader isn't one. Use `java.class.path` instead.

## How was this patch tested?

Existing tests, manually tested under Java 11.

Closes #24239 from srowen/SPARK-27121.0.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 02:30:34 -05:00
10129659 144b35fe3a [SPARK-27320][SQL] Replacing index with iterator to traverse the expressions list in AggregationIterator, which make it simpler
## What changes were proposed in this pull request?
In AggregationIterator's loop function, we access the expressions by `expressions(i)`, the type of `expressions` is `::`, a subtype of list.

```
while (i < expressionsLength) {
      val func = expressions(i).aggregateFunction
```

This PR replacing  index with iterator to access the expressions list, which make it simpler.

## How was this patch tested?
Existing tests.

Closes #24238 from eatoncys/array.

Authored-by: 10129659 <chen.yanshan@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 02:27:12 -05:00
Takuya UESHIN f176dd3f28 [SPARK-27314][SQL] Deduplicate exprIds for Union.
## What changes were proposed in this pull request?

We have been having a potential problem with `Union` when the children have the same expression id in their outputs, which happens when self-union.

## How was this patch tested?

Modified some tests to adjust plan changes.

Closes #24236 from ueshin/issues/SPARK-27314/dedup_union.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-29 14:05:38 -07:00
Maxim Gekk 61561c1c2d [SPARK-27252][SQL][FOLLOWUP] Calculate min and max days independently from time zone in ComputeCurrentTimeSuite
## What changes were proposed in this pull request?

This fixes the `analyzer should replace current_date with literals` test in `ComputeCurrentTimeSuite` by making calculation of `min` and `max` days independent from time zone.

## How was this patch tested?

by `ComputeCurrentTimeSuite`.

Closes #24240 from MaxGekk/current-date-followup.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-29 14:28:36 -05:00
Ninad Ingole dbc7ce18b9 [SPARK-27244][CORE] Redact Passwords While Using Option logConf=true
## What changes were proposed in this pull request?

When logConf is set to true, config keys that contain password were printed in cleartext in driver log. This change uses the already present redact method in Utils, to redact all the passwords based on redact pattern in SparkConf and then print the conf to driver log thus ensuring that sensitive information like passwords is not printed in clear text.

## How was this patch tested?
This patch was tested through `SparkConfSuite` & then entire unit test through sbt

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #24196 from ninadingole/SPARK-27244.

Authored-by: Ninad Ingole <robert.wallis@example.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-29 14:16:53 -05:00
Maxim Gekk 06abd06112 [SPARK-27252][SQL] Make current_date() independent from time zones
## What changes were proposed in this pull request?

This makes the `CurrentDate` expression and `current_date` function independent from time zone settings. New result is number of days since epoch in `UTC` time zone. Previously, Spark shifted the current date (in `UTC` time zone) according the session time zone which violets definition of `DateType` - number of days since epoch (which is an absolute point in time, midnight of Jan 1 1970 in UTC time).

The changes makes `CurrentDate` consistent to `CurrentTimestamp` which is independent from time zone too.

## How was this patch tested?

The changes were tested by existing test suites like `DateExpressionsSuite`.

Closes #24185 from MaxGekk/current-date.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-28 18:44:08 -07:00
Xianyang Liu 50cded590f [MINOR] Move java file to java directory
## What changes were proposed in this pull request?

move
```scala
org.apache.spark.sql.execution.streaming.BaseStreamingSource
org.apache.spark.sql.execution.streaming.BaseStreamingSink
```
to java directory

## How was this patch tested?

Existing UT.

Closes #24222 from ConeyLiu/move-scala-to-java.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-28 12:11:00 -05:00
zhoukang 43bf4ae641 [SPARK-26914][SQL] Fix scheduler pool may be unpredictable when we only want to use default pool and do not set spark.scheduler.pool for the session
## What changes were proposed in this pull request?

When using fair scheduler mode for thrift server, we may have unpredictable result.
```
val pool = sessionToActivePool.get(parentSession.getSessionHandle)
if (pool != null) {
   sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, pool)
}
```
The cause is we use thread pool to execute queries for thriftserver, and when we call setLocalProperty we may have unpredictab behavior.

```
/**
   * Set a local property that affects jobs submitted from this thread, such as the Spark fair
   * scheduler pool. User-defined properties may also be set here. These properties are propagated
   * through to worker tasks and can be accessed there via
   * [[org.apache.spark.TaskContext#getLocalProperty]].
   *
   * These properties are inherited by child threads spawned from this thread. This
   * may have unexpected consequences when working with thread pools. The standard java
   * implementation of thread pools have worker threads spawn other worker threads.
   * As a result, local properties may propagate unpredictably.
   */
  def setLocalProperty(key: String, value: String) {
    if (value == null) {
      localProperties.get.remove(key)
    } else {
      localProperties.get.setProperty(key, value)
    }
  }
```

I post an example on https://jira.apache.org/jira/browse/SPARK-26914 .

## How was this patch tested?
UT

Closes #23826 from caneGuy/zhoukang/fix-scheduler-error.

Authored-by: zhoukang <zhoukang199191@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-28 09:24:16 -05:00
Wenchen Fan e4a968d829 [MINOR][CORE] Remove import scala.collection.Set in TaskSchedulerImpl
## What changes were proposed in this pull request?

I was playing with the scheduler and found this weird thing. In `TaskSchedulerImpl` we import `scala.collection.Set` without any reason. This is bad in practice, as it silently changes the actual class when we simply type `Set`, which by default should point to the immutable set.

This change only affects one method: `getExecutorsAliveOnHost`. I checked all the caller side and none of them need a general `Set` type.

## How was this patch tested?

N/A

Closes #24231 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-28 21:12:18 +09:00
Stavros Kontopoulos 39577a27a0 [SPARK-24902][K8S] Add PV integration tests
## What changes were proposed in this pull request?

- Adds persistent volume integration tests
- Adds a custom tag to the test to exclude it if it is run against a cloud backend.
- Assumes default fs type for the host, AFAIK that is ext4.

## How was this patch tested?
Manually run the tests against minikube as usual:
```
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test)  spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 192 milliseconds.
Run starting. Expected test count is: 16
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Test PVs with local storage
```

Closes #23514 from skonto/pvctests.

Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2019-03-27 13:00:56 -07:00
Gengliang Wang 49b0411549 [SPARK-27291][SQL] PartitioningAwareFileIndex: Filter out empty files on listing files
## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/23130, all empty files are excluded from target file splits in `FileSourceScanExec`.
In File source V2, we should keep the same behavior.

This PR suggests to filter out empty files on listing files in `PartitioningAwareFileIndex` so that the upper level doesn't need to handle them.
## How was this patch tested?

Unit test

Closes #24227 from gengliangwang/ignoreEmptyFile.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-27 10:08:38 -07:00
Daoyuan Wang f1fe805bed [SPARK-27279][SQL] Reuse subquery should compare child plan of SubqueryExec
## What changes were proposed in this pull request?

For now, `ReuseSubquery` in Spark compares two subqueries at `SubqueryExec` level, which invalidates the `ReuseSubquery` rule. This pull request fixes this, and add a configuration key for subquery reuse exclusively.

## How was this patch tested?

add a unit test.

Closes #24214 from adrian-wang/reuse.

Authored-by: Daoyuan Wang <me@daoyuan.wang>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-03-27 08:45:22 -07:00
Takeshi Yamamuro 956b52b167 [SPARK-26771][SQL][FOLLOWUP] Make all the uncache operations non-blocking by default
## What changes were proposed in this pull request?
To make the blocking behaviour consistent, this pr made catalog table/view `uncacheQuery` non-blocking by default. If this pr merged, all the behaviours in spark are non-blocking by default.

## How was this patch tested?
Pass Jenkins.

Closes #24212 from maropu/SPARK-26771-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-27 21:01:36 +09:00