Commit graph

24115 commits

Shahid 1853db3186 [SPARK-27125][SQL][TEST] Add test suite for sql execution page
## What changes were proposed in this pull request?
Added test suite for AllExecutionsPage class. Checked the scenarios for SPARK-27019 and SPARK-27075.

## How was this patch tested?
Added UT, manually tested

Closes #24052 from shahidki31/SPARK-27125.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-12 10:15:28 -05:00
Dilip Biswal b15423361b [SPARK-27016][SQL][BUILD][FOLLOW-UP] Treat all antlr warnings as errors while generating the parser.
## What changes were proposed in this pull request?
Use the sbt-antlr4 plugin option `antlr4TreatWarningsAsErrors` to make sure the warnings are treated as errors while generating the parser. In its absence, we may inadvertently introduce problems while making grammar changes. Please refer to PR-23897 for more context. We made a change in [pr-23925](https://github.com/apache/spark/pull/23925) which handled only the maven build.

In this PR, we handle the sbt build. I had submitted [PR-23](https://github.com/ihji/sbt-antlr4/pull/23) to enhance the sbt-antlr4 plugin to make it possible to pass the treat-warnings-as-errors option.

## How was this patch tested?
Force a warning in the grammar file to check that the build fails, then remove the warning to verify that the build succeeds.

Closes #24060 from dilipbiswal/sbt_build_antlr.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-11 21:27:51 -07:00
Ajith b8dd84b9e4 [SPARK-27011][SQL] reset command fails with cache
## What changes were proposed in this pull request?

When cache is enabled (i.e. once a cache table command has been executed), any following SQL statement will trigger
CacheManager#lookupCachedData, which creates a copy of the tree node and in turn calls TreeNode#makeCopy. The problem is that makeCopy tries to create a copy instance, but since ResetCommand is a case object, this fails.
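
A minimal stand-alone sketch of the failure mode (this is not Spark's actual `TreeNode` code, and `ResetLikeCommand` is an invented stand-in): a case object exposes no public constructor, so copy-by-constructor logic has nothing to invoke.

```scala
// Hypothetical stand-in for a case-object command such as ResetCommand.
case object ResetLikeCommand

// A case object compiles to a module class with a private constructor,
// so reflection sees no public constructors to re-invoke for a copy.
val ctors = ResetLikeCommand.getClass.getConstructors
assert(ctors.isEmpty, "makeCopy-style logic would find nothing to call here")
```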

## How was this patch tested?

Added UT to reproduce the issue

Closes #23918 from ajithme/reset.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-12 11:02:09 +08:00
Maxim Gekk 60be6d2ea3 [SPARK-27109][SQL] Refactoring of TimestampFormatter and DateFormatter
## What changes were proposed in this pull request?

In this PR, I propose to refactor the `parse()` method of `Iso8601DateFormatter`/`Iso8601TimestampFormatter` and the `toInstantWithZoneId` helper to achieve the following:
- Avoid unnecessary conversion of parsed input to `java.time.Instant` before converting it to micros and days. The necessary information already exists in `ZonedDateTime`, and micros/days can be extracted from it directly (see the sketch after this list).
- Avoid additional extraction of `LocalTime` from the parsed object, more precisely, a double query of `TemporalQueries.localTime` from `temporalAccessor`.
- Avoid additional extraction of the zone id from the parsed object, in particular, a double query of `TemporalQueries.offset()`.
- Use `ZoneOffset.UTC` instead of `ZoneId.of` in `DateFormatter`. This avoids looking up the zone offset by zone id.
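
As an illustration of the first point, a minimal `java.time` sketch (not the Spark code itself) of extracting micros and days from a parsed `ZonedDateTime` without going through `Instant`:

```scala
import java.time.ZonedDateTime

// Assumed example input; the point is that epoch seconds/nanos and the local
// date are already available on the ZonedDateTime produced by the parser.
val zdt = ZonedDateTime.parse("2019-03-11T19:02:30Z")
val micros = Math.addExact(Math.multiplyExact(zdt.toEpochSecond, 1000000L), zdt.getNano / 1000L)
val days = zdt.toLocalDate.toEpochDay
```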

## How was this patch tested?

By existing test suite `DateTimeUtilsSuite`, `TimestampFormatterSuite` and `DateFormatterSuite`.

Closes #24030 from MaxGekk/query-localtime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-11 19:02:30 -05:00
hongdongdong 1029bf9c35 Use variable instead of function to keep the format uniform
## What changes were proposed in this pull request?

The change just uses the variable (`_taskScheduler`) instead of the function (`taskScheduler`) to keep the format uniform across different call sites.

## How was this patch tested?


Closes #24048 from hddong/Use-variable-instead-of-function.

Authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-11 19:00:26 -05:00
Hyukjin Kwon 3725b1324f [SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner
## What changes were proposed in this pull request?

This PR proposes to have one base R runner.

In the high level,

Previously, it had `ArrowRRunner` and it inherited `RRunner`:

```
└── RRunner
    └── ArrowRRunner
```

After this PR, now it has a `BaseRRunner`, and `ArrowRRunner` and `RRunner` inherit `BaseRRunner`:

```
└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner
```

This way is consistent with Python's.

In more details, see below:

```scala
class BaseRRunner[IN, OUT] {

  def compute: Iterator[OUT] = {
    ...
    newWriterThread(...).start()
    ...
    newReaderIterator(...)
    ...
  }

  // Make a thread that writes data from JVM to R process
  abstract protected def newWriterThread(..., iter: Iterator[IN], ...): WriterThread

  // Make an iterator that reads data from the R process to JVM
  abstract protected def newReaderIterator(...): ReaderIterator

  abstract class WriterThread(..., iter: Iterator[IN], ...) extends Thread {
    override def run(): Unit = {
      ...
      writeIteratorToStream(...)
      ...
    }

    // Actually writing logic to the socket stream.
    abstract protected def writeIteratorToStream(dataOut: DataOutputStream): Unit
  }

  abstract class ReaderIterator extends Iterator[OUT] {
    override def hasNext(): Boolean = {
      ...
      read(...)
      ...
    }

    override def next(): OUT = {
      ...
      hasNext()
      ...
    }

    // Actually reading logic from the socket stream.
    abstract protected def read(...): OUT
  }
}
```

```scala
case [Arrow]RRunner extends BaseRRunner {
  override def newWriterThread(...) {
    new WriterThread(...) {
      override def writeIteratorToStream(...) {
        ...
      }
    }
  }

  override def newReaderIterator(...) {
    new ReaderIterator(...) {
      override def read(...) {
        ...
      }
    }
  }
}
```

## How was this patch tested?

Manually tested, and existing tests should cover it.

Closes #23977 from HyukjinKwon/SPARK-26923.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-12 08:45:29 +09:00
hehuiyuan fd1852b344 [MINOR][DOC] Fix spark.kubernetes.executor.label.[LabelName] parameter meaning
## What changes were proposed in this pull request?

It would be better to change the explanation for `spark.kubernetes.executor.label.[LabelName]`.
Before:
    Note that Spark also adds its own labels to the **driver pod** for bookkeeping purposes.

After modification:
   Note that Spark also adds its own labels to the **executor pod** for bookkeeping purposes.

Closes #24054 from hehuiyuan/hehuiyuan-patch-3.

Authored-by: hehuiyuan <hehuiyuan@ZBMAC-C02WD3K5H.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-11 16:26:47 -07:00
sychen 064604aaa7 [SPARK-27073][CORE] Fix a race condition when handling of IdleStateEvent
## What changes were proposed in this pull request?

When TransportChannelHandler processes an IdleStateEvent, it first checks whether the last request has timed out.
At exactly this time, TransportClient.sendRpc may initiate a new request.
TransportChannelHandler then sees responseHandler.numOutstandingRequests() > 0, causing a normal, healthy connection to be closed.

## How was this patch tested?

Closes #23989 from cxzl25/fix_IdleStateEvent_timeout.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-11 15:16:16 -07:00
Marcelo Vanzin f1e223bfa3 [SPARK-27004][CORE] Remove stale HTTP auth code.
This code is from the era when Spark used an HTTP server to distribute
dependencies, which is long gone. Nowadays it only causes problems when
someone is using dependencies from an HTTP server with Spark auth on.

Closes #24033 from vanzin/SPARK-27004.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-11 12:27:25 -07:00
chandulal.kavar d4542a8ba8 [SPARK-27061][K8S] Expose Driver UI port on driver service to access …
## What changes were proposed in this pull request?

Expose Spark UI port on driver service to access logs from service.

## How was this patch tested?

The patch was tested using unit tests being contributed as a part of the PR

Closes #23990 from chandulal/SPARK-27061.

Authored-by: chandulal.kavar <cckavar@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-11 10:41:31 -07:00
Wenchen Fan 31878c9daa [SPARK-27119][SQL] Do not infer schema when reading Hive serde table with native data source
## What changes were proposed in this pull request?

In Spark 2.1, we hit a correctness bug: when reading a Hive serde parquet table with the native parquet data source, if the actual file schema doesn't match the table schema in the Hive metastore (only an upper/lower case difference), the query returns 0 results.

The reason is that, the parquet reader is case sensitive. If we push down filters with column names that don't match the file physical schema case-sensitively, no data will be returned.

To fix this bug, there were 2 solutions proposed at that time:
1. Add a config to optionally disable parquet filter pushdown, and make parquet column pruning case insensitive.
https://github.com/apache/spark/pull/16797

2. Infer the actual schema from data files, when reading Hive serde table with native data source. A config is provided to disable it.
https://github.com/apache/spark/pull/17229

Solution 2 was accepted and merged into Spark 2.1.1.

In Spark 2.4, we refactored the parquet data source a little:
1. do parquet filter pushdown with the actual file schema.
https://github.com/apache/spark/pull/21696

2. make parquet filter pushdown case insensitive.
https://github.com/apache/spark/pull/22197

3. make parquet column pruning case insensitive.
https://github.com/apache/spark/pull/22148

With these patches, the correctness bug in Spark 2.1 no longer exists, and the schema inference becomes unnecessary.

To be safe, this PR just changes the default value to NEVER_INFER, so that users can set it back to INFER_AND_SAVE. If we don't receive any bug reports for it, we can remove the related code in the next release.
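
For reference, a hedged usage sketch of switching the mode back; the config key below is the existing `spark.sql.hive.caseSensitiveInferenceMode` option, assumed unchanged by this PR:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: users who still need schema inference from data files can
// restore the previous default explicitly.
val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
```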

## How was this patch tested?

existing tests

Closes #24041 from cloud-fan/infer.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-03-11 09:44:29 -07:00
Jagadesh Kiran d9978fb4e4 [SPARK-26860][PYSPARK][SPARKR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation
The docs describing RangeBetween & RowsBetween for PySpark & SparkR are not in sync with the Scala/Spark description.

a. Edited the PySpark and SparkR docs and made the description the same for both RangeBetween and RowsBetween
b. Created executable examples in both the PySpark and SparkR documentation
c. Locally tested the patch for Scala style checks and ran the UTs to check for no test case failures

Closes #23946 from jagadesh-kiran/master.

Authored-by: Jagadesh Kiran <jagadesh.n@in.verizon.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-11 08:53:09 -05:00
Ajith b98922abf2 [SPARK-27116] Environment tab must sort Hadoop Configuration by default
## What changes were proposed in this pull request?

The Environment tab in the Spark UI does not have the Hadoop Configuration sorted. All other tables on the same page, like Spark Configuration, System Configuration, etc., are sorted by key by default.

## How was this patch tested?

Manually tested on SparkUI

Closes #24038 from ajithme/sqluisort.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-11 08:43:49 -05:00
Wenchen Fan 8114b63d56 [SPARK-27117][SQL] current_date/current_timestamp should not refer to columns with ansi parser mode
## What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/19559 .

It revisits https://issues.apache.org/jira/browse/SPARK-27117 , which should be an invalid use case according to the SQL standard.

`current_date/current_timestamp` are reserved keywords, if users want to access columns named `current_date/current_timestamp`, they should quote the name like ```select `current_date` from tbl```

If ANSI mode is not enabled (which is the default), this PR won't introduce any changes.

## How was this patch tested?

a new test case

Closes #24039 from cloud-fan/current_datetime.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-11 20:40:03 +08:00
Dilip Biswal 1b9fd67904 [SPARK-27096][SQL] Reconcile the join types between data frame and sql interface
## What changes were proposed in this pull request?
Currently in the grammar file, we have the joinType rule defined as following :
```
joinType
    : INNER?
   ....
   ....
    | LEFT SEMI
    | LEFT? ANTI
    ;
```
The keyword LEFT is optional for ANTI join even though it's not optional for SEMI join. When
using the DataFrame interface, the join type "anti" is not allowed. The allowed types are "left_anti" or
"leftanti" for anti joins. ~~In this PR, I am making the LEFT keyword mandatory for ANTI joins so
it aligns better with the LEFT SEMI join in the grammar file and also the join types allowed from the DataFrame API.~~

This PR makes LEFT optional for SEMI join in the .g4 grammar and adds "semi" and "anti" as accepted join types in the DataFrame API.
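
A short usage sketch of the DataFrame side of this change (assuming the aliases land as described above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-types").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "v")
val right = Seq(1).toDF("id")

// Previously only "left_semi"/"leftsemi" and "left_anti"/"leftanti" were accepted.
left.join(right, Seq("id"), "semi").show()   // rows of `left` that have a match in `right`
left.join(right, Seq("id"), "anti").show()   // rows of `left` that have no match in `right`
```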

~~I have not opened any JIRA for this as probably we may need some discussion to see if
we are going to address this or not.~~

## How was this patch tested?
Modified the join type tests.

Closes #23982 from dilipbiswal/join_fix.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-11 14:02:21 +08:00
Takeshi Yamamuro 7a9537c338 [SPARK-21351][SQL] Remove the UpdateAttributeNullability rule from the optimizer
## What changes were proposed in this pull request?
This pr removed `UpdateAttributeNullability` from the optimizer because the same logic happens in the analyzer. See SPARK-26459(#23390) for more detailed discussion.

## How was this patch tested?
N/A

Closes #23508 from maropu/SPARK-21351.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-11 13:34:14 +08:00
Jungtaek Lim (HeartSaVioR) f0bde69ebc [MINOR][SQL] Throw better exception for Encoder with tuple more than 22 elements
## What changes were proposed in this pull request?

This patch proposes to throw a better exception with a better error message when encoding to a tuple with more than 22 elements.

**BEFORE**
```scala
scala> import org.apache.spark.sql.catalyst.encoders._
scala> val encoders = (0 to 22).map(_ => org.apache.spark.sql.Encoders.scalaInt.asInstanceOf[ExpressionEncoder[_]])
scala> ExpressionEncoder.tuple(encoders)
java.lang.ClassNotFoundException: scala.Tuple23
```

**AFTER**
```scala
scala> ExpressionEncoder.tuple(encoders)
java.lang.UnsupportedOperationException: Due to Scala's limited support of tuple, tuple with more than 22 elements are not supported.
```

## How was this patch tested?

Added UT.

Closes #24046 from HeartSaVioR/MINOR-throw-better-exception-for-tuple-more-than-22.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-11 13:44:45 +09:00
Yuming Wang 8ab13065f6 [SPARK-23807][FOLLOW-UP][BUILD][TEST-HADOOP3.1] Add test-hadoop3.1 phrase
## What changes were proposed in this pull request?

Add `test-hadoop3.1` phrase to test Spark against Spark’s Hadoop 3.1 profile.

## How was this patch tested?
Tested on jenkins. This is output:
```
[info] Using build tool sbt with Hadoop profile hadoop3.1 under environment amplab_jenkins
...
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:  -Phadoop-3.1 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103282/console

Closes #24045 from wangyum/SPARK-23807.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 11:15:58 +09:00
Hyukjin Kwon 93ea353cae [SPARK-26920][R] Deduplicate type checking across Arrow optimization in SparkR
## What changes were proposed in this pull request?

This PR proposes two things.

1.. Deduplicates the type checking logic. While I am here, I checked each type. Currently, binary type, float type, nested struct type and array type are not supported.

**For map and nested struct types:**

 it's expected to be unsupported per Spark's arrow optimization.

```
Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type: map<string,double>
...
```
```
Exception in thread "serve-Arrow" java.lang.UnsupportedOperationException: Unsupported data type:  struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
...
```

Please track the trace below to double check.

```
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowType(ArrowUtils.scala:56)
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowField(ArrowUtils.scala:92)
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.$anonfun$toArrowSchema$1(ArrowUtils.scala:116)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at scala.collection.TraversableLike.map(TraversableLike.scala:237)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
	at org.apache.spark.sql.execution.arrow.ArrowUtils$.toArrowSchema(ArrowUtils.scala:115)
	at org.apache.spark.sql.execution.arrow.ArrowBatchStreamWriter.<init>(ArrowConverters.scala:50)
	at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2(Dataset.scala:3215)
	at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$2$adapted(Dataset.scala:3212)
```

**For float and binary types:**

They cause corrupt values in some cases. It needs to be investigated separately.

**For array type:**

```
Error in Table__to_dataframe(x, use_threads = use_threads) :
  cannot handle Array of type list
```

Seems to be Arrow's R library limitation. It needs to be investigated separately as well.

2.. While I am touching the type specification code across the Arrow optimization, I moved the Arrow-optimization-related tests into a separate file called `test_arrow.R`.

## How was this patch tested?

Tests were added and also manually tested.

Closes #23969 from HyukjinKwon/SPARK-26920.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 10:23:24 +09:00
Gabor Somogyi 3729efb4d0 [SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs
## What changes were proposed in this pull request?

Avro has been a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark.

In this PR I've made them available from pyspark.

## How was this patch tested?

Please see the python API examples what I've added.

cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
Manual webpage check.

Closes #23797 from gaborgsomogyi/SPARK-26856.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 10:15:07 +09:00
Gabor Somogyi 29d9021245 [SPARK-24621][WEBUI] Show secure URLs on web pages
## What changes were proposed in this pull request?

Web UI URLs are pointing to `http://` targets even if SSL is enabled. In this PR I've changed the code to point to `https://` URLs.

## How was this patch tested?

Existing unit tests + manually by starting standalone master/worker/spark-shell. Please see jira.

Closes #23991 from gaborgsomogyi/SPARK-24621.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-10 19:28:35 -05:00
Yuming Wang eed3091a60 [SPARK-27120][BUILD][TEST] Upgrade scalatest version to 3.0.5
## What changes were proposed in this pull request?

**ScalaTest 3.0.5 Release Notes**

**Bug Fixes**

- Fixed the implicit view not available problem when used with compile macro.
- Fixed a stack depth problem in RefSpecLike and fixture.SpecLike under Scala 2.13.
- Changed Framework and ScalaTestFramework to set spanScaleFactor for Runner object instances for different Runners using different class loaders. This fixed a problem whereby an incorrect Runner.spanScaleFactor could be used when the tests for multiple sbt projects were run concurrently.
- Fixed a bug in endsWith regex matcher.

**Improvements**
- Removed duplicated parsing code for -C in ArgsParser.
- Improved performance in WebBrowser.
- Documentation typo rectification.
- Improve validity of Junit XML reports.
- Improved performance by replacing all .size == 0 and .length == 0 to .isEmpty.

**Enhancements**
- Added 'C' option to -P, which will tell -P to use cached thread pool.
- External Dependencies Update
- Bumped up scala-js version to 0.6.22.
- Changed to depend on mockito-core, not mockito-all.
- Bumped up jmock version to 2.8.3.
- Bumped up junit version to 4.12.
- Removed dependency to scala-parser-combinators.

More details:
http://www.scalatest.org/release_notes/3.0.5

## How was this patch tested?

manual tests on local machine:
```
nohup build/sbt clean -Djline.terminal=jline.UnsupportedTerminal -Phadoop-2.7  -Pkubernetes -Phive-thriftserver -Pyarn -Pspark-ganglia-lgpl -Phive -Pkinesis-asl -Pmesos test > run.scalatest.log &
```

Closes #24042 from wangyum/SPARK-27120.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-10 15:22:52 -07:00
Takeshi Yamamuro f0927d8ac4 [SPARK-27110][SQL] Moves some functions from AnalyzeColumnCommand to command/CommandUtils
## What changes were proposed in this pull request?
To reuse some common logics for improving `Analyze` commands (See the description of `SPARK-25196` for details), this pr moved some functions from `AnalyzeColumnCommand` to `command/CommandUtils`.  A follow-up pr will add code to extend `Analyze` commands for cached tables.

## How was this patch tested?
Existing tests.

Closes #22204 from maropu/SPARK-25196.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-10 15:17:46 -07:00
Hyukjin Kwon 28d003097b [SPARK-27102][R][PYTHON][CORE] Remove the references to Python's Scala codes in R's Scala codes
## What changes were proposed in this pull request?

Currently, R's Scala code happens to refer to Python's Scala code for code deduplication, which is a bit odd. For instance, when we face an exception from R, it shows a Python-related code path, which is confusing to debug. There should rather be one code base that R and Python share.

This PR proposes:

1. Make a `SocketAuthServer` and move `PythonServer` so that `PythonRDD` and `RRDD` can share it.
2. Move `readRDDFromFile` and `readRDDFromInputStream` into `JavaRDD`.
3. Reuse `RAuthHelper` and remove `RSocketAuthHelper` in `RRDD`.
4. Rename `getEncryptionEnabled` to `isEncryptionEnabled` while I am here.

So, now, the places below:

- `sql/core/src/main/scala/org/apache/spark/sql/api/r`
- `core/src/main/scala/org/apache/spark/api/r`
- `mllib/src/main/scala/org/apache/spark/ml/r`

no longer refer to Python's Scala code.

## How was this patch tested?

Existing tests should cover this.

Closes #24023 from HyukjinKwon/SPARK-27102.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-10 15:08:23 +09:00
Yuming Wang 470313e660 [SPARK-27118][SQL] Upgrade Hive Metastore Client to the latest versions for Hive 1.0.x/1.1.x
## What changes were proposed in this pull request?

Hive 1.1.1 and Hive 1.0.1 have been released. We should upgrade the Hive Metastore Client versions.

https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12329444&styleName=Text&projectId=12310843
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12329557&styleName=Text&projectId=12310843

## How was this patch tested?

N/A

Closes #24040 from wangyum/SPARK-27118.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-09 16:50:10 -08:00
Yuming Wang f732647ae4 [SPARK-27054][BUILD][SQL] Remove the Calcite dependency
## What changes were proposed in this pull request?

Calcite is only used for [runSqlHive](02bbe977ab/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (L699-L705)) when `hive.cbo.enable=true`([SemanticAnalyzer](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280)).
So we can disable `hive.cbo.enable` and remove the Calcite dependency.

## How was this patch tested?

Existing tests

Closes #23970 from wangyum/SPARK-27054.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <wgyumg@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-09 16:34:24 -08:00
Shixiong Zhu 6e1c0827ec [SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException
## What changes were proposed in this pull request?

Before a Kafka consumer gets assigned partitions, its offsets will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race where the epoch thread may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally.

To handle this case, this PR has the following changes:

- Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread).
- Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly.

I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up.

## How was this patch tested?

Jenkins

Closes #24034 from zsxwing/SPARK-27111.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-03-09 14:26:58 -08:00
Marco Gaido 25bcf59b3b [SPARK-25838][ML] Remove formatVersion from Saveable
## What changes were proposed in this pull request?

The `Saveable` interface introduces `formatVersion`, which is protected and used nowhere, so this PR proposes to remove it.

## How was this patch tested?

existing tests

Closes #22830 from mgaido91/SPARK-25838.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-09 09:44:20 -06:00
CodeGod a29df5fa02 [SPARK-27080][SQL] bug fix: mergeWithMetastoreSchema with uniform lower case comparison
## What changes were proposed in this pull request?
When reading parquet files while merging the metastore schema with the file schema, we should compare field names using a uniform case. The current implementation uses lowercase, but with one omission; this patch fixes it.
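
A hypothetical sketch of the intended comparison (the code below is invented for illustration and is not the patched `mergeWithMetastoreSchema` logic): the lookup key must be lowercased on both the metastore side and the file side.

```scala
import org.apache.spark.sql.types._

val metastoreSchema = StructType(Seq(StructField("ID", IntegerType), StructField("Name", StringType)))
val parquetSchema   = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

// Index the file fields by lowercased name, then resolve each metastore field
// against it case-insensitively, keeping the metastore-facing field name.
val parquetByLowerName = parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap
val merged = StructType(metastoreSchema.fields.map { ms =>
  parquetByLowerName.get(ms.name.toLowerCase).map(_.copy(name = ms.name)).getOrElse(ms)
})
```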

## How was this patch tested?
Unit test

Closes #24001 from codeborui/mergeSchemaBugFix.

Authored-by: CodeGod <>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-09 21:28:10 +08:00
Kris Mok 57ae251f75 [SPARK-27097] Avoid embedding platform-dependent offsets literally in whole-stage generated code
## What changes were proposed in this pull request?

Spark SQL performs whole-stage code generation to speed up query execution. There are two steps to it:
- Java source code is generated from the physical query plan on the driver. A single version of the source code is generated from a query plan, and sent to all executors.
  - It's compiled to bytecode on the driver to catch compilation errors before sending to executors, but currently only the generated source code gets sent to the executors. The bytecode compilation is for fail-fast only.
- Executors receive the generated source code and compile to bytecode, then the query runs like a hand-written Java program.

In this model, there's an implicit assumption about the driver and executors being run on similar platforms. Some code paths accidentally embedded platform-dependent object layout information into the generated code, such as:
```java
Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
```
This code expects a field to be at offset +24 of the `buffer` object, and sets a value to that field.
But whole-stage code generation generally uses platform-dependent information from the driver. If the object layout is significantly different on the driver and executors, the generated code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption.

One code pattern that leads to such problem is the use of `Platform.XXX` constants in generated code, e.g. `Platform.BYTE_ARRAY_OFFSET`.

Bad:
```scala
val baseOffset = Platform.BYTE_ARRAY_OFFSET
// codegen template:
s"Platform.putLong($buffer, $baseOffset, $value);"
```
This will embed the value of `Platform.BYTE_ARRAY_OFFSET` on the driver into the generated code.

Good:
```scala
val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
// codegen template:
s"Platform.putLong($buffer, $baseOffset, $value);"
```
This will generate the offset symbolically -- `Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)`, which will be able to pick up the correct value on the executors.

Caveat: these offset constants are declared as runtime-initialized `static final` in Java, so they're not compile-time constants from the Java language's perspective. It does lead to a slightly increased size of the generated code, but this is necessary for correctness.

NOTE: there can be other patterns that generate platform-dependent code on the driver which is invalid on the executors. e.g. if the endianness is different between the driver and the executors, and if some generated code makes strong assumption about endianness, it would also be problematic.

## How was this patch tested?

Added a new test suite `WholeStageCodegenSparkSubmitSuite`. This test suite needs to set the driver's extraJavaOptions to force the driver and executor use different Java object layouts, so it's run as an actual SparkSubmit job.

Authored-by: Kris Mok <kris.mok@databricks.com>

Closes #24031 from gatorsmile/cherrypickSPARK-27097.

Lead-authored-by: Kris Mok <kris.mok@databricks.com>
Co-authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-03-09 01:20:32 +00:00
liuxian 326fc746d3 [SPARK-27056][MESOS] Remove start-shuffle-service.sh
## What changes were proposed in this pull request?

`start-shuffle-service.sh` was only used by Mesos before `start-mesos-shuffle-service.sh` was introduced.
Obviously, `start-mesos-shuffle-service.sh` solves some problems and is better than `start-shuffle-service.sh`.
So we should now delete `start-shuffle-service.sh` in case users are still using it.

## How was this patch tested?
N/A

Closes #23975 from 10110346/rmshuffleservice_sh.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-08 18:51:38 -06:00
sandeep-katta 14f2286e56 [SPARK-27101][PYTHON] Drop the created database after the test in test_session
## What changes were proposed in this pull request?

Clean up the test case by dropping the created database after use

## How was this patch tested?

existing UT

Closes #24021 from sandeep-katta/cleanPythonTest.

Authored-by: sandeep-katta <sandeep.katta2007@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-09 09:12:33 +09:00
Sunitha Kambhampati bd2710bd79 [MINOR][SQL] Fix the typo in the spark.sql.extensions conf doc
## What changes were proposed in this pull request?
Fix the typo (missing the s) in the class name (SparkSessionExtensions) in the doc for the Spark conf spark.sql.extensions.

## How was this patch tested?
Verified by checking that the configuration doc shows up correctly in spark-shell using the SET -v

Closes #24020 from skambha/fixnametypo.

Authored-by: Sunitha Kambhampati <skambha@us.ibm.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-09 08:51:19 +09:00
SongYadong 14b1312727 [SPARK-27103][SQL][MINOR] List SparkSql reserved keywords in alphabet order
## What changes were proposed in this pull request?

This PR corrects the position of spark-sql reserved keywords in the list where they are not in alphabetical order.
Some repeated words are removed from the test suite, and some comments are added as reminders.

## How was this patch tested?

Existing unit tests.

Closes #23985 from SongYadong/sql_reserved_alphabet.

Authored-by: SongYadong <song.yadong1@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-08 10:51:39 -08:00
wangguangxin.cn d3d9c7bb0a [SPARK-27079][MINOR][SQL] Fix typo & Remove useless imports & Add missing override annotation
## What changes were proposed in this pull request?

1. Fix two typos
2. Remove useless imports in `CSVExprUtils.scala`
3. Add missing `override` annotation

## How was this patch tested?

Tested by existing UTs.

Closes #24000 from WangGuangxin/SPARK-27079.

Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-08 12:14:04 -06:00
Ryan Blue 6170e40c15 [SPARK-24252][SQL] Add v2 catalog plugin system
## What changes were proposed in this pull request?

This adds a v2 API for adding new catalog plugins to Spark.

* Catalog implementations extend `CatalogPlugin` and are loaded via reflection, similar to data sources
* `Catalogs` loads and initializes catalogs using configuration from a `SQLConf`
* `CaseInsensitiveStringMap` is used to pass configuration to `CatalogPlugin` via `initialize`

Catalogs are configured by adding config properties starting with `spark.sql.catalog.(name)`. The name property must specify a class that implements `CatalogPlugin`. Other properties under the namespace (`spark.sql.catalog.(name).(prop)`) are passed to the provider during initialization along with the catalog name.
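
A hypothetical configuration sketch of the scheme described above; the catalog name `prod`, the plugin class, and the extra `url` property are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // The value of spark.sql.catalog.(name) must be a class implementing CatalogPlugin.
  .config("spark.sql.catalog.prod", "com.example.ProdCatalogPlugin")
  // Properties under spark.sql.catalog.(name).* are handed to initialize() along with the catalog name.
  .config("spark.sql.catalog.prod.url", "thrift://metastore.example.com:9083")
  .getOrCreate()
```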

This replaces #21306, which will be implemented in multiple parts: the catalog plugin system (this commit) and the specific catalog APIs, like `TableCatalog`.

## How was this patch tested?

Added test suites for `CaseInsensitiveStringMap` and for catalog loading.

Closes #23915 from rdblue/SPARK-24252-add-v2-catalog-plugins.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-08 19:31:49 +08:00
Yuming Wang 2036074b99 [SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down
## What changes were proposed in this pull request?

[SPARK-24638](https://issues.apache.org/jira/browse/SPARK-24638) adds support for Parquet file `StartsWith` predicate push down.
`InMemoryTable` can also support this feature.

This is an example to explain how it works. Imagine that the `id` column is stored as below:

Partition ID | lowerBound | upperBound
-- | -- | --
p1 | '1' | '9'
p2 | '10' | '19'
p3 | '20' | '29'
p4 | '30' | '39'
p5 | '40' | '49'

A filter ```df.filter($"id".startsWith("2"))``` or ```id like '2%'```
then we take the prefix (substr) of lowerBound and upperBound, with the same length as the pattern prefix:

Partition ID | lowerBound.substr(0, Length("2")) | upperBound.substr(0, Length("2"))
-- | -- | --
p1 | '1' | '9'
p2 | '1' | '1'
p3 | '2' | '2'
p4 | '3' | '3'
p5 | '4' | '4'

We can see that we only need to read `p1` and `p3`.
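
A minimal sketch of the pruning rule implied by the tables above (an assumed form, not the actual scan-statistics code): a partition can be skipped when the pattern prefix falls outside the prefix range of its bounds.

```scala
// A partition may contain rows matching `prefix` only if the prefix lies
// between the equally long prefixes of its lower and upper bounds.
def mayContainPrefix(lower: String, upper: String, prefix: String): Boolean = {
  val n = prefix.length
  lower.take(n) <= prefix && prefix <= upper.take(n)
}

mayContainPrefix("1", "9", "2")    // true  -> read p1
mayContainPrefix("10", "19", "2")  // false -> skip p2
mayContainPrefix("20", "29", "2")  // true  -> read p3
mayContainPrefix("30", "39", "2")  // false -> skip p4
```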

## How was this patch tested?

 unit tests and benchmark tests

benchmark test result:
```
================================================================================================
Pushdown benchmark for StringStartsWith
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU  2.90GHz
StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                    12068 / 14198          1.3         767.3       1.0X
InMemoryTable Vectorized (Pushdown)           5457 / 8662          2.9         347.0       2.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU  2.90GHz
StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                      5246 / 5355          3.0         333.5       1.0X
InMemoryTable Vectorized (Pushdown)           2185 / 2346          7.2         138.9       2.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU  2.90GHz
StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                      5112 / 5312          3.1         325.0       1.0X
InMemoryTable Vectorized (Pushdown)           2292 / 2522          6.9         145.7       2.2X
```

Closes #23004 from wangyum/SPARK-26004.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-08 19:18:32 +08:00
Sean Owen 5ebb4b5723 [SPARK-24783][SQL] spark.sql.shuffle.partitions=0 should throw exception
## What changes were proposed in this pull request?

Throw an exception if spark.sql.shuffle.partitions=0
This takes over https://github.com/apache/spark/pull/23835

## How was this patch tested?

Existing tests.

Closes #24008 from srowen/SPARK-24783.2.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: WindCanDie <491237260@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-08 14:09:53 +09:00
Jungtaek Lim (HeartSaVioR) d8f77e11a4 [SPARK-27001][SQL][FOLLOWUP] Address primitive array type for serializer
## What changes were proposed in this pull request?

This is follow-up PR which addresses review comment in PR for SPARK-27001:
https://github.com/apache/spark/pull/23908#discussion_r261511454

This patch addresses the primitive array type in the serializer: instead of falling back to the generic path, Spark now handles it efficiently as a primitive array.

## How was this patch tested?

UT modified to include primitive array.

Closes #24015 from HeartSaVioR/SPARK-27001-FOLLOW-UP-java-primitive-array.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-08 11:54:04 +08:00
Yuming Wang 43dcb91a4c [SPARK-19678][FOLLOW-UP][SQL] Add behavior change test when table statistics are incorrect
## What changes were proposed in this pull request?

Since Spark 2.2.0 ([SPARK-19678](https://issues.apache.org/jira/browse/SPARK-19678)), the below SQL changed from `broadcast join` to `sort merge join`:
```sql
-- small external table with incorrect statistics
CREATE EXTERNAL TABLE t1(c1 int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'file:///tmp/t1'
TBLPROPERTIES (
'rawDataSize'='-1', 'numFiles'='0', 'totalSize'='0', 'COLUMN_STATS_ACCURATE'='false', 'numRows'='-1'
);

-- big table
CREATE TABLE t2 (c1 int)
LOCATION 'file:///tmp/t2'
TBLPROPERTIES (
'rawDataSize'='23437737', 'numFiles'='12222', 'totalSize'='333442230', 'COLUMN_STATS_ACCURATE'='false', 'numRows'='443442223'
);

explain SELECT t1.c1 FROM t1 INNER JOIN t2 ON t1.c1 = t2.c1;
```
This pr adds a test case for this behavior change.

## How was this patch tested?

unit tests

Closes #24003 from wangyum/SPARK-19678.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-08 11:47:49 +08:00
Yuming Wang d70b6a39e1 [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
## What changes were proposed in this pull request?

This pr adds 2 maven properties to help us upgrade the built-in Hive.

| Property Name | Default | In future |
| ------ | ------ | ------ |
| hive.classifier | (none) | core |
| hive.parquet.group | com.twitter | org.apache.parquet |

## How was this patch tested?

existing tests

Closes #23996 from wangyum/add_2_maven_properties.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-07 16:46:07 -06:00
Shahid 713646ddc2 [SPARK-27075] Remove duplicate execution tag parameters from the url, when accessing the execution table in the SQL page
## What changes were proposed in this pull request?

When we sort any column in the execution table of the SQL page in the Web UI, it throws IllegalArgumentException. The root cause is that, in the URL, we are duplicating the execution tag parameters in the 'parameterPath'. Instead, we should filter out the executionTag-related entries while getting the 'parameterOtherTable' (see the sketch after the code links below):
e9e8bb33ef/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (L161-L163)
e9e8bb33ef/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (L241)
e9e8bb33ef/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (L263-L266)
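
A minimal sketch of the fix's idea (parameter names invented for illustration): entries belonging to the execution table are filtered out before the remaining parameters are re-appended to the URL.

```scala
val executionTag = "failed"
val allParams = Map(
  "failed.sort"    -> "Duration",
  "failed.page"    -> "2",
  "completed.sort" -> "ID")

// Keep only parameters of the other tables; the execution table's own
// sort/page parameters are added separately, so they must not appear twice.
val parameterOtherTable = allParams
  .filter { case (name, _) => !name.startsWith(executionTag) }
  .map { case (name, value) => s"$name=$value" }
  .mkString("&")
// parameterOtherTable == "completed.sort=ID"
```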

## How was this patch tested?
Manually tested
Test steps:
Sort any column in the sql page execution table
Before fix:
![screenshot from 2019-03-07 01-38-17](https://user-images.githubusercontent.com/23054875/53913261-f0b69580-4080-11e9-88ea-f238b47a21d5.png)

After fix:
![screenshot from 2019-03-07 02-01-40](https://user-images.githubusercontent.com/23054875/53913285-01670b80-4081-11e9-81b6-78cdbf5a0817.png)

Closes #23994 from shahidki31/SPARK-27075.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-07 12:52:46 -08:00
Gabor Somogyi 98a8725e66 [SPARK-27022][DSTREAMS] Add kafka delegation token support.
## What changes were proposed in this pull request?

It adds Kafka delegation token support for DStreams. Please be aware that, since a Kafka native sink is not available for DStreams, this PR contains delegation token usage only on the consumer side.

What this PR contains:
* Usage of token through dynamic JAAS configuration
* `KafkaConfigUpdater` moved to `kafka-0-10-token-provider`
* `KafkaSecurityHelper` functionality moved into `KafkaTokenUtil`
* Documentation

## How was this patch tested?

Existing unit tests + on cluster.

Long running Kafka to file tests on 4 node cluster with randomly thrown artificial exceptions.

Test scenario:

* 4 node cluster
* Yarn
* Kafka broker version 2.1.0
* security.protocol = SASL_SSL
* sasl.mechanism = SCRAM-SHA-512

Kafka broker settings:

* delegation.token.expiry.time.ms=600000 (10 min)
* delegation.token.max.lifetime.ms=1200000 (20 min)
* delegation.token.expiry.check.interval.ms=300000 (5 min)

After each 7.5 minutes new delegation token obtained from Kafka broker (10 min * 0.75).
When a token expired after 10 minutes (Spark obtains a new one and doesn't renew the old), the broker's expiry thread runs every 5 minutes (invalidating expired tokens) and an artificial exception was thrown inside the Spark application (in which case Spark closes the connection); the latest delegation token was then picked up correctly.

cd docs/
SKIP_API=1 jekyll build
Manual webpage check.

Closes #23929 from gaborgsomogyi/SPARK-27022.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-07 11:36:37 -08:00
Bryan Cutler ddc2052ebd [SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF
## What changes were proposed in this pull request?

This change adds support for returning StructType from a scalar Pandas UDF, where the return value of the function is a pandas.DataFrame. Nested structs are not supported and an error will be raised; child types can be any other currently supported type.

## How was this patch tested?

Added additional unit tests to `test_pandas_udf_scalar`

Closes #23900 from BryanCutler/pyspark-support-scalar_udf-StructType-SPARK-23836.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-03-07 08:52:24 -08:00
Takeshi Yamamuro 315c95c399 [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
## What changes were proposed in this pull request?
`CodeGenerator.updateAndGetCompilationStats` throws an unsupported exception for empty code size statistics. This pr added code to check if it is empty or not.
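
A minimal sketch of the failure mode and the guard (not the actual `CodeGenerator` code):

```scala
val codeSizes: Seq[Int] = Seq.empty
// Calling codeSizes.max on an empty sequence throws
// java.lang.UnsupportedOperationException: empty.max
val maxCodeSize = if (codeSizes.nonEmpty) codeSizes.max else 0
```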

## How was this patch tested?
Pass Jenkins.

Closes #23947 from maropu/SPARK-21871-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-03-07 17:25:22 +09:00
Dilip Biswal a0e26cffc5 [MINOR][SQL][TEST] Include usage example for generating output for single test in SQLQueryTestSuite
## What changes were proposed in this pull request?
This is a very minor pr to include a usage example for generating the output of a single test in SQLQueryTestSuite. I tried to deduce it from the existing example and ran into a scenario
where sbt simply loops, running the same test over and over again. Here is the example
of running a single test.

```
build/sbt "~sql/test-only *SQLQueryTestSuite -- -z inline-table.sql"
```
I tried to generate the output for a single test by prepending `SPARK_GENERATE_GOLDEN_FILES=1` like following
```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "~sql/test-only *SQLQueryTestSuite -- -z describe.sql"
```
In this case I found that sbt loops, trying to run describe.sql over and over again, because we are running the test in continuous mode (due to the `~` prefix), where it detects a change in
the generated result file, which in turn triggers a build and test. I have included an example where
we don't run it in continuous mode when generating the output. Hopefully it saves other developers some time.
## How was this patch tested?
Verified manually in my dev setup.

Closes #23995 from dilipbiswal/dkb_sqlquerytest_usage.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-07 13:06:23 +09:00
Gengliang Wang a543f917e0 [SPARK-27049][SQL] Create util class to support handling partition values in file source V2
## What changes were proposed in this pull request?

While migrating other data sources, I found that we should abstract the logic of:
1. converting safe `InternalRow`s into `UnsafeRow`s
2. appending partition values to the end of the result row, if they exist

This PR proposes to support handling partition values in file source v2 abstraction by adding a util class `PartitionReaderWithPartitionValues`.
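
A rough sketch of the idea (the class and parameter names below are assumptions, not the merged utility): join each data row with the partition values and convert to `UnsafeRow` through one reusable projection.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.types.StructType

class AppendPartitionValues(
    rows: Iterator[InternalRow],
    readSchema: StructType,
    partitionSchema: StructType,
    partitionValues: InternalRow) extends Iterator[InternalRow] {

  private val joined = new JoinedRow()
  private val toUnsafe = UnsafeProjection.create(StructType(readSchema.fields ++ partitionSchema.fields))

  override def hasNext: Boolean = rows.hasNext
  // Append the partition values after the data columns and return an UnsafeRow.
  override def next(): InternalRow = toUnsafe(joined(rows.next(), partitionValues))
}
```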

## How was this patch tested?

Existing unit tests

Closes #23987 from gengliangwang/SPARK-27049.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-07 11:24:15 +08:00
Brooke Wenig 340c8b8387 [MINOR][DOC] Updated PySpark Binarizer docstring to match Scala's.
## What changes were proposed in this pull request?

PySpark's Binarizer docstring had two issues:
1) The values did not need to be in the range [0, 1].
2) It can be used for binary classification prediction.

This change corrects both of these issues by making it consistent with Scala's docstring for Binarizer.

## How was this patch tested?

Not applicable because I only changed the docstring. But if I need to do any testing, let me know and I'll do it.


Closes #23934 from brookewenig/binarizer-docs-fix.

Authored-by: Brooke Wenig <brookewenig@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-06 19:42:41 -06:00
Yuming Wang 32848eecc5 [SPARK-27078][SQL] Fix NoSuchFieldError when read Hive materialized views
## What changes were proposed in this pull request?

This pr fixes a `NoSuchFieldError` when reading Hive materialized views with Hive metastore 2.3.4.

How to reproduce:
Hive side:
```sql
CREATE TABLE materialized_view_tbl (key INT);
CREATE MATERIALIZED VIEW view_1 DISABLE REWRITE AS SELECT * FROM materialized_view_tbl;
```
Spark side:
```java
bin/spark-sql --conf spark.sql.hive.metastore.version=2.3.4 --conf spark.sql.hive.metastore.jars=maven

spark-sql> select * from view_1;
19/03/05 19:55:37 ERROR SparkSQLDriver: Failed in [select * from view_1]
java.lang.NoSuchFieldError: INDEX_TABLE
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$3(HiveClientImpl.scala:438)
	at scala.Option.map(Option.scala:163)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$1(HiveClientImpl.scala:370)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:277)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:215)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:214)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:260)
	at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:368)
```

## How was this patch tested?

unit tests

Closes #23984 from wangyum/SPARK-24360.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-06 16:56:32 -08:00
Maxim Gekk 9513d82edd [SPARK-27057][SQL] Common trait for limit exec operators
## What changes were proposed in this pull request?

I would like to refactor `limit.scala` slightly and introduce a common trait `LimitExec` for `CollectLimitExec` and `BaseLimitExec` (`LocalLimitExec` and `GlobalLimitExec`). This allows distinguishing those operators from others and getting the `limit` value without casting to a concrete class.
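
A rough sketch of the shape of the proposed trait (assumed, not the merged code):

```scala
import org.apache.spark.sql.execution.UnaryExecNode

// Common parent for the limit operators, so callers can pattern-match on
// "any limit operator" and read `limit` directly.
trait LimitExec extends UnaryExecNode {
  /** Number of rows to take. */
  def limit: Int
}
```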

## How was this patch tested?

by existing test suites.

Closes #23976 from MaxGekk/limit-exec.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-07 08:47:52 +08:00